O(n) Streaming: Optimizing LLM JSON Parsing in Go

A deep dive into building a zero-allocation, resumable JSON streaming parser for LLM applications

go performance llm streaming state-machines

The Problem: The Hidden O(n²) Bottleneck in Streaming

When we talk about LLM “streaming,” we’re usually receiving data in small chunks via SSE (Server-Sent Events). The standard approach to handling this in Go often looks something like this:

  1. Receive a new chunk.
  2. Append it to an accumulator buffer.
  3. Call json.Unmarshal(accumulator, &target).
  4. Update the UI/state.

While simple, this pattern hides a performance disaster. On every new chunk, the library re-scans the entire accumulated response:

Chunk 1: {"content": "H"           → Parse 17 bytes
Chunk 2: {"content": "He"          → Parse 18 bytes  
Chunk 3: {"content": "Hel"         → Parse 19 bytes
...
Chunk N: {"content": "Hello..."}   → Parse 17+N bytes

For a 10KB response delivered in 100 chunks of roughly 100 bytes each, you aren't parsing 10KB once. You re-scan the growing buffer on every chunk, which works out to 1 + 2 + ... + 100 ≈ 5,050 chunk-sized scans, or roughly 500KB of parsing work for 10KB of data, and the cost grows as O(n²) in the response size.
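
For reference, here is a minimal sketch of that accumulate-and-reparse loop. The chunk channel, target struct, and the final print are illustrative, not taken from any particular client library, and the snippet assumes the encoding/json and fmt imports:

// Naive accumulate-and-reparse loop: every chunk triggers a full
// json.Unmarshal over everything received so far, so total parsing work
// grows quadratically with the number of chunks.
func naiveStream(chunks <-chan []byte) {
	var acc []byte
	var target struct {
		Content string `json:"content"`
	}
	for chunk := range chunks {
		acc = append(acc, chunk...)
		// Errors are expected while the JSON is still incomplete, so the
		// result is only used when a parse succeeds.
		if err := json.Unmarshal(acc, &target); err == nil {
			fmt.Print(target.Content) // illustrative UI update
		}
	}
}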

The Solution: A Resumable State Machine

To solve this, I built go-llm-stream. Instead of re-parsing, it uses a byte-level state machine that processes each byte exactly once.

State Machine Architecture

The core insight is that JSON parsing can be expressed as a finite state machine with 27 distinct states. Here’s the high-level transition logic:

'"'
'-'
'0'
'1-9'
't'
'f'
'n'
'\\'
valid escape
'"'
'.'
digit
'e/E'
digit
non-digit
BeginValue
InString
Neg
State0
State1
StateT
StateF
StateN
InStringEsc
EndValue
StateDot
StateDot0
StateE
StateE0

The 27 States Explained

The scanner tracks exactly where parsing stopped, organized into logical groups:

// From scanner/state.go - All 27 states mapped from Go stdlib
const (
    // Value Entry States
    StateBeginValue         State = iota // Start of any value
    StateBeginValueOrEmpty               // After '[', expecting value or ']'
    StateBeginStringOrEmpty              // After '{', expecting key or '}'
    StateBeginString                     // Expecting object key

    // String Parsing States (6)
    StateInString      // Inside string literal
    StateInStringEsc   // After '\' in string
    StateInStringEscU  // After '\u'
    StateInStringEscU1 // After '\uX'
    StateInStringEscU2 // After '\uXX'
    StateInStringEscU3 // After '\uXXX'

    // Number Parsing States (8)
    StateNeg   // After '-'
    State0     // After leading '0'
    State1     // After non-zero digit
    StateDot   // After decimal point '.'
    StateDot0  // After decimal digits
    StateE     // After 'e' or 'E'
    StateESign // After exponent sign
    StateE0    // After exponent digits

    // Literal Parsing States - true/false/null
    StateT    // after 't'
    StateTr   // after 'tr'
    StateTru  // after 'tru'; the final 'e' completes true
    StateF    // after 'f'
    StateFa   // after 'fa'
    StateFal  // after 'fal'
    StateFals // after 'fals'; the final 'e' completes false
    StateN    // after 'n'
    StateNu   // after 'nu'
    StateNul  // after 'nul'; the final 'l' completes null

    // Completion States
    StateEndValue // After completing a value
    StateEndTop   // After top-level value complete
    StateError    // Error state (terminal)
)
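
To make those states concrete, here is a minimal sketch of a byte-level step function. It is not the library's actual code; only the string and negative-number transitions are shown:

// step advances the scanner by exactly one byte and returns the next state.
func step(s State, c byte) State {
	switch s {
	case StateInString:
		switch c {
		case '\\':
			return StateInStringEsc // start of an escape sequence
		case '"':
			return StateEndValue // closing quote ends the string value
		default:
			return StateInString // ordinary string byte, stay put
		}
	case StateNeg:
		if c == '0' {
			return State0
		}
		if c >= '1' && c <= '9' {
			return State1
		}
		return StateError // '-' must be followed by a digit
	default:
		// The remaining states follow the same pattern in the full scanner;
		// they are elided in this sketch.
		return StateError
	}
}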

How Resumability Works

When a chunk arrives, the scanner picks up exactly where it halted—mid-string, mid-number, or mid-object—and continues:

// The scanner "remembers" where it left off
for token := range healer.Tokens() {
    process(token)
}

This transforms the complexity from O(n²) to linear O(n).
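
As an illustration of what that means for the caller, here is a minimal sketch that reuses the step function above. The ResumableScanner type and its Feed method are assumptions for this post, not the library's exact API:

// The scanner carries its current State as a field, so each Feed call
// resumes exactly where the previous chunk stopped (mid-string, mid-number, ...).
type ResumableScanner struct {
	state State
}

func (s *ResumableScanner) Feed(chunk []byte) {
	for _, c := range chunk {
		s.state = step(s.state, c) // each byte is examined exactly once
	}
}

// Usage: total work is proportional to total bytes, not bytes × chunks.
//
//	sc := &ResumableScanner{state: StateBeginValue}
//	for chunk := range chunks {
//	    sc.Feed(chunk)
//	}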

Engineering for Zero Allocations

Performance in Go isn’t just about algorithmic complexity; it’s about the Garbage Collector (GC). Every allocation is a potential GC pause. go-llm-stream makes extensive use of sync.Pool to recycle buffers and minimize heap allocations.
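
The pattern looks roughly like this. It is a minimal sketch using bytes.Buffer, not the library's actual code, and assumes the bytes, io, and sync imports:

// A pooled scratch buffer: Get reuses a previously returned buffer whenever
// one is available, so the steady-state hot path allocates nothing new.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func emitToken(w io.Writer, raw []byte) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()    // reuse the existing backing array
	buf.Write(raw) // build the token without a fresh allocation
	_, err := w.Write(buf.Bytes())
	bufPool.Put(buf) // hand the buffer back for the next token
	return err
}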

Benchmark Results

Here are the actual benchmark results from the library (tested on AMD Ryzen 5 7600X):

Benchmark       Operations    ns/op    MB/s     B/op   allocs/op
ScannerSmall    13,636,113    85.37    316.28   0      0
ScannerMedium   131,680       8,481    318.60   0      0
Closer_Feed     6,425,085     180.5    448.84   0      0
StripMarkdown   1,709,116     700.4    115.64   0      0

Key Insight: The core byte-level scanner achieves zero allocations and processes JSON at over 316 MB/s, validating the O(n) design goal.
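
The MB/s, B/op, and allocs/op columns come straight from Go's benchmark tooling. A benchmark of roughly this shape, using the illustrative ResumableScanner sketch from earlier and a made-up fixture, reports them when run with go test -bench . -benchmem:

func BenchmarkScannerSmall(b *testing.B) {
	input := []byte(`{"content": "Hello, world"}`) // illustrative fixture
	b.SetBytes(int64(len(input)))                  // enables the MB/s column
	b.ReportAllocs()                               // enables B/op and allocs/op
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sc := ResumableScanner{state: StateBeginValue}
		sc.Feed(input) // the measured hot path
	}
}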

Memory Stability Under Load

The stress test processes 100,000 JSON objects (800,001 tokens) in a single continuous run.

This demonstrates constant memory overhead regardless of input size.
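
One way to sanity-check that claim, shown here as a standalone sketch rather than the project's actual stress test, is to compare heap usage before and after the run with runtime.ReadMemStats:

// heapDelta reports how much the live heap grew across a workload; for a
// pooled, constant-memory pipeline the delta should stay near zero even as
// the number of processed objects grows.
func heapDelta(run func()) int64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	run() // e.g. stream 100,000 JSON objects through the healer
	runtime.GC()
	runtime.ReadMemStats(&after)
	return int64(after.HeapAlloc) - int64(before.HeapAlloc)
}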

Auto-Healing: The “Magic” Sauce

One common issue with LLM outputs is that they can be cut off due to token limits or network errors, leaving you with invalid JSON like:

{"status": "succe

Before: Broken JSON

// Traditional approach - fails on truncated input
resp := `{"status": "succe`
var result map[string]any
json.Unmarshal([]byte(resp), &result) // ❌ Error: unexpected end of JSON input

After: Auto-Healed JSON

// With go-llm-stream Healer
healer := stream.NewHealer(ctx, strings.NewReader(resp))
for token := range healer.Tokens() {
    // ✅ Outputs valid tokens, auto-closes the string and object
}
// Result: {"status": "succe"}  ← Valid JSON!

The Healer maintains a stack of open delimiters ({, [, ") and automatically closes them when the stream ends prematurely:

Input Stream → Markdown Filter → Scanner → Closer Stack → "Stream complete?"
  Yes → Emit Closure Tokens → Valid JSON Output
  No  → Continue Parsing (back to the Scanner)
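
The closing step itself is simple to sketch. This is an illustration of the idea, not the library's Closer implementation: push every opening delimiter as it is seen, and when the stream ends early, pop the stack to emit the matching closers.

// closeOpen returns the bytes needed to turn a truncated prefix into valid
// JSON by closing whatever is still open, innermost first.
func closeOpen(stack []byte, inString bool) []byte {
	var out []byte
	if inString {
		out = append(out, '"') // terminate an unfinished string literal first
	}
	for i := len(stack) - 1; i >= 0; i-- {
		switch stack[i] {
		case '{':
			out = append(out, '}')
		case '[':
			out = append(out, ']')
		}
	}
	return out
}

// Example: closeOpen([]byte{'{'}, true) == []byte(`"}`), which is exactly
// what heals the truncated {"status": "succe above.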

Healer Configuration Options

// From healer/healer.go
type HealerOptions struct {
    StripMarkdown      bool // Filter ```json code blocks
    AutoClose          bool // Auto-close unclosed containers
    IgnoreTrailingJunk bool // Ignore content after root closes
    CompleteStrings    bool // Close unterminated strings
    CompleteLiterals   bool // Complete partial true/false/null
}
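
As an illustration only: the constructor that accepts these options is not shown above, so the NewHealerWithOptions name and signature below are assumptions rather than the library's documented API.

// Hypothetical wiring; the real constructor name and package may differ.
opts := HealerOptions{
	StripMarkdown:      true, // drop ```json fences around the payload
	AutoClose:          true, // close unterminated { and [ at end of stream
	IgnoreTrailingJunk: true, // ignore prose after the root value closes
	CompleteStrings:    true, // terminate a cut-off string literal
	CompleteLiterals:   true, // finish partial true/false/null
}
h := NewHealerWithOptions(ctx, resp.Body, opts) // hypothetical constructor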

Comparison with Standard Library

With encoding/json, the usual accumulate-and-reparse pattern re-parses the entire buffer on every chunk. For a 1MB response delivered in 2KB chunks (about 500 of them), that means scanning roughly 250MB of data in total:

Approach        Parse Operations                 Complexity
encoding/json   ~500 full parses (2KB chunks)    O(n²)
go-llm-stream   1 incremental parse              O(n)

Real-World Usage

import "github.com/camilbenameur/go-llm-stream/stream"

// Connect to LLM API with SSE streaming
resp, _ := http.Get("https://api.openai.com/v1/chat/completions")
defer resp.Body.Close()

// Automatically fix truncated JSON and strip markdown
healer := stream.NewHealer(ctx, resp.Body)
defer healer.Close()

for token := range healer.Tokens() {
    // Process tokens as they arrive with zero re-parsing
    fmt.Print(token.Value)
}

Conclusion: Making Infrastructure Invisible

Good infrastructure should be invisible. By optimizing the parsing layer, we reduce CPU usage and latency, making LLM-driven applications feel more responsive.

The key engineering decisions that made this possible:

  1. State Machine Design: 27 enumerated states replace function pointers for resumability
  2. Zero-Allocation Hot Path: sync.Pool for all buffer recycling
  3. Defensive Healing: Stack-based delimiter tracking for robust error recovery
  4. O(n) Guarantee: Each byte processed exactly once

If you’re building Go applications that rely on high-volume LLM streaming, check out the project on GitHub: go-llm-stream


I’m currently exploring the intersection of distributed systems and AI infrastructure. Feel free to connect with me on LinkedIn to discuss technical architecture and engineering challenges.
