Building a GPT-2 Tokenizer in Go
There are easier ways to spend your weekends. Building a GPT-2 tokenizer from scratch is not one of them. But somewhere between reading Karpathy’s “Zero to Hero”, benchmarking Go’s runtime behavior at 2AM, and debating whether my slice reallocation was “morally acceptable”, I fell down a rabbit hole.
This post is the story of how that happened: what I built, what broke, what I optimized, and what I learned about systems engineering by reconstructing one of the most universally used components in modern LLMs: the tokenizer.
The BPE Tokenizer
TODO: add background, and what we are working with (the dataset specifically)
Why Build a Tokenizer at All?
If you’re interested in LLM infrastructure, tokenizers are not optional. They sit at the front of every inference and training pipeline and dictate throughput, correctness, and latency.
I wanted to understand what was actually happening inside. So I rebuilt it - byte-pair encoding, vocab parsing, merges, greedy selection, token mapping, streaming semantics, all of it.
In Go.
Because why not.
The Goal
The goal wasn’t just to encode text.
It was to build:
A streaming-friendly GPT-2 tokenizer in Go, with exact round-trip parity.
I wanted:
- No hidden allocations.
- No unnecessary copies.
- Streaming support for chunked text.
- Benchmarks.
I quickly learned: this is way harder than it sounds.
The Underestimate: BPE is simple, right?
Before this project, I thought tokenization was basically:
- Load vocab.
- Greedily merge pairs.
- Done.
I was wrong.
What’s usually glossed over is that a real BPE merge loop is genuinely involved (there’s a sketch of the textbook version after this list). It needs:
- priority queues
- adjacency maintenance
- invalidation semantics
- versioning to prevent stale merges
- dealing with arbitrary Unicode byte sequences
- ensuring determinism
- avoiding pathological quadratic behavior
- and in streaming mode, dealing with chunk boundaries, which are an entire horror movie of their own
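For contrast, here’s roughly what the textbook greedy loop looks like, written as a sketch in Go rather than the project’s actual code; the `ranks` table and `mergeGreedy` name are mine. It produces correct merges, but it rescans the whole sequence after every merge, which is exactly the quadratic behavior the machinery in the list above exists to avoid.

```go
// A deliberately naive sketch of greedy BPE merging, assuming a hypothetical
// rank table where a lower rank means the merge is preferred.
package bpe

func mergeGreedy(tokens []string, ranks map[[2]string]int) []string {
	for {
		bestRank, bestIdx := -1, -1
		// Scan every adjacent pair for the lowest-ranked (most preferred) merge.
		for i := 0; i+1 < len(tokens); i++ {
			if r, ok := ranks[[2]string{tokens[i], tokens[i+1]}]; ok {
				if bestRank == -1 || r < bestRank {
					bestRank, bestIdx = r, i
				}
			}
		}
		if bestIdx == -1 {
			return tokens // no adjacent pair is in the merge table: done
		}
		// Merge the winning pair in place, then close the gap.
		tokens[bestIdx] += tokens[bestIdx+1]
		tokens = append(tokens[:bestIdx+1], tokens[bestIdx+2:]...)
	}
}
```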
The First Win: Offline Encoder Working
The offline greedy BPE encoder was the first major milestone.
It worked. It matched Hugging Face.
It passed all round-trip tests.
It handled odd Unicode, edge cases, and controlled merges correctly.
This gave me the confidence to start the real challenge.
Streaming.
Where Things Got Spicy: The Streaming Encoder
Streaming is not “offline but in chunks”.
Streaming requires:
- maintaining adjacency across chunk boundaries
- scheduling merges correctly even when pieces arrive out of order
- maintaining liveness invariants
- tracking live version numbers
- updating linked lists of token nodes (see the sketch after this list)
- dynamically adjusting a tail-reserve so merges don’t cross uncommitted boundaries
- rewriting heap candidates without invalidating active merges
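To ground those requirements, here’s a minimal sketch of the per-token bookkeeping such an encoder ends up carrying. The names (`node`, `candidate`, and their fields) are assumptions for illustration, not the repo’s types; the point is the shape: a doubly linked list plus version counters so heap entries can be checked for staleness at pop time.

```go
package stream

// node is the per-token record: tokens live in a doubly linked list so a merge
// can splice its neighbors in O(1), and each node carries a version that is
// bumped whenever the node changes.
type node struct {
	token   int    // current token id
	prev    int    // index of the previous live node; -1 at the head
	next    int    // index of the next live node; -1 at the tail
	version uint32 // incremented on every mutation
	live    bool   // false once this node has been merged away
}

// candidate is what sits in the merge heap. It remembers the versions it saw
// when it was pushed; if either side has changed since, it is stale.
type candidate struct {
	left, right       int // node indices
	rank              int // merge priority; lower fires first
	leftVer, rightVer uint32
}

// Staleness has to be re-checked at pop time, not trusted from push time;
// skipping this check is how "illegal merges" sneak in.
func (c candidate) valid(nodes []node) bool {
	l, r := nodes[c.left], nodes[c.right]
	return l.live && r.live &&
		l.next == c.right && // still adjacent
		l.version == c.leftVer && r.version == c.rightVer
}
```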
This part of the project consumed weeks.
I rewrote large sections and repeatedly reached states where invariants failed in unpredictable ways.
I hit issues like:
- stale heap candidates creating illegal merges
- cross-boundary merges misfiring
- node liveness drifting out of sync
- adjacency pointers breaking in ways I hadn’t anticipated
Eventually, I stepped back.
I realized something important:
Production tokenizers don’t use fully incremental streaming BPE. The complexity simply isn’t worth it.
That was a turning point.
Streaming BPE is beautiful, but the right move for me at the time was to stop and optimize the naive streaming encoder instead.
The Optimization Journey
Once the incremental encoder was deprioritized, something magical happened. I could finally treat tokenization like an optimization problem, not a correctness war.
The naive streaming encoder is simple:
- break input into chunks
- run offline BPE on each chunk
- concatenate results
No cross-boundary merging, but extremely practical.
And most importantly, much easier to optimize.
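As a sketch (with a hypothetical `encodeOffline` passed in; not the repo’s actual signature), the whole strategy fits in a dozen lines:

```go
package stream

import "io"

// encodeStreamingNaive encodes each chunk independently, so no merge ever
// crosses a chunk boundary; that's the trade-off being accepted here.
func encodeStreamingNaive(r io.Reader, chunkSize int, encodeOffline func([]byte) []int) ([]int, error) {
	buf := make([]byte, chunkSize)
	var out []int
	for {
		n, err := r.Read(buf)
		if n > 0 {
			out = append(out, encodeOffline(buf[:n])...)
		}
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
	}
}
```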
Profiling the Naive Streaming Encoder: Where the Time Actually Goes
| Benchmark | Iterations | ns/op | MB/s | B/op | allocs/op |
|---|---|---|---|---|---|
| NaiveEncodeStreaming_4KBChunks | 10 | 462,254,429 | 11.34 | 1,830,937,034 | 2,343,413 |
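For context on where these columns come from: this is Go’s standard benchmark output, where `b.SetBytes` yields the MB/s column and `-benchmem` (or `b.ReportAllocs`) yields B/op and allocs/op. A benchmark of roughly this shape would produce it; the helper and function names here are stand-ins, not the repo’s:

```go
package tokenizer

import "testing"

// Hypothetical benchmark in the shape that produces the table above.
// loadCorpus and NaiveEncodeStreaming are placeholder names.
func BenchmarkNaiveEncodeStreaming_4KBChunks(b *testing.B) {
	data := loadCorpus() // hypothetical: the benchmark corpus as []byte
	b.SetBytes(int64(len(data))) // lets the framework report MB/s
	b.ReportAllocs()             // reports B/op and allocs/op
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = NaiveEncodeStreaming(data, 4096) // hypothetical entry point, 4 KB chunks
	}
}
```

The flamegraphs below come from the usual route: `-cpuprofile`/`-memprofile` on the benchmark run, then `go tool pprof`.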
Here’s the CPU flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

In the flamegraph above, EncodeOffline dominates CPU time, but the real culprits are two internal operations:
- Push/Pop for BucketQueue: responsible for ~35–45% of CPU time and millions of tiny allocations.
- mapaccess (pair-rank lookups): another ~20–25%.
Very little time is spent in actual “tokenization.”
Here’s the memory flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

~84% of all memory allocated during tokenization comes from constructing the BucketQueue itself. From what I gather,
- BucketQueue internally allocates one linked list per rank bucket
- GPT-2 BPE has ~50K ranks
- So the queue allocates tens of thousands of small slices / structs
- This results in millions of small allocations
At this point, I realized that BucketQueue is extremely memory-unfriendly and behaves catastrophically in Go’s allocator model.
Even if we ignore the allocations from constructing the queue (which we shouldn’t given the sheer volume), every push to the queue incurs:
- A fresh struct
- Repeated pointer chasing
- Occasional slice growth in specific buckets
Together these account for another 10% of total memory.
The problem isn’t BPE, it’s the data structure.
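To make that concrete, here’s my reconstruction of what a rank-bucketed queue looks like; it’s a sketch for illustration, not the library’s actual code. The key property is that its footprint scales with the number of ranks, not with the number of live candidates:

```go
package tokenizer

// mergeCand is a queue entry: an adjacent pair and the rank of its merge.
type mergeCand struct {
	left, right int // node indices of the adjacent pair
	rank        int // merge priority; lower fires first
}

// bucketQueue keeps one bucket per possible rank, indexed directly by rank.
// With ~50K GPT-2 ranks, the buckets slice is large before a single candidate
// is pushed, and each bucket grows on its own.
type bucketQueue struct {
	buckets [][]mergeCand
	minRank int // lowest rank that might still hold candidates
}

func newBucketQueue(numRanks int) *bucketQueue {
	// One large allocation per encode, plus per-bucket allocations later.
	return &bucketQueue{buckets: make([][]mergeCand, numRanks)}
}

func (q *bucketQueue) push(c mergeCand) {
	q.buckets[c.rank] = append(q.buckets[c.rank], c) // may allocate or grow this bucket
	if c.rank < q.minRank {
		q.minRank = c.rank
	}
}

func (q *bucketQueue) pop() (mergeCand, bool) {
	for ; q.minRank < len(q.buckets); q.minRank++ {
		if b := q.buckets[q.minRank]; len(b) > 0 {
			c := b[len(b)-1]
			q.buckets[q.minRank] = b[:len(b)-1]
			return c, true
		}
	}
	return mergeCand{}, false
}
```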
Optimization #1: Preallocating Scratch Buffers
The first hotspot I decided to tackle wasn’t even inside the merge loop - it was before the merge loop. Every call to the encoder rebuilt a bunch of internal working buffers (tokens, prev, next, liveVersion) and Go helpfully zeroed all of them. Zeroing memory is fast but not free. And when you’re doing it for thousands of tokens per chunk, that work starts adding up.
The idea with this optimization was to reuse the scratch buffers between invocations and stop Go from silently zeroing everything unless absolutely necessary.
We created two preparation paths:
```go
sc.prepare(n)       // baseline: full slice zero-init
sc.prepareNoZero(n) // optimization
```
The important detail: we still intentionally zero liveVersion, because merge-candidate liveness depends on it. But the other slices (tokens, prev, next) don’t need clearing; we overwrite them entirely in the hot loop anyway. So this optimization reduces four zeroing passes down to one.
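Here’s roughly what those two paths look like, assuming a hypothetical `scratch` struct that owns the buffers named above (the real code may differ in detail, but the shape of the optimization is the same):

```go
package tokenizer

// scratch is a hypothetical container for the working buffers.
type scratch struct {
	tokens, prev, next []int
	liveVersion        []uint32
}

// prepare is the baseline path: fresh (and therefore zeroed) slices every call.
func (sc *scratch) prepare(n int) {
	sc.tokens = make([]int, n)
	sc.prev = make([]int, n)
	sc.next = make([]int, n)
	sc.liveVersion = make([]uint32, n)
}

// prepareNoZero reuses existing capacity and skips zeroing for the slices the
// hot loop fully overwrites; only liveVersion is still cleared, because
// merge-candidate liveness checks depend on it starting at zero.
func (sc *scratch) prepareNoZero(n int) {
	grow := func(s []int) []int {
		if cap(s) < n {
			return make([]int, n) // rare: only when a bigger chunk shows up
		}
		return s[:n] // reuse capacity; contents are left dirty on purpose
	}
	sc.tokens = grow(sc.tokens)
	sc.prev = grow(sc.prev)
	sc.next = grow(sc.next)

	if cap(sc.liveVersion) < n {
		sc.liveVersion = make([]uint32, n)
		return
	}
	sc.liveVersion = sc.liveVersion[:n]
	for i := range sc.liveVersion {
		sc.liveVersion[i] = 0 // the one zeroing pass we keep
	}
}
```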
Benchmark Results
| Metric | Before | After |
|---|---|---|
| ns/op | 453,708,450 | 438,773,696 |
| MB/s | 11.56 | 11.95 |
| B/op | 1,830,935,921 | 1,830,929,172 |
| allocs/op | 2,343,410 | 2,343,386 |
The delta
- ~3.3% faster runtime
- ~3.4% higher throughput
- Allocation counts: basically unchanged
- B/op: unchanged (as expected)
Optimization #2: Reusing the Output Buffer and Eliminating the Final Copy
Similar to the scratch buffer overhead, the next hotspot wasn’t inside the merge loop; it was at the tail of the algorithm, where we package and return the tokens. The naive version of the streaming encoder ended with this pattern:
```go
out := make([]int, 0, n)
for i := head; i != -1; i = next[i] {
	out = append(out, tokens[i])
}
return out
```
There are two main issues at play here.
- A fresh buffer allocation every time
Every chunk, we:
- Allocate a brand-new slice
- Grow it via appends
- Touch memory that the GC now has to track
This is wasteful, because for streaming workloads the shape of the output is predictable. A 4 KB chunk is always going to produce output in the same ballpark. There’s no reason to allocate a new buffer each time when the old one works just fine.
So we replaced this with a reusable buffer:
```go
st.outBuf = st.outBuf[:0]
st.outBuf = append(st.outBuf, tokens...)
```
If the preallocated capacity is large enough (we picked 64K), we avoid:
- new allocations
- slice growth
- GC pressure
- A full linear copy of the final tokens

Even after we were done with all merges, we still copied the token sequence into a new slice before returning it. That’s a full linear pass over the data that adds nothing but latency.
So we introduced a simple switch:
```go
if st.OptNoCopyReturn {
	return st.outBuf // zero-copy return
}
out := make([]int, len(st.outBuf))
copy(out, st.outBuf)
return out
```
When OptNoCopyReturn is enabled, we skip the copy entirely and return a slice header that points directly into the reusable buffer.
For streaming workloads where the consumer immediately processes the tokens, this is perfectly safe and much faster.
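The caveat is aliasing: the returned slice is only valid until the next call. A hypothetical caller (the `Encode` and `sink` names are assumptions, not the repo’s API) has to consume each chunk’s tokens before encoding the next one:

```go
// Hypothetical caller-side pattern for the zero-copy mode: each chunk's tokens
// are fully consumed before Encode runs again and overwrites st.outBuf.
for _, chunk := range chunks {
	toks := st.Encode(chunk) // slice header into st.outBuf; valid until the next Encode
	sink.Write(toks)         // consume immediately, or copy if you need to keep it
}
```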
Benchmark Results
| Metric | Before (after opt-1) | After (opt-1 + outBuf reuse + no-copy return) |
|---|---|---|
| ns/op | 438,773,696 | 424,314,562 |
| MB/s | 11.95 | 12.36 |
| B/op | 1,830,929,172 | 1,799,468,104 |
| allocs/op | 2,343,386 | 2,342,093 |
The delta
- ~6.5% faster than the original baseline (~3.3% over opt-1 alone)
- ~31 MB less memory allocated per encode
- ~1,300 fewer allocations per encode