
There are easier ways to spend your weekends. Building a GPT-2 tokenizer from scratch is not one of them. But somewhere between reading Karpathy’s “Zero to Hero”, benchmarking Go’s runtime behavior at 2AM, and debating whether my slice reallocation was “morally acceptable”, I fell down a rabbit hole.

This post is the story of how that happened: what I built, what broke, what I optimized, and what I learned about systems engineering by reconstructing one of the most universally used components in modern LLMs: the tokenizer.

The BPE Tokenizer

A tokenizer is the first step in almost every modern language model pipeline. Its job is simple in spirit but critical in practice: convert raw text into a sequence of integer IDs that a model can process. For example: "Hello world!" can be encoded as [15496, 995, 0].

These integers correspond to entries in a fixed vocabulary learned during model training.

Every request to a large language model, whether for inference or training, goes through a tokenizer first. That makes tokenization part of the critical path for latency-sensitive systems like:

  • inference servers
  • chat applications
  • streaming generation APIs
  • real-time classification systems

If tokenization is slow, everything downstream is slow.

Most modern LLMs (GPT-2, GPT-3, GPT-4 class models) use Byte Pair Encoding (BPE) or close variants.

At a high level, BPE works like this:

  1. Start with raw text as bytes (values 0–255)
  2. Repeatedly merge the most frequent adjacent byte/token pairs
  3. Each merge produces a new token
  4. Continue until the vocabulary reaches its target size (or, at encode time, until no applicable merges remain)
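
As a toy illustration (not the actual GPT-2 merge order): starting from the bytes of "aaabdaaabac", the most frequent pair "aa" is merged into a new token Z, giving "ZabdZabac"; then "ab" becomes Y, giving "ZYdZYac"; then "ZY" becomes X, giving "XdXac", at which point no pair occurs more than once.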

The result is a vocabulary of:

  • single bytes
  • short character sequences
  • common substrings (e.g. “ing”, “tion”, “http”)

BPE is attractive because it:

  • handles arbitrary UTF-8 text
  • balances vocabulary size vs. expressiveness
  • compresses common patterns efficiently

But it has a downside: tokenization is not just a lookup; it’s an algorithm. At runtime, a BPE tokenizer must repeatedly:

  1. scan adjacent token pairs
  2. check whether (a, b) is mergeable
  3. compare merge ranks
  4. apply the best merge
  5. update neighboring pairs

All of this usually involves thousands of pair lookups per chunk of text.
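
To make that concrete, here’s a minimal sketch of a greedy offline merge pass in Go. It assumes a merge table keyed by the packed pair (a << 32) | b, and it deliberately skips the priority queue, adjacency tracking, and invalidation machinery the rest of this post deals with; the names are illustrative, not the actual implementation.

// mergeRule describes one BPE merge: its priority and the token it produces.
type mergeRule struct {
    rank  int // lower rank = applied earlier
    newID int // token ID the merged pair becomes
}

// encodeGreedy repeatedly applies the best-ranked available merge until no
// adjacent pair is mergeable. Quadratic in the worst case; real encoders use
// a priority queue over candidates, but the result is the same.
func encodeGreedy(tokens []int, merges map[uint64]mergeRule) []int {
    for {
        best, bestIdx := mergeRule{rank: 1 << 62}, -1
        for i := 0; i+1 < len(tokens); i++ {
            key := uint64(tokens[i])<<32 | uint64(tokens[i+1])
            if m, ok := merges[key]; ok && m.rank < best.rank {
                best, bestIdx = m, i
            }
        }
        if bestIdx < 0 {
            return tokens // no applicable merges left
        }
        tokens[bestIdx] = best.newID
        tokens = append(tokens[:bestIdx+1], tokens[bestIdx+2:]...)
    }
}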

Why Rebuild a Tokenizer at All?

If you’re interested in LLM infrastructure, tokenizers are not optional. Most production tokenizers are treated as black boxes, but that abstraction leaks quickly once you start caring about end-to-end latency, memory patterns, and streaming inputs.

I wanted to understand what was actually happening inside. So I rebuilt it: byte-pair encoding, vocab parsing, merges, greedy selection, token mapping, streaming semantics, all of it.

In Go.

Because why not.

The Goal

The goal wasn’t just to encode text. It was to build:

A streaming-friendly GPT-2 tokenizer in Go, with exact round-trip parity.

I wanted:

  • No hidden allocations.
  • No unnecessary copies.
  • Streaming support for chunked text.
  • Benchmarks.

I quickly learned: this is way harder than it sounds.

The Underestimate: BPE is simple, right?

Before this project, I thought tokenization was basically:

  1. Load vocab.
  2. Greedily merge pairs.
  3. Done.

I was wrong. What’s usually glossed over is that BPE merging is quite involved. It requires:

  • priority queues
  • adjacency maintenance
  • invalidation semantics
  • versioning to prevent stale merges
  • dealing with arbitrary Unicode byte sequences
  • ensuring determinism
  • avoiding pathological quadratic behavior
  • and in streaming mode, dealing with chunk boundaries, which are an entire horror movie of their own

The First Win: Offline Encoder Working

The offline greedy BPE encoder was the first major milestone.

It worked. It matched Hugging Face.
It passed all round-trip tests.
It handled odd Unicode, edge cases, and controlled merges correctly.

This gave me the confidence to start the real challenge: streaming.

Where Things Got Spicy: The Streaming Encoder

Streaming is not “offline but in chunks”.

Streaming requires:

  • maintaining adjacency across chunk boundaries
  • scheduling merges correctly even when pieces arrive out of order
  • maintaining liveness invariants
  • tracking live version numbers
  • updating linked lists of token nodes
  • dynamically adjusting a tail-reserve so merges don’t cross uncommitted boundaries
  • rewriting heap candidates without invalidating active merges

This part of the project consumed weeks. I rewrote large sections and repeatedly reached states where invariants failed in unpredictable ways.

I hit issues like:

  • stale heap candidates creating illegal merges
  • cross-boundary merges misfiring
  • node liveness drifting out of sync
  • adjacency pointers breaking in ways I hadn’t anticipated

Eventually, I stepped back and realized something important:

Fully incremental streaming BPE is elegant, but the complexity quickly becomes the main thing you’re managing.

That was a turning point. The right move for me at the time was to stop and optimize the naive streaming encoder instead.

The Optimization Journey

Once the incremental encoder was deprioritized, something magical happened. I could finally treat tokenization like an optimization problem, not a correctness war.

The naive streaming encoder is simple:

  • break input into chunks
  • run offline BPE on each chunk
  • concatenate results

No cross-boundary merging, but extremely practical. And most importantly, much easier to optimize.
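
In code, the naive path is roughly the sketch below. EncodeOffline is the offline encoder that shows up in the flamegraphs later; the Streamer type and method names are illustrative.

// EncodeNaiveStreaming runs the offline encoder on fixed-size chunks and
// concatenates the results. Merges never cross a chunk boundary.
func (st *Streamer) EncodeNaiveStreaming(input []byte, chunkSize int) []int {
    var out []int
    for off := 0; off < len(input); off += chunkSize {
        end := off + chunkSize
        if end > len(input) {
            end = len(input)
        }
        out = append(out, st.EncodeOffline(input[off:end])...)
    }
    return out
}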

A short detour on the benchmarking setup

All measurements in this post come from a single Go benchmark designed to reflect a realistic streaming tokenization workload.

At a high level, the benchmark:

  • feeds the encoder 4 KB chunks of input text
  • runs the streaming encoder end-to-end on each chunk
  • measures wall-clock time, throughput, allocations, and allocation volume

Each benchmark was run:

  • with GOMAXPROCS=1 to isolate single-core behavior
  • using Go’s built-in benchmarking harness (testing.B)
  • with CPU and memory profiles collected via pprof
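
For concreteness, the harness looks roughly like the sketch below (in a _test.go file); loadCorpus, NewStreamingEncoder, and Encode are illustrative stand-ins for the actual fixtures.

func BenchmarkNaiveEncodeStreaming_4KBChunks(b *testing.B) {
    data := loadCorpus()         // benchmark text, loaded once outside the timer
    enc := NewStreamingEncoder() // hypothetical constructor for the streaming encoder
    const chunkSize = 4 * 1024

    b.SetBytes(int64(len(data))) // lets the harness report MB/s
    b.ReportAllocs()             // reports B/op and allocs/op
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for off := 0; off < len(data); off += chunkSize {
            end := off + chunkSize
            if end > len(data) {
                end = len(data)
            }
            _ = enc.Encode(data[off:end])
        }
    }
}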

Profiling the Naive Streaming Encoder: Where the Time Actually Goes

  • Benchmark: BenchmarkNaiveEncodeStreaming_4KBChunks
  • Iterations: 10
  • Time: 540,239,550 ns/op
  • Throughput: 9.70 MB/s
  • Bytes/op: 1,830,942,153
  • Allocs/op: 2,343,402

Here’s the CPU flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

Naive Streaming Encoder

In the flamegraph above, EncodeOffline dominates CPU time, but the real culprits are two internal operations:

  • Push/Pop for BucketQueue: responsible for ~35–45% of CPU time and millions of tiny allocations.
  • mapaccess (pair-rank lookups): another ~20–25%.

Very little time goes to “useful” compute relative to runtime overhead: hash-map lookups and allocation-heavy queue maintenance dominate the merge loop.

Here’s the memory flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

Naive Streaming Encoder

~84% of all memory allocated during tokenization comes from constructing the BucketQueue itself. From what I gather,

  • BucketQueue internally allocates one linked list per rank bucket
  • GPT-2 BPE has ~50K ranks
  • So the queue allocates tens of thousands of small slices / structs
  • This results in millions of small allocations

At this point, I realized that BucketQueue is extremely memory-unfriendly and behaves catastrophically in Go’s allocator model.

Even if we ignore the allocations from constructing the queue (which we shouldn’t, given the sheer volume), every push to the queue incurs:

  • A fresh struct
  • Repeated pointer chasing
  • Occasional slice growth in specific buckets

Together these account for another 10% of total memory. The problem isn’t BPE; it’s the data structure.
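
To make the allocation pattern concrete, here’s a rough sketch of the general shape of such a structure. This is illustrative only, not the actual BucketQueue implementation, but it shows why a bucket-per-rank design is brutal on Go’s allocator.

import "container/list"

// A bucket-queue shape: one linked list per merge rank. With ~50K ranks,
// construction alone creates ~50K small list headers, and every Push adds
// another heap-allocated element the GC has to track.
type bucketQueue struct {
    buckets []*list.List // index = merge rank; lower rank = higher priority
}

func newBucketQueue(numRanks int) *bucketQueue {
    q := &bucketQueue{buckets: make([]*list.List, numRanks)}
    for i := range q.buckets {
        q.buckets[i] = list.New() // one tiny allocation per rank bucket
    }
    return q
}

// Push records a merge candidate at position pos under the given rank.
func (q *bucketQueue) Push(rank, pos int) {
    q.buckets[rank].PushBack(pos) // a fresh list.Element (plus boxing) on every push
}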

Optimization #1: FastLookup

In the naive implementation, every time the encoder considers merging two adjacent tokens (a, b), it performs a lookup into a Go map keyed by (a << 32) | b. In a streaming BPE encoder, this happens thousands of times per 4 KB chunk. Instead of hashing (a, b) on every check, we can trade a small, fixed amount of memory for a direct indexed lookup. We introduce a dense 2D lookup table for common merge pairs:

fastLookup[a][b] -> packed merge info

At runtime, lookup becomes:

// Fast path: both token IDs fall inside the dense table.
if a < N && b < N {
    if info := fastLookup[a][b]; info != sentinel {
        return info
    }
}
// Slow path: fall back to the original map, keyed by (a << 32) | b.
return fallbackMap[uint64(a)<<32|uint64(b)]

The fast lookup table is sized N × N. Increasing N increases the hit rate of the fast path and reduces fallback map lookups; in experiments, larger values did produce additional speedups. However, for the purposes of this post, the goal is not to find a globally optimal cutoff, but to demonstrate an optimization pattern: replacing hash-based lookups in the hot loop with bounded, direct memory access. N = 2048 strikes a reasonable balance for illustrating the idea without introducing excessive memory overhead.
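
For reference, here’s a minimal sketch of how such a table could be built from the existing pair map. The uint32 value type, the sentinel, and the function name are assumptions for illustration; only the (a << 32) | b key scheme comes from the actual encoder.

const N = 2048 // token IDs below N take the fast path

const sentinel = ^uint32(0) // marks "no merge for this pair"

// buildFastLookup densifies merge info for all pairs (a, b) with a, b < N.
// fallbackMap is the original map keyed by (a << 32) | b.
func buildFastLookup(fallbackMap map[uint64]uint32) [][]uint32 {
    table := make([][]uint32, N)
    for a := uint64(0); a < N; a++ {
        row := make([]uint32, N)
        for b := uint64(0); b < N; b++ {
            info, ok := fallbackMap[a<<32|b]
            if !ok {
                info = sentinel
            }
            row[b] = info
        }
        table[a] = row
    }
    return table
}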

Here’s the CPU flamegraph after the first optimization of the naive streaming encoder (4 KB chunks, single-core).

Naive Streaming Encoder with Opt 1

Benchmark Results

Before:

540,239,550 ns/op
9.70 MB/s
1,830,942,153 B/op
2,343,402 allocs/op

After:

383,143,150 ns/op
13.68 MB/s
1,830,926,814 B/op
2,343,350 allocs/op

The Delta:

- ~29% less time per op (a ~1.41x speedup)
- ~41% higher throughput
- Allocation count: unchanged
- B/op: unchanged (expected)

Optimization #2: Preallocating Scratch Buffers

After addressing pair lookup in the hot loop, the next source of overhead showed up before the merge loop even begins. Each call to the encoder rebuilds a set of internal scratch buffers: tokens, prev, next, liveVersion.

In the naive implementation, these slices are freshly prepared on every invocation, and Go helpfully zero-initializes all of them. Zeroing memory is fast but not free, and in our case it showed up as a small but measurable component of per-chunk overhead once the larger hotspots were removed.

The key observation here is that most of these buffers do not actually need to be cleared. During encoding:

  • tokens, prev, and next are fully overwritten in the hot loop
  • only liveVersion relies on a known initial state, since merge-candidate liveness depends on it

This makes it possible to reuse scratch buffers across invocations and avoid unnecessary zeroing. To exploit that, we introduce two preparation paths:

sc.prepare(n)         // baseline: full slice zero-init
sc.prepareNoZero(n)   // optimization
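
Here’s a minimal sketch of what those two paths could look like. The field types, the grow helper, and the exact reset rules are illustrative; only the buffer names mirror the ones listed above.

type scratch struct {
    tokens, prev, next []int
    liveVersion        []uint32
}

// prepare is the baseline path: allocate (and therefore zero) every buffer.
func (sc *scratch) prepare(n int) {
    sc.tokens = make([]int, n)
    sc.prev = make([]int, n)
    sc.next = make([]int, n)
    sc.liveVersion = make([]uint32, n)
}

// prepareNoZero reuses existing capacity and skips zeroing the buffers the
// hot loop fully overwrites; only liveVersion is reset, since merge-candidate
// liveness depends on its initial state.
func (sc *scratch) prepareNoZero(n int) {
    sc.tokens = grow(sc.tokens, n)
    sc.prev = grow(sc.prev, n)
    sc.next = grow(sc.next, n)
    sc.liveVersion = grow(sc.liveVersion, n)
    for i := range sc.liveVersion {
        sc.liveVersion[i] = 0
    }
}

// grow returns a length-n slice, reusing the old backing array when possible.
func grow[T any](s []T, n int) []T {
    if cap(s) >= n {
        return s[:n]
    }
    return make([]T, n)
}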

Benchmark Results

Before:

383,143,150 ns/op
13.68 MB/s
1,830,926,814 B/op
2,343,350 allocs/op

After:

380,274,475 ns/op
13.79 MB/s
1,830,930,540 B/op
2,343,360 allocs/op

The Delta:

- ~1% faster runtime
- ~1% higher throughput
- Allocation count: unchanged
- B/op: unchanged (expected)

Naive Streaming Encoder with Opt 2

The CPU flamegraph shows a small reduction in runtime overhead associated with slice preparation and memory clearing.

This particular optimization doesn’t move the needle nearly as much as eliminating hash-map lookups, but it removes unnecessary work from a performance-critical path at essentially zero cost in complexity.

Optimization #3: Reusing the Output Buffer and Eliminating the Final Copy

Similar to the scratch buffer overhead, the next hotspot wasn’t inside the merge loop; it was at the tail of the algorithm, where we package and return the tokens. The naive version of the streaming encoder ended with this pattern:

out := make([]int, 0, n)
for i := head; i != -1; i = next[i] {
    out = append(out, tokens[i])
}
return out

There are two main issues at play here.

1. A fresh buffer allocation every time

On every chunk, we:

  • Allocate a brand-new slice
  • Grow it via appends
  • Touch memory that the GC now has to track

This is wasteful, because for streaming workloads the shape of the output is predictable. A 4 KB chunk is always going to produce output in the same ballpark. There’s no reason to allocate a new buffer each time when the old one works just fine.

So we replaced this with a reusable buffer:

st.outBuf = st.outBuf[:0]
for i := head; i != -1; i = next[i] {
    st.outBuf = append(st.outBuf, tokens[i])
}

If the preallocated capacity is large enough (we picked 64K), we avoid:

  • new allocations
  • slice growth
  • GC pressure

2. A full linear copy of the final tokens

Even after we were done with all merges, we still copied the token sequence into a new slice before returning it. That’s a full linear pass over the data that adds nothing but latency.

So we introduced a simple switch:

if st.OptNoCopyReturn {
    return st.outBuf      // zero-copy return
}
out := make([]int, len(st.outBuf))
copy(out, st.outBuf)
return out

When OptNoCopyReturn is enabled, we skip the copy entirely and return a slice header that points directly into the reusable buffer.

For streaming workloads where the consumer processes (or copies) the tokens before the next call into the encoder, this is safe and skips an entire pass over the output.
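
A quick usage sketch of that contract (the loop shape and names are illustrative): the returned slice aliases the internal buffer, so the caller has to finish with it, or copy it, before the next call.

for chunk := range chunks {
    toks := st.EncodeChunk(chunk) // with OptNoCopyReturn, toks aliases st.outBuf
    emit(toks)                    // consume (or copy) before the next EncodeChunk call,
                                  // which reuses and overwrites the same buffer
}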

Naive Streaming Encoder with Opt 3: Memory Flamegraph

The memory flamegraph shows a clear reduction in allocation pressure at the tail of the encoding pipeline, reflecting the elimination of per-chunk output allocations and copies.

Naive Streaming Encoder with Opt 3: CPU Flamegraph

The CPU flamegraph shows a small reduction in runtime overhead associated with allocation, slice growth, and memory copying. The core merge logic remains unchanged.

Benchmark Results

Before:

380,274,475 ns/op
13.79 MB/s
1,830,930,540 B/op
2,343,360 allocs/op

After:

375,547,250 ns/op
13.96 MB/s
1,799,469,576 B/op
2,342,068 allocs/op

The Delta:

- ~1.2% faster runtime
- ~31 MB less memory allocated per encode
- ~1,300 fewer allocations per encode

Putting it all together

Starting from a naive streaming BPE encoder, we incrementally applied three optimizations, each targeting a different class of overhead:

  1. FastLookup routed most merge-pair lookups through a bounded 2D table, with a fallback map for the remaining cases.
  2. Scratch buffer reuse trimmed redundant setup work before encoding begins.
  3. Output buffer reuse and zero-copy return eliminated avoidable allocations and copying at the tail.

Overall, compared to the baseline:

  • ~30% less time per op (a ~1.44x overall speedup)
  • ~44% higher throughput
  • ~31 MB less memory allocated per encode
  • ~1,300 fewer allocations per encode

What surprised me most while working on this was that even after getting a correct, naive implementation of BPE in place, a lot of the remaining difficulty lived outside the algorithm itself. The tricky part was everything around it: how often certain paths execute, where allocations sneak in, and how small, reasonable choices compound when they sit in a hot loop.

Over the next few posts, I’ll zoom out from tokenization and look at KV caches and inference-time optimizations, and how those systems interact with tokenization in practice.