Building a GPT-2 Tokenizer in Go
I rebuilt a GPT-2 BPE tokenizer from scratch in Go – offline encoding, streaming support, and three rounds of optimization. This post covers what I built, what broke, and what the benchmarks revealed.
The BPE Tokenizer
A tokenizer is the first step in almost every modern language model pipeline. Its job is simple in spirit but critical in practice: convert raw text into a sequence of integer IDs that a model can process. For example: "Hello world!" can be encoded as [15496, 995, 0]. These integers correspond to entries in a fixed vocabulary learned during model training.
Every request to a large language model, whether for inference or training, goes through a tokenizer first. That makes tokenization part of the critical path for latency-sensitive systems like:
- inference servers
- chat applications
- streaming generation APIs
- real-time classification systems
If tokenization is slow, everything downstream is too. Most modern LLMs (GPT-2, GPT-3, and GPT-4-class models) use Byte Pair Encoding (BPE) or close variants.
At a high level, BPE works like this:
- Start with raw text as bytes (values 0–255)
- Repeatedly merge the most frequent adjacent byte/token pairs
- Each merge produces a new token
- Continue until no more applicable merges exist
The result is a vocabulary of:
- single bytes
- short character sequences
- common substrings (e.g. “ing”, “tion”, “http”)
BPE handles arbitrary UTF-8 text, balances vocabulary size vs expressiveness, and compresses common patterns efficiently, but it has a downside: tokenization is not just a lookup, it’s an algorithm. At runtime, a BPE tokenizer must repeatedly:
- scan adjacent token pairs
- check whether (a, b) is mergeable
- compare merge ranks
- apply the best merge
- update neighboring pairs

All of this usually involves thousands of pair lookups per chunk of text.
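To make the runtime loop concrete, here is a deliberately naive sketch in Go. The pair representation and names are mine, not the post's code, and the repeated linear rescan is exactly the pathological quadratic behavior a real encoder must engineer away with a priority queue:

```go
package main

import "fmt"

// pair identifies two adjacent tokens considered for merging.
type pair struct{ a, b int }

// encodeGreedy repeatedly finds the adjacent pair with the lowest merge
// rank and replaces it with its merged token ID, until no adjacent pair
// appears in the merge table. ranks maps a pair to its merge priority
// (lower = earlier merge); merged maps a pair to its new token ID.
func encodeGreedy(tokens []int, ranks, merged map[pair]int) []int {
	for {
		best, bestRank := -1, int(^uint(0)>>1) // leftmost best pair, max int
		for i := 0; i+1 < len(tokens); i++ {
			if r, ok := ranks[pair{tokens[i], tokens[i+1]}]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			return tokens // no applicable merges remain
		}
		// Replace the pair with its merged token and close the gap.
		tokens[best] = merged[pair{tokens[best], tokens[best+1]}]
		tokens = append(tokens[:best+1], tokens[best+2:]...)
	}
}

func main() {
	// Toy vocabulary: merge (97, 98) -> 256 first, then (256, 256) -> 257.
	ranks := map[pair]int{{97, 98}: 0, {256, 256}: 1}
	merged := map[pair]int{{97, 98}: 256, {256, 256}: 257}
	fmt.Println(encodeGreedy([]int{97, 98, 97, 98}, ranks, merged)) // [257]
}
```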
Why Rebuild a Tokenizer at All?
If you’re interested in LLM infrastructure, tokenizers are part of the critical path. Most production tokenizers are treated as black boxes, but that abstraction leaks quickly once you start caring about end-to-end latency, memory patterns, and streaming inputs.
I wanted to understand what was actually happening inside. So I rebuilt it in Go – byte-pair encoding, vocab parsing, merges, greedy selection, token mapping, streaming semantics, all of it.
The Goal
The goal was a streaming-friendly GPT-2 tokenizer in Go with exact round-trip parity, minimal allocations, no unnecessary copies, and proper benchmarks.
The Underestimate: BPE is simple, right?
BPE looks simple on the surface – load vocab, greedily merge pairs, done – but the actual implementation involves:
- priority queues
- adjacency maintenance
- invalidation semantics
- versioning to prevent stale merges
- dealing with arbitrary Unicode byte sequences
- ensuring determinism
- avoiding pathological quadratic behavior
- and in streaming mode, dealing with chunk boundaries, which are an entire horror movie of their own
The First Win: Offline Encoder Working
The offline greedy BPE encoder came first. It matched Hugging Face’s output, passed all round-trip tests, and handled odd Unicode and edge cases correctly. With that working, the next step was streaming.
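Round-trip parity is easy to state as a property: decoding the encoding of any input must reproduce the input byte-for-byte. A minimal harness sketch, with a byte-identity codec standing in for the real encoder/decoder pair:

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip reports whether decoding the encoding of s reproduces s
// exactly. encode and decode are stand-ins for the real BPE pair.
func roundTrip(s []byte, encode func([]byte) []int, decode func([]int) []byte) bool {
	return bytes.Equal(decode(encode(s)), s)
}

func main() {
	// Byte-identity codec: each byte becomes its own token.
	encode := func(p []byte) []int {
		out := make([]int, len(p))
		for i, b := range p {
			out[i] = int(b)
		}
		return out
	}
	decode := func(toks []int) []byte {
		out := make([]byte, len(toks))
		for i, t := range toks {
			out[i] = byte(t)
		}
		return out
	}
	// Multi-byte UTF-8 input exercises the "odd Unicode" cases too.
	fmt.Println(roundTrip([]byte("héllo 🎉"), encode, decode)) // true
}
```

Because BPE operates on raw bytes, this property holds for arbitrary input, not just valid UTF-8.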
The Streaming Encoder
Streaming requires:
- maintaining adjacency across chunk boundaries
- scheduling merges correctly even when pieces arrive out of order
- maintaining liveness invariants
- tracking live version numbers
- updating linked lists of token nodes
- dynamically adjusting a tail-reserve so merges don’t cross uncommitted boundaries
- rewriting heap candidates without invalidating active merges
I ran into issues like:
- stale heap candidates creating illegal merges
- cross-boundary merges misfiring
- node liveness drifting out of sync
- adjacency pointers failing in ways I wasn’t aware of
I dropped the incremental approach since the complexity was outpacing the benefit, and focused on optimizing the naive streaming encoder instead.
The Optimization Journey
With the incremental encoder shelved, I could focus on optimization instead of chasing invariant violations.
The naive streaming encoder is simple:
- break input into chunks
- run offline BPE on each chunk
- concatenate results
No cross-boundary merging, but extremely practical and much easier to optimize.
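The loop above can be sketched in a few lines; the identity stand-in for the offline encoder and the function names here are illustrative, not the post's actual API:

```go
package main

import "fmt"

// EncodeStream runs offline BPE independently on each fixed-size chunk
// and concatenates the results. Merges never cross a chunk boundary,
// which can split tokens at the seams, but the loop stays trivial.
func EncodeStream(input []byte, chunkSize int, encodeOffline func([]byte) []int) []int {
	var out []int
	for len(input) > 0 {
		n := chunkSize
		if n > len(input) {
			n = len(input) // final, possibly short, chunk
		}
		out = append(out, encodeOffline(input[:n])...)
		input = input[n:]
	}
	return out
}

func main() {
	// Byte-identity stand-in for the offline encoder.
	id := func(p []byte) []int {
		out := make([]int, len(p))
		for i, b := range p {
			out[i] = int(b)
		}
		return out
	}
	fmt.Println(EncodeStream([]byte("hello"), 2, id)) // [104 101 108 108 111]
}
```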
A short detour on the benchmarking setup
All measurements in this post come from a single Go benchmark designed to reflect a realistic streaming tokenization workload.
At a high level, the benchmark:
- feeds the encoder 4 KB chunks of input text
- runs the streaming encoder end-to-end on each chunk
- measures wall-clock time, throughput, allocations, and allocation volume
Each benchmark was run:
- with GOMAXPROCS=1 to isolate single-core behavior
- using Go’s built-in benchmarking harness (testing.B)
- with CPU and memory profiles collected via pprof
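For reference, a self-contained sketch of such a harness. In a real `_test.go` file this would be a `Benchmark…` function run via `go test -bench` (with `GOMAXPROCS=1` and the pprof flags), but `testing.Benchmark` lets the same idea run as a plain program; the stub encoder is mine, not the post's:

```go
package main

import (
	"fmt"
	"testing"
)

// encodeOffline is a stub standing in for the real offline BPE pass:
// it just maps each byte to its own token ID.
func encodeOffline(p []byte) []int {
	out := make([]int, len(p))
	for i, b := range p {
		out[i] = int(b)
	}
	return out
}

func main() {
	chunk := make([]byte, 4096) // one 4 KB chunk, as in the post's setup
	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(chunk))) // makes the harness report MB/s
		b.ReportAllocs()              // makes it report B/op and allocs/op
		for i := 0; i < b.N; i++ {
			_ = encodeOffline(chunk)
		}
	})
	fmt.Println(res.String(), res.MemString())
}
```

`b.SetBytes` is what turns raw ns/op into the MB/s column, and `b.ReportAllocs` is what produces the B/op and allocs/op columns in the tables below.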
Profiling the Naive Streaming Encoder: Where the Time Actually Goes
| Benchmark | Iterations | Time (ns/op) | Throughput (MB/s) | Bytes/op | Allocs/op |
|---|---|---|---|---|---|
| BenchmarkNaiveEncodeStreaming_4KBChunks | 10 | 540,239,550 | 9.70 | 1,830,942,153 | 2,343,402 |
Here’s the CPU flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

In the flamegraph above, EncodeOffline dominates CPU time, but the real culprits are two internal operations:
- Push/Pop for BucketQueue: responsible for ~35–45% of CPU time and millions of tiny allocations.
- mapaccess (pair-rank lookups): another ~20–25%.
Hash-map lookups and allocation-heavy queue maintenance dominate the merge loop, leaving very little time for the actual merge work.
Here’s the memory flamegraph from the baseline naive streaming encoder (4 KB chunks, single-core).

~84% of all memory allocated during tokenization comes from constructing the BucketQueue itself. From what I gather:
- BucketQueue internally allocates one linked list per rank bucket
- GPT-2 BPE has ~50K ranks
- So the queue allocates tens of thousands of small slices / structs
- This results in millions of small allocations
At this point, I realized that BucketQueue is a poor fit for Go’s allocator model. Even if we ignore the allocations from constructing the queue (which we shouldn’t given the sheer volume), every push to the queue incurs:
- A fresh struct
- Repeated pointer chasing
- Occasional slice growth in specific buckets
Together these account for another 10% of total memory. The problem isn’t BPE; it’s the data structure.
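To see why, here is a hypothetical reconstruction of a rank-bucketed queue along these lines. This is not the actual implementation, but it makes the allocation pattern visible: one bucket per rank up front, plus possible slice growth on every Push:

```go
package main

import "fmt"

// candidate is a potential merge at some position in the token stream.
type candidate struct {
	pos     int
	version int // used to detect stale entries
}

// BucketQueue keeps one bucket per merge rank; Pop scans for the
// lowest non-empty rank. With ~50K GPT-2 ranks, the bucket slice
// alone is a large allocation before any work happens.
type BucketQueue struct {
	buckets [][]candidate
	min     int // lowest rank that might be non-empty
}

func NewBucketQueue(numRanks int) *BucketQueue {
	return &BucketQueue{buckets: make([][]candidate, numRanks), min: numRanks}
}

func (q *BucketQueue) Push(rank int, c candidate) {
	q.buckets[rank] = append(q.buckets[rank], c) // may grow this bucket
	if rank < q.min {
		q.min = rank
	}
}

func (q *BucketQueue) Pop() (rank int, c candidate, ok bool) {
	for ; q.min < len(q.buckets); q.min++ { // advance past empty buckets
		if b := q.buckets[q.min]; len(b) > 0 {
			c = b[len(b)-1]
			q.buckets[q.min] = b[:len(b)-1]
			return q.min, c, true
		}
	}
	return 0, candidate{}, false
}

func main() {
	q := NewBucketQueue(8)
	q.Push(5, candidate{pos: 10})
	q.Push(2, candidate{pos: 20})
	r, _, _ := q.Pop()
	fmt.Println(r) // 2: the lowest-rank candidate comes out first
}
```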
Optimization #1: FastLookup
In the naive implementation, every time the encoder considers merging two adjacent tokens (a, b), it performs a lookup into a Go map keyed by (a << 32) | b. In a streaming BPE encoder, this happens thousands of times per 4 KB chunk. Instead of hashing (a, b) on every check, we can trade a small, fixed amount of memory for a direct indexed lookup. We introduce a dense 2D lookup table for common merge pairs:
fastLookup[a][b] -> packed merge info
At runtime, lookup becomes:
if a < N && b < N {
    info := fastLookup[a][b]
    if info != sentinel {
        return info
    }
}
return fallbackMap[key] // key = (a << 32) | b, as before
The fast lookup table is sized (N, N). Increasing N increases the hit rate of the fast path and reduces fallback map lookups and in fact, larger values did produce additional speedups in experiments. However, for the purposes of this post, the goal is not to find a globally optimal cutoff, but to demonstrate an optimization pattern: replacing hash-based lookups in the hot loop with bounded, direct memory access. N = 2048 strikes a reasonable balance for illustrating the idea without introducing excessive memory overhead.
Here’s the CPU flamegraph after the first optimization of the naive streaming encoder (4 KB chunks, single-core).

Benchmark Results
Before:
540,239,550 ns/op
9.70 MB/s
1,830,942,153 B/op
2,343,402 allocs/op
After:
383,143,150 ns/op
13.68 MB/s
1,830,926,814 B/op
2,343,350 allocs/op
The Delta:
~29% faster runtime (a 1.41× speedup)
~41% higher throughput
Allocation count: unchanged
B/op: unchanged (expected)
Optimization #2: Preallocating Scratch Buffers
After addressing pair lookup in the hot loop, the next source of overhead showed up before the merge loop even begins. Each call to the encoder rebuilds a set of internal scratch buffers: tokens, prev, next, liveVersion.
In the naive implementation, these slices are freshly prepared on every invocation, and Go helpfully zero-initializes all of them. Zeroing memory is fast, but not free and, in our case, it showed up as a small, but measurable, component of per-chunk overhead once the larger hotspots were removed.
The key observation here is that most of these buffers do not actually need to be cleared. During encoding:
- tokens, prev, and next are fully overwritten in the hot loop
- only liveVersion relies on a known initial state, since merge-candidate liveness depends on it

This makes it possible to reuse scratch buffers across invocations and avoid unnecessary zeroing. To optimize, we introduce two preparation paths:
sc.prepare(n) // baseline: full slice zero-init
sc.prepareNoZero(n) // optimization
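A sketch of what the two paths might look like, assuming a scratch struct holding the four buffers listed above; the layout and helper names are mine, not the post's code:

```go
package main

import "fmt"

// scratch holds per-encode working buffers reused across calls.
// Field names follow the post; the layout is an assumption.
type scratch struct {
	tokens, prev, next []int
	liveVersion        []uint32
}

// grow resizes every buffer to length n, reusing capacity when possible.
func (sc *scratch) grow(n int) {
	if cap(sc.tokens) < n {
		sc.tokens = make([]int, n)
		sc.prev = make([]int, n)
		sc.next = make([]int, n)
		sc.liveVersion = make([]uint32, n)
	}
	sc.tokens, sc.prev, sc.next = sc.tokens[:n], sc.prev[:n], sc.next[:n]
	sc.liveVersion = sc.liveVersion[:n]
}

// prepare is the baseline: clear everything, paying to zero memory
// the hot loop would overwrite anyway.
func (sc *scratch) prepare(n int) {
	sc.grow(n)
	for i := 0; i < n; i++ {
		sc.tokens[i], sc.prev[i], sc.next[i], sc.liveVersion[i] = 0, 0, 0, 0
	}
}

// prepareNoZero clears only liveVersion, which needs a known initial
// state; tokens, prev, and next are fully overwritten later.
func (sc *scratch) prepareNoZero(n int) {
	sc.grow(n)
	for i := 0; i < n; i++ {
		sc.liveVersion[i] = 0
	}
}

func main() {
	sc := &scratch{}
	sc.prepareNoZero(4)
	sc.tokens[0], sc.liveVersion[0] = 42, 9
	sc.prepareNoZero(4) // reuse: tokens keeps stale data, liveVersion resets
	fmt.Println(sc.tokens[0], sc.liveVersion[0]) // 42 0
}
```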
Benchmark Results
Before:
383,143,150 ns/op
13.68 MB/s
1,830,926,814 B/op
2,343,350 allocs/op
After:
380,274,475 ns/op
13.79 MB/s
1,830,930,540 B/op
2,343,360 allocs/op
The Delta:
- ~1% faster runtime
- ~1% higher throughput
- Allocation count: unchanged
- B/op: unchanged (expected)

The CPU flamegraph shows a small reduction in runtime overhead associated with slice preparation and memory clearing.
This particular optimization doesn’t move the needle nearly as much as eliminating hash-map lookups, but it removes unnecessary work from the hot path for essentially no added complexity.
Optimization #3: Reusing the Output Buffer and Eliminating the Final Copy
Similar to the scratch buffer overhead, the next hotspot wasn’t inside the merge loop; it was at the tail of the algorithm, where we package and return the tokens. The naive version of the streaming encoder ended with this pattern:
out := make([]int, 0, n)
for i := head; i != -1; i = next[i] {
    out = append(out, tokens[i])
}
return out
There are two main issues at play here.
1. A fresh buffer allocation every time
On every chunk, we:
- Allocate a brand-new slice
- Grow it via appends
- Touch memory that the GC now has to track
This is wasteful, because for streaming workloads the shape of the output is predictable. A 4 KB chunk is always going to produce output in the same ballpark. There’s no reason to allocate a new buffer each time when the old one works just fine.
So we replaced this with a reusable buffer:
st.outBuf = st.outBuf[:0]
st.outBuf = append(st.outBuf, tokens...)
If the preallocated capacity is large enough (we picked 64K), we avoid:
- new allocations
- slice growth
- GC pressure
2. A full linear copy of the final tokens
Even after we were done with all merges, we still copied the token sequence into a new slice before returning it. That’s a full linear pass over the data that adds nothing but latency.
So we introduced a simple switch:
if st.OptNoCopyReturn {
    return st.outBuf // zero-copy return
}
out := make([]int, len(st.outBuf))
copy(out, st.outBuf)
return out
When OptNoCopyReturn is enabled, we skip the copy entirely and return a slice header that points directly into the reusable buffer.
For streaming workloads where the consumer immediately processes the tokens, this is perfectly safe and much faster.
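One consequence worth making explicit: the returned slice aliases the encoder's internal buffer, so it is only valid until the next call. A toy stand-in encoder (byte-identity, not the real one) shows the aliasing:

```go
package main

import "fmt"

// enc mimics the reuse pattern: Encode returns a slice that aliases
// outBuf, which is truncated and refilled on every call.
type enc struct{ outBuf []int }

func (e *enc) Encode(chunk []byte) []int {
	e.outBuf = e.outBuf[:0]
	for _, b := range chunk {
		e.outBuf = append(e.outBuf, int(b))
	}
	return e.outBuf // zero-copy: caller must consume before the next call
}

func main() {
	e := &enc{outBuf: make([]int, 0, 64)}
	first := e.Encode([]byte("ab"))
	fmt.Println(first[0]) // 97: safe, consumed before the next call
	e.Encode([]byte("cd"))
	fmt.Println(first[0]) // 99: "first" now sees the new chunk's data
}
```

This is exactly the trade-off the OptNoCopyReturn switch encodes: callers that process tokens immediately get the fast path, and callers that retain slices keep the copying behavior.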

The memory flamegraph shows a clear reduction in allocation pressure at the tail of the encoding pipeline, reflecting the elimination of per-chunk output allocations and copies.

The CPU flamegraph shows a small reduction in runtime overhead associated with allocation, slice growth, and memory copying. The core merge logic remains unchanged.
Benchmark Results
Before:
380,274,475 ns/op
13.79 MB/s
1,830,930,540 B/op
2,343,360 allocs/op
After:
375,547,250 ns/op
13.96 MB/s
1,799,469,576 B/op
2,342,068 allocs/op
The Delta:
- ~1.2% faster runtime
- ~31 MB less memory allocated per encode
- ~1,300 fewer allocations per encode
Putting it all together
Starting from a naive streaming BPE encoder, we incrementally applied three optimizations, each targeting a different class of overhead:
- FastLookup routed most merge-pair lookups through a bounded 2D table, with a fallback map for the remaining cases.
- Scratch buffer reuse trimmed redundant setup work before encoding begins.
- Output buffer reuse and zero-copy return eliminated avoidable allocations and copying at the tail.
Overall, compared to the baseline:
- ~30% faster runtime (a 1.44× speedup)
- ~44% higher throughput
- ~31 MB less memory allocated per encode
- ~1,300 fewer allocations per encode
Most of the remaining difficulty lived outside the algorithm itself – how often certain paths execute and where allocations sneak in.
Over the next few posts, I’ll zoom out from tokenization and look at KV caches and inference-time optimizations, and how those systems interact with tokenization in practice.