Building an LSM Storage Engine from Scratch in Go, Part 4

Parts 1 through 3 laid down the pieces: a skiplist memtable, a WAL with group commit, bloom filters, and leveled compaction across per-level directories. Part 4 is where we check whether our design decisions hold up under load, and whether we actually hit the 10K write QPS target set in Part 1.

The Benchmark Setup

The obvious way to benchmark here is to spin up N client goroutines, each in a loop: send a request, wait for the response, record the latency, repeat. This is a closed-loop setup, and it systematically lies about latency under load.

The problem is coordinated omission (when the server slows down, the client slows down with it). If a request takes 500ms instead of the expected 5ms, the client waits 500ms before firing the next one. A real caller would have already queued up the next batch of requests in that window, but the benchmark never sees them, so the latency percentiles only reflect requests the server managed to serve (Scylla has a great writeup on this).

The fix is an open-loop load generator that fires requests at a fixed rate regardless of whether previous ones completed. The setup here is a single goroutine with a ticker: every 1ms it pushes N tokens into a buffered channel, where each token carries the timestamp at which it should have fired. Worker goroutines pull tokens off the channel and fire a gRPC call each. Latency is measured as time.Since(token.ts), so it includes the time the token spent sitting in the channel.

[Animation: load gen tick → token (with timestamp) → buffered channel → in-flight (gRPC)]

The animation above shows how the benchmark setup works. Under light load the channel stays empty and latency is just service time; under overload it fills up, tokens accumulate wait time, and the numbers reflect the full delay a caller would see.

All the numbers below come from an Apple M2 laptop running Go 1.21, a single node with no replication, 19-byte keys, and 256-byte random values. The WAL uses a 5ms group commit tick.

Throughput vs Latency

With a reasonable measurement harness, we can finally answer: how much load can the engine take without blasting through our latency objectives? Part 1 set a target of 10K write QPS, and a 20ms p99 is a reasonable latency budget for a KV write. The benchmark goes through target rates from 5K to 100K writes per second (in 5K steps), measuring p50, p99, and p999 at each step.

[Chart: p50, p99, and p999 latency vs. target write rate, with the 20ms SLO marked. At 15K/s: p50 6.1ms, p99 9.4ms, p999 42.9ms.]

p99 stays under 20ms up to about 30K writes per second, clearing the Part 1 target by roughly 3x (caveat: no replication in the loop; adding that is future work). Past that, the latency climbs into hundreds of milliseconds, and by 90K even p50 is past 100ms. This is where the open-loop methodology earns its keep: the channel saturates, tokens accumulate real wait time, and the numbers reflect the full pain a caller would feel.

Wrapping Up

The LSM storage engine is a foundation, and a pretty cool one at that (for me, personally, crash recovery was the most fun to work through). The benchmarks raise as many questions as they answer: what does compaction cost us in write amplification? How does the read path hold up under load? But those are stories for another day.

Code is on GitHub.



