-
Preempting the Prefill, Part 3: Results & Benchmark
Paired-trial benchmarks of FlowPrefill on Llama 70B with prefill/decode (P/D) disaggregation. A qualified null result below saturation, and the argument for why no amount of FlowPrefill would have helped there.
-
Preempting the Prefill, Part 2: Build
Implementing FlowPrefill in vLLM: the urgency math, the components, the policy on top of them, and the subtle races that nearly broke it all.
-
Preempting the Prefill, Part 1: Context
Why TTFT SLOs are hard to meet under contention, and what the FlowPrefill paper proposes to do about it. Setup for a three-part series on implementing the idea in vLLM.
-
Peer-to-Peer Caching for FUSE-Backed Content Stores, Part 1
Measuring the per-op cost of going through FUSE versus a kernel filesystem, as groundwork for a peer-to-peer blob-sharing layer.
-
A Scheduler as a Lens into LLM Inference
Building a job scheduler in Go and using it as a lens into LLM inference scheduling, tracing every design decision back to its vLLM parallel.