Clustered Thoughts

Preempting the Prefill, Part 3: Results & Benchmark

Paired-trial benchmarks of FlowPrefill on Llama 70B, P/D-disaggregated. A qualified null below saturation, and the argument for why no amount of FlowPrefill would have helped there.

9 min read

vLLM · LLM Infrastructure
Preempting the Prefill, Part 2: Build

Implementing FlowPrefill in vLLM: the urgency math, the components, the policy on top of them, and the subtle races that nearly broke it all.

14 min read

vLLM · LLM Infrastructure
Preempting the Prefill, Part 1: Context

Why TTFT SLOs are hard to meet under contention, and what the FlowPrefill paper proposes to do about it. Setup for a three-part series on implementing the idea in vLLM.

7 min read

vLLM · LLM Infrastructure
Peer-to-Peer Caching for FUSE-Backed Content Stores, Part 1

Measuring the per-op cost of going through FUSE versus a kernel filesystem, as groundwork for a peer-to-peer blob-sharing layer.

9 min read

FUSE, Distributed Systems, Performance, Go · Distributed Systems, Systems Programming
A Scheduler as a Lens into LLM Inference

Building a job scheduler in Go and using it as a lens into LLM inference scheduling — tracing every design decision back to its vLLM parallel.

9 min read

Go, Schedulers, vLLM, LLM Inference · LLM Infrastructure, Systems Programming