Preempting the Prefill, Part 3: Results & Benchmark

Part 2 walked through the components that make slack-aware preemption work in vLLM, the design decisions that diverge from the paper, and the reasoning behind each. This last post in the series covers the benchmark setup and what we measured.

Benchmark design

Section             | Parameter         | Value
Hardware & topology | Hardware          | 6× A100 SXM4 80GB on RunPod
                    | Model             | Llama 70B (bf16)
                    | Prefill           | TP = 4
                    | Decode            | TP = 2
                    | KV transfer       | NIXL
Workload            | Prompts           | ShareGPT, 256–8000 tokens
                    | Decode cap        | 4 tokens per request (TTFT is the metric)
                    | Tier mix          | 20% urgent, 80% generous
                    | Urgent SLO band   | [1.0, 1.3] × baseline TTFT
                    | Generous SLO band | [3, 10] × baseline TTFT
                    | Arrivals          | Poisson
Predictor           | TTFT predictor    | a = 0.173 ms/tok, c = 34 ms (refit per model/hardware/TP)
Measurement         | Window            | 30 s warmup + 300 s recorded
                    | Policies          | control (FCFS), conservative, aggressive
                    | Trials            | 5 paired per (rate, policy) cell

A few notes on the setup itself:

  • TP split: the 4-shard prefill / 2-shard decode split follows from capping each request at 4 decode tokens. TTFT is the metric, so per-request decode work is bounded and the decode node doesn’t need as many shards as the prefill node, where compute dominates.
  • KV transfer via NIXL: prefill and decode live on physically separate GPUs, so KV blocks have to move across once prefill finishes (NIXL is vLLM’s standard RDMA connector for that handoff).
  • ShareGPT for prompts: real ChatGPT conversations, and the de facto vLLM benchmark workload (right length distribution and the community precedent that makes the numbers comparable).
  • SLO bands: multipliers on each request’s own baseline TTFT, so an urgent request at 500 tokens and one at 5000 tokens both get a band that’s tight against their own work, not against a single absolute number that’s wrong for both.
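
The tier mix and SLO bands in the table can be sketched as a small assignment function. This is only an illustration: the function name, the uniform draw within each band, and the two example baseline values are assumptions, not taken from the benchmark harness.

```python
import random

# Bands from the setup table: multipliers on each request's own baseline TTFT.
BANDS = {"urgent": (1.0, 1.3), "generous": (3.0, 10.0)}
URGENT_FRACTION = 0.20  # 20% urgent, 80% generous


def assign_slo(baseline_ttft_ms, rng):
    """Pick a tier and a TTFT deadline (ms) for one request.

    Assumption: the multiplier is drawn uniformly from the tier's band.
    """
    tier = "urgent" if rng.random() < URGENT_FRACTION else "generous"
    lo, hi = BANDS[tier]
    multiplier = rng.uniform(lo, hi)
    return tier, multiplier * baseline_ttft_ms


rng = random.Random(0)
# Hypothetical baseline TTFTs (ms) for a short and a long prompt.
for baseline in (120.0, 900.0):
    tier, slo_ms = assign_slo(baseline, rng)
    print(f"baseline={baseline:.0f}ms tier={tier} slo={slo_ms:.0f}ms")
```

Because the deadline scales with the request's own baseline, a long prompt is never held to a short prompt's budget, which is the point of the bullet above.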

Benchmark operational flow: filter ShareGPT prompts to 256–8000 tokens, refit the TTFT predictor for this (model, hardware, TP) once, then run the measurement loop (30s warmup + 300s recorded per trial, five paired trials per (rate, policy) cell).

Filter and refit are one-time setup; the measurement loop then spins through every (rate × policy × trial) combination. The predictor’s coefficients are hardware-specific. The refit step measures them on this exact setup before the runs start, so slack is honest. Skip it and every preempt decision is acting on a biased estimate.
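
As a rough sketch of what a refit could look like, assume the predictor is linear in prompt length, ttft_ms = a * tokens + c, which is what the units of a (ms/tok) and c (ms) in the table suggest. The function names are hypothetical, and the calibration points below are synthetic, generated from the table's coefficients rather than measured.

```python
def fit_linear_ttft(samples):
    """Ordinary least squares for ttft_ms = a * tokens + c.

    samples: list of (prompt_tokens, measured_ttft_ms) pairs.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c


def predict_ttft_ms(tokens, a, c):
    """Predicted TTFT in ms under the assumed linear model."""
    return a * tokens + c


# Synthetic points built from the table's coefficients (not real measurements).
samples = [(t, 0.173 * t + 34.0) for t in (256, 1024, 2048, 4096, 8000)]
a, c = fit_linear_ttft(samples)
print(f"a={a:.3f} ms/tok, c={c:.1f} ms")
print(f"predicted TTFT at 2048 tokens: {predict_ttft_ms(2048, a, c):.1f} ms")
```

On a real setup the samples would come from unloaded prefill runs on the target (model, hardware, TP), so that a and c reflect that hardware rather than someone else's.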

Benchmark structure: the sweep grid of (rate × policy) cells on one side, with one cell zoomed to show its five paired trials. Each trial replays the same arrival stream against control, conservative, and aggressive, with 30s warmup discarded and 300s recorded.

Within a cell, the same arrival stream replays against control, conservative, and aggressive, so any attainment gap is the policy, not the prompts. Five trials per cell average out the residual noise.
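
One way to get that pairing is to generate each trial's arrival stream from a fixed seed and reuse it for all three policies. A minimal sketch, assuming exponential inter-arrival gaps for the Poisson process; `poisson_arrivals` and the commented-out harness call are hypothetical names:

```python
import random


def poisson_arrivals(rate_per_s, duration_s, seed):
    """Arrival timestamps (seconds) for a Poisson process.

    Inter-arrival gaps are exponential with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


POLICIES = ("control", "conservative", "aggressive")
WARMUP_S, RECORDED_S = 30, 300

for trial in range(5):
    # One stream per trial, shared by all three policies (the pairing).
    stream = poisson_arrivals(6.0, WARMUP_S + RECORDED_S, seed=trial)
    recorded = [t for t in stream if t >= WARMUP_S]
    for policy in POLICIES:
        pass  # run_policy(policy, stream) would go here (hypothetical call)
    print(f"trial={trial} arrivals={len(stream)} recorded={len(recorded)}")
```

The same idea applies to prompt selection: if the prompt attached to each arrival is also drawn from the seeded generator, the three policies see identical work and differ only in scheduling.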

Results

Once the mechanism is wired end-to-end, the SLO monitor flags an intent, the workers vote across ranks at every attention boundary, and the preempted request unwinds:

[slo_monitor] PREEMPT INTENT (route=in-batch, step_id=8421):
  waiting req-94  (tokens=2048 slack=-115.3ms  ttft_pred=388.2ms  prio=8.7e-04)
  would displace
  running req-89  (tokens=5120 slack=+412.7ms  ttft_pred=920.4ms  prio=2.4e-03)
  [margin=1.20x, snapshot_age=42.1ms, stubborn=0/4]

[preempt_check] rank=0 layer=layers.18.attn  layer_idx=18/80
  vote=1  target_step=8421  current_step=8421

Each policy fires as often as its rules allow: aggressive fires throughout the sweep; conservative rarely does, since it only preempts when a waiter is much more urgent than all the running requests.

Bar chart of preempt-intent events per 5-trial block. Aggressive fires preempts, more at wider SLO bands; conservative fires essentially none.

Sweeping the request rate from 2 to 10 req/s:

SLO attainment vs arrival rate, two panels (urgent and generous tier). Three lines — control, conservative, aggressive — sit on top of each other from rate 2 through rate 8. At rate 10, all three drop to near zero on the urgent panel, while on the generous panel control collapses to ~0% but conservative and aggressive separate up to ~5–6%.

Below saturation (rates 2–8) the three policies sit within ±2 percentage points of each other on urgent attainment and are identical on generous.

The intuition is a single checkout line with one cashier who takes the same time per customer regardless of order. If everyone has slack on their deadline, letting an urgent customer cut just moves the slack around (nobody was going to miss anyway).

The interesting cell is rate 10. Control collapses (effectively no request meets its SLO). Both FlowPrefill policies pull a few percentage points of rescue:

Policy       | Goodput (req/s) | Urgent attainment (%) | Generous attainment (%)
control      | 0.001           | 0.00                  | 0.02
conservative | 0.323           | 0.82                  | 3.86
aggressive   | 0.443           | 2.24                  | 5.02

That’s roughly 300–450× control’s goodput at the saturation point. Control is FCFS: it serves the oldest request first, which under saturation is also the one whose deadline already passed. By the time it finishes that one, the request behind it is also past deadline, and so on; every request inherits the same fate. FlowPrefill breaks the chain by doing load shedding inside the prefill phase: abandoning work that was going to miss anyway to free the GPU for work that can still meet its deadline.
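
The FCFS chain can be reproduced in a toy single-server queue. This is not FlowPrefill: it sheds doomed requests before they start instead of preempting mid-prefill, it ignores re-prefill cost, and every name and parameter in it is made up for illustration.

```python
def simulate(arrival_gap, service, slo_mult, n, shed):
    """Single server with a fixed service time per request.

    deadline = arrival + slo_mult * service
    shed=False: plain FCFS.
    shed=True:  skip any request that can no longer finish by its deadline.
    Returns the fraction of the n requests that met their deadline.
    """
    now, met = 0.0, 0
    for i in range(n):
        arrival = i * arrival_gap
        deadline = arrival + slo_mult * service
        start = max(now, arrival)
        if shed and start + service > deadline:
            continue  # doomed: drop it without spending server time
        now = start + service
        if now <= deadline:
            met += 1
    return met / n


for gap in (2.0, 0.5):  # below capacity, then 2x overload
    fcfs = simulate(gap, 1.0, 3.0, 1000, shed=False)
    shedding = simulate(gap, 1.0, 3.0, 1000, shed=True)
    print(f"gap={gap}: fcfs={fcfs:.3f} shed={shedding:.3f}")
```

With these toy parameters the two variants are identical below capacity. At 2× overload FCFS meets only the first handful of deadlines, while the shedding variant keeps meeting about half. The direction matches the measurements; the magnitude does not carry over, since real preemption pays costs this model leaves out.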

The rescue lands mostly on the generous tier: those requests have wide enough SLO bands to still be saveable when the queue is deep. Urgent waiters at rate 10 are mostly past saving by the time the monitor sees them, and the policy explicitly refuses to fire a preempt for a doomed waiter (the wasted re-prefill cost would exceed any possible save). So the urgent-tier rescue is real but small (0.82% / 2.24% in aggregate), and the bulk of the lift comes from saving the wider-band generous requests that can still be helped.
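
For reference, here is a sketch of how the three table columns could be computed from per-request records. It assumes goodput means SLO-met requests per recorded second and attainment means the share of a tier's requests that met their own deadline; the `Record` type and `summarize` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    tier: str        # "urgent" or "generous"
    ttft_ms: float   # measured time to first token
    slo_ms: float    # this request's TTFT deadline


def summarize(records, recorded_s):
    """Goodput (SLO-met requests per second) and per-tier attainment (%)."""
    met_total = sum(r.ttft_ms <= r.slo_ms for r in records)
    out = {"goodput_rps": met_total / recorded_s}
    for tier in ("urgent", "generous"):
        tier_recs = [r for r in records if r.tier == tier]
        met = sum(r.ttft_ms <= r.slo_ms for r in tier_recs)
        # NaN when a tier has no requests, to avoid dividing by zero.
        pct = 100.0 * met / len(tier_recs) if tier_recs else float("nan")
        out[f"{tier}_attainment_pct"] = pct
    return out


demo = [
    Record("urgent", 100.0, 120.0),
    Record("urgent", 200.0, 120.0),
    Record("generous", 300.0, 900.0),
    Record("generous", 950.0, 900.0),
]
print(summarize(demo, recorded_s=2.0))
```

Under that definition the table is roughly self-consistent: at 10 req/s with a 20/80 mix, aggressive's 2.24% and 5.02% work out to 0.2 × 10 × 0.0224 + 0.8 × 10 × 0.0502 ≈ 0.446 req/s, close to the reported 0.443.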

What I ruled out

To check whether FlowPrefill could have helped more with different tuning, I swept both the SLO band and the stubbornness threshold, and neither moved the result.

First, the urgent SLO band. If urgent SLOs were generous enough that requests met their deadlines without help, narrowing the band should let the preempting policies pull ahead of control on attainment. At rate 6, I swept the urgent multiplier from 1.0× to 3.0×:

Attainment vs urgent SLO band width at rate 6, two panels for urgent and generous tier. Urgent attainment rises monotonically as the band widens, but the three policies move together across the entire sweep.

Attainment does rise as the band widens (that’s expected), but the three policies move together across the whole sweep, so band tightness isn’t the cap.

Second, the stubbornness threshold. Rule 2’s layer-fraction threshold (default 0.9) blocks preempts past 90% of the forward pass. If that’s where useful preempts live, lowering it should increase goodput attainment. I swept the threshold over {0.0, 0.5, 0.9} at rate 8, and the three policies moved together across all three thresholds, with no increase in goodput attainment (stubbornness isn’t the cap either).
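
Read literally, the default rule is a one-line check on forward-pass progress. The sketch below assumes progress is measured as layer_idx / num_layers and that the comparison is strict; the function name is hypothetical, and the fork may define either detail differently.

```python
def preempt_allowed(layer_idx, num_layers, threshold=0.9):
    """Allow a preempt only while the pass is below the layer-fraction threshold.

    Assumption: progress = layer_idx / num_layers, compared strictly.
    """
    return layer_idx / num_layers < threshold


# Layer 18 of 80, as in the preempt_check log line: 22.5% through the pass.
print(preempt_allowed(18, 80))
# Layer 76 of 80: 95% through, past the default threshold.
print(preempt_allowed(76, 80))
```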

Scope and what’s next

This ran on a fixed setup: Llama 70B on A100 (TP=4 prefill, TP=2 decode), serving ShareGPT prompts with Poisson arrivals, with chunked prefill and prefix caching both off. The same reasoning should apply more broadly, but I haven’t tested:

  • Heavier-tailed arrivals (Poisson bursts are brief, but heavier-tailed traffic could create longer periods of deep queue where FlowPrefill might actually help, even when the overall rate stays below saturation).
  • Chunked prefill on (vLLM’s production default). The current implementation requires it off.
  • Other models or hardware (TTFT coefficients are GPU-specific, though the policy choices aren’t).

The fork is on GitHub at adiu19/vllm.



