vLLM v1 + Mooncake Store Architecture

Focus: pure vLLM v1 scheduler path, non-V2 GPUModelRunner with KVConnectorModelRunnerMixin, and MooncakeStoreConnector. Code pointers are from /home/zhewen/repos/vllm at commit 1c78f76c29.

TL;DR

Think of the KV connector as a two-sided plugin. The scheduler side decides what prefix is already external, which GPU blocks were allocated, and what metadata the worker should see. The worker side receives that metadata in KVConnectorModelRunnerMixin, then returns a KVConnectorOutput containing completed sends/receives. For Mooncake Store specifically, start_load_kv() and wait_for_save() are intentionally no-ops: its real load/store work is issued from worker-side get_finished() to overlap with compute.

1. Mental model

Scheduler owns request truth

The scheduler owns Request states, local KV allocation, preemption, and when a request is safe to run again. A connector may delay frees or mark a request as waiting for remote KV, but the scheduler still owns self.requests and queue transitions.

Pointers: scheduler.py, KVCacheManager.

Connector scheduler plans

MooncakeStoreScheduler does not move bytes. It performs external lookup, tracks per-request save/load intent, and emits MooncakeStoreConnectorMetadata for the worker.

Pointers: get_num_new_matched_tokens, ReqMeta.

Model runner brackets forward

In non-V2 GPU execution, GPUModelRunner inherits KVConnectorModelRunnerMixin. The mixin binds connector metadata before forward and polls get_finished() after forward, even when no model tokens are run.

Pointer: KVConnectorModelRunnerMixin.

Mooncake worker moves bytes

MooncakeStoreWorker owns Mooncake store setup, lookup server, one ChunkedTokenDatabase per KV group, and background send/recv threads that call Mooncake batch put/get APIs.

Pointers: MooncakeStoreWorker, MooncakeDistributedStore.

One-sentence architecture: Scheduler-side connector code produces a per-step I/O plan; the V1 model-runner mixin transports that plan into the worker; the Mooncake Store worker translates the plan into store keys and GPU addresses; the scheduler consumes completion signals on the next update_from_output().

2. Source map

Area | File / lines | Why it matters
Engine setup | vllm/v1/engine/core.py:141-156, 202-211 | Resolves scheduler/hash block sizes, constructs scheduler, initializes KV output aggregation, and creates request block hasher.
Core scheduler | vllm/v1/core/sched/scheduler.py:597-603, 746-751, 887-922, 2127-2154 | Calls connector scheduler hooks, packages SchedulerOutput, then consumes KVConnectorOutput.
V1 model runner | vllm/v1/worker/gpu_model_runner.py:415-417, 4205-4236, 4499-4520 | Non-V2 GPU runner inherits the mixin and wraps model forward with connector lifecycle.
KV mixin | vllm/v1/worker/kv_connector_model_runner_mixin.py:85-119 | Calls worker-side bind_connector_metadata(), start_load_kv(), wait_for_save(), and get_finished().
Mooncake connector facade | .../mooncake/store/connector.py:80-147, 153-194, 222-264 | Splits scheduler role from worker role and forwards abstract connector calls to concrete store components.
Mooncake scheduler | .../mooncake/store/scheduler.py:47-84, 84-127, 163-366, 368-388 | Tracks lookup/load/save intent and builds worker metadata.
Mooncake data model | .../mooncake/store/data.py:26-149, 152-290 | Defines PoolKey, ChunkedTokenDatabase, LoadSpec, RequestTracker, ReqMeta, and connector metadata.
Mooncake worker | .../mooncake/store/worker.py:767-1075, 1090-1186, 1188-1236 | Registers GPU memory, starts threads, issues I/O in get_finished(), and handles lookup.
Coordinator | .../mooncake/store/coordinator.py:55-143, 145-191, 193-284 | Reuses vLLM manager semantics to produce store/load masks and longest external hit length across hybrid KV groups.

3. Engine loop

The engine loop is the backbone. It asks the scheduler for a SchedulerOutput, sends that object to workers, gets a ModelRunnerOutput, then lets the scheduler update request state.

  1. Scheduler: scheduler.schedule() creates SchedulerOutput with connector metadata.
  2. Executor: execute_model(scheduler_output) sends the output to all worker processes.
  3. Worker: GPUModelRunner.execute_model() runs model forward and connector worker hooks.
  4. Scheduler: update_from_output() consumes generated tokens and KVConnectorOutput.

EngineCore.step()
  scheduler_output = scheduler.schedule()
  future = model_executor.execute_model(scheduler_output, non_block=True)
  model_output = future.result() or model_executor.sample_tokens(...)
  scheduler.update_from_output(scheduler_output, model_output)
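
The loop above can be exercised with stand-in objects. The following is a toy model rather than vLLM code: ToyScheduler, ToyExecutor, and the dict-shaped outputs are hypothetical names chosen for illustration, and the batch size of two is arbitrary. It only shows the order in which the four calls hand data to each other.

```python
from concurrent.futures import Future


class ToyScheduler:
    """Hypothetical stand-in for the v1 scheduler (not vLLM code)."""

    def __init__(self, pending):
        self.pending = list(pending)   # request IDs not yet run
        self.finished = []             # request IDs whose output was consumed

    def schedule(self):
        # Package this step's work plus connector metadata for the worker.
        batch = self.pending[:2]
        return {"req_ids": batch, "kv_connector_metadata": {"requests": batch}}

    def update_from_output(self, scheduler_output, model_output):
        # Consume the tokens (and, in the real loop, the connector output).
        for req_id in model_output["tokens"]:
            self.pending.remove(req_id)
            self.finished.append(req_id)


class ToyExecutor:
    """Hypothetical stand-in for the model executor."""

    def execute_model(self, scheduler_output, non_block=True):
        future = Future()
        tokens = {req_id: 0 for req_id in scheduler_output["req_ids"]}
        future.set_result({"tokens": tokens, "kv_connector_output": {}})
        return future


def step(scheduler, executor):
    scheduler_output = scheduler.schedule()
    future = executor.execute_model(scheduler_output, non_block=True)
    model_output = future.result()
    scheduler.update_from_output(scheduler_output, model_output)
    return model_output


scheduler = ToyScheduler(["a", "b", "c"])
executor = ToyExecutor()
while scheduler.pending:
    step(scheduler, executor)
```

The toy scheduler finishes every request after one forward pass; the real scheduler keeps requests alive across many steps and also reads kv_connector_output on each update.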

4. Core scheduler hooks

In pure v1, the core scheduler calls three connector scheduler-side methods during scheduling, and one method when a request finishes. It later calls update_connector_output() when worker output comes back.

Each step below lists the core scheduler call, what MooncakeStoreScheduler does, any output field involved, and what the step means.

A. External lookup
  Core scheduler: connector.get_num_new_matched_tokens(request, local_hit)
  MooncakeStoreScheduler: Calls LookupKeyClient.lookup(); records LoadSpec when external tokens should be loaded.
  Meaning: Decides whether the request has a remote KV prefix and whether loading is async.

B. After allocation
  Core scheduler: connector.update_state_after_alloc(request, blocks, external_tokens)
  MooncakeStoreScheduler: Stores (Request, block_ids) in _unfinished_requests; flips LoadSpec.can_load after GPU blocks exist.
  Meaning: The connector cannot load until destination blocks are allocated.

C. Metadata build
  Core scheduler: connector.build_connector_meta(scheduler_output)
  MooncakeStoreScheduler: Creates RequestTracker and ReqMeta entries for loads, saves, chunked prefills, resumed requests, and pending loads.
  SchedulerOutput: scheduler_output.kv_connector_metadata = meta
  Meaning: This is the scheduler-to-worker contract.

D. Request finished
  Core scheduler: connector.request_finished_all_groups(request, block_ids)
  MooncakeStoreScheduler: Returns delay_free_blocks=True when saved tokens exist and blocks should remain pinned until send completion.
  Meaning: Prevents freeing GPU blocks before async store reads them.

E. Worker output
  Core scheduler: _update_from_kv_xfer_finished(kv_connector_output)
  MooncakeStoreScheduler: connector.update_connector_output() currently aggregates KV events for Mooncake Store.
  Output fields: finished_recving, finished_sending
  Meaning: Recv completion unblocks waiting requests; send completion frees delayed blocks.

Important scheduler distinction: Scheduler-side connector methods plan state. Worker-side connector methods move or poll data. The scheduler never directly calls MooncakeStoreWorker.get_finished().
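
To make hooks A through D concrete, here is a minimal sketch of a planner that only records intent. It is not MooncakeStoreScheduler: ToyLoadSpec, ToyConnectorScheduler, the dict-based metadata, and the simplified signatures (request IDs instead of Request objects, a plain integer return instead of the real hook's return value) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoadSpec:
    local_tokens: int
    external_tokens: int
    can_load: bool = False  # flipped once destination blocks exist


@dataclass
class ToyConnectorScheduler:
    """Hypothetical planner; it records intent and never moves bytes."""

    external_hits: dict                       # req_id -> tokens found in store
    load_specs: dict = field(default_factory=dict)
    blocks: dict = field(default_factory=dict)
    saved_tokens: dict = field(default_factory=dict)

    def get_num_new_matched_tokens(self, req_id, local_hit):
        # Hook A: how many tokens beyond the local hit exist externally?
        hit = self.external_hits.get(req_id, 0)
        if hit <= local_hit:
            return 0
        self.load_specs[req_id] = ToyLoadSpec(local_hit, hit)
        return hit - local_hit

    def update_state_after_alloc(self, req_id, block_ids, external_tokens):
        # Hook B: a load becomes legal only after blocks are allocated.
        self.blocks[req_id] = list(block_ids)
        if external_tokens > 0 and req_id in self.load_specs:
            self.load_specs[req_id].can_load = True

    def build_connector_meta(self, scheduled_req_ids):
        # Hook C: hand each planned load to the worker exactly once.
        meta = []
        for req_id in scheduled_req_ids:
            spec = self.load_specs.pop(req_id, None)
            meta.append({"req_id": req_id,
                         "block_ids": self.blocks[req_id],
                         "load_spec": spec})
        return meta

    def request_finished(self, req_id):
        # Hook D: True means "keep blocks pinned until the store read them".
        return self.saved_tokens.get(req_id, 0) > 0


planner = ToyConnectorScheduler(external_hits={"r1": 64},
                                saved_tokens={"r1": 64})
new_tokens = planner.get_num_new_matched_tokens("r1", local_hit=16)
planner.update_state_after_alloc("r1", [3, 4, 5, 6], new_tokens)
meta = planner.build_connector_meta(["r1"])
delay_free = planner.request_finished("r1")
```

Note that nothing in the sketch touches a store or a GPU buffer; that separation is the point of the scheduler/worker split described above.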

5. V1 model runner mixin

If use_v2_model_runner is false, gpu_worker.py instantiates vllm.v1.worker.gpu_model_runner.GPUModelRunner. That class inherits KVConnectorModelRunnerMixin. The mixin is the bridge between SchedulerOutput and worker-side connector logic.

Normal forward path

  1. GPUModelRunner enters maybe_get_kv_connector_output().
  2. Mixin binds scheduler_output.kv_connector_metadata.
  3. Mixin calls start_load_kv(get_forward_context()).
  4. Model forward runs inside the context.
  5. On exit, mixin calls wait_for_save() and get_finished(...).
  6. Mixin attaches completed send/recv sets to KVConnectorOutput.

No-forward path

Even when total_num_scheduled_tokens == 0, the V1 runner calls kv_connector_no_forward() if a KV connector exists. That still binds metadata and calls get_finished(), which is necessary for background Mooncake loads/saves to make progress.

Pointer: gpu_model_runner.py:4004-4021 and kv_connector_model_runner_mixin.py:38-55.

KVConnectorModelRunnerMixin._get_kv_connector_output(...)
  output = KVConnectorOutput()
  kv_connector = get_kv_transfer_group()
  kv_connector.bind_connector_metadata(scheduler_output.kv_connector_metadata)
  kv_connector.start_load_kv(get_forward_context())
  try:
      yield output                 # model forward happens here
  finally:
      kv_connector.wait_for_save()
      # Parenthesized so the assignment can continue on the next line.
      output.finished_sending, output.finished_recving = (
          kv_connector.get_finished(scheduler_output.finished_req_ids)
      )
      output.kv_cache_events = kv_connector.get_kv_connector_kv_cache_events()
      kv_connector.clear_connector_metadata()

Mooncake Store twist: The mixin still calls start_load_kv() and wait_for_save(), but Mooncake Store implements both as no-ops. Its loads and stores are issued from get_finished() instead. This is why understanding Mooncake Store requires reading the worker get_finished() method, not just the generic connector lifecycle.
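
The bracket-around-forward pattern can be reproduced with a small context manager. This is a sketch under stated assumptions: ToyStoreConnector and kv_connector_output are hypothetical names, start_load_kv() takes no forward context here, and get_finished() simply echoes the finished request IDs as "sent" so the data flow is visible.

```python
from contextlib import contextmanager


class ToyStoreConnector:
    """Hypothetical worker-side connector with the Mooncake Store shape."""

    def __init__(self):
        self.calls = []
        self.metadata = None

    def bind_connector_metadata(self, metadata):
        self.calls.append("bind")
        self.metadata = metadata

    def start_load_kv(self):
        self.calls.append("start_load_kv")   # no-op: nothing is issued here

    def wait_for_save(self):
        self.calls.append("wait_for_save")   # no-op: nothing is awaited here

    def get_finished(self, finished_req_ids):
        self.calls.append("get_finished")    # real I/O would be issued here
        return set(finished_req_ids), set()

    def clear_connector_metadata(self):
        self.calls.append("clear")
        self.metadata = None


@contextmanager
def kv_connector_output(connector, metadata, finished_req_ids):
    output = {"finished_sending": None, "finished_recving": None}
    connector.bind_connector_metadata(metadata)
    connector.start_load_kv()
    try:
        yield output                          # model forward runs in the body
    finally:
        connector.wait_for_save()
        output["finished_sending"], output["finished_recving"] = (
            connector.get_finished(finished_req_ids)
        )
        connector.clear_connector_metadata()


connector = ToyStoreConnector()
with kv_connector_output(connector, {"requests": []}, {"r1"}) as out:
    connector.calls.append("forward")
```

Because the polling sits in the finally clause, it runs even if the body does nothing, which matches the no-forward path described earlier in this section.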

6. Mooncake Store pieces

Connector facade

MooncakeStoreConnector subclasses KVConnectorBase_V1 and SupportsHMA. Depending on role, it creates either MooncakeStoreScheduler or MooncakeStoreWorker.

Pointers: role=SCHEDULER, role=WORKER.

Scheduler component

MooncakeStoreScheduler owns load_specs, _request_trackers, _unfinished_requests, and _preempted_req_ids. It packages ReqMeta records into metadata.

Pointers: LoadSpec, RequestTracker, ReqMeta.

Worker component

MooncakeStoreWorker sets up the Mooncake store, registers KV cache GPU buffers, starts a lookup server on rank 0, creates token DBs, and starts send/recv transfer threads.

Pointers: batch_is_exist, batch_get, batch_put.

Data objects

Object | Side | Role
LoadSpec | Scheduler -> worker | Records local cached tokens, external cached tokens, and whether a load is now legal.
RequestTracker | Scheduler internal | Tracks token length, allocated blocks, saved-token watermark, token IDs, and prefill boundary.
ReqMeta | Metadata payload | Per-request command: token length to save/load, block IDs, block hashes, load spec, save flag, CUDA event slot.
PoolKey | Worker/store | String key containing model, TP/PCP/DCP/PP ranks, group ID, and chunk hash.
ChunkedTokenDatabase | Worker/store | Maps token chunks to PoolKey and maps block IDs to GPU addresses/sizes.
MooncakeStoreCoordinator | Worker lookup/load/store | Computes longest external hit and per-group masks using vLLM KV cache manager semantics.
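
The key and address roles of PoolKey and ChunkedTokenDatabase can be pictured with a small sketch. Everything here is an assumption for illustration: ToyPoolKey, its "@"-separated string layout, the reduced field set (TP and PP ranks only), and the single contiguous buffer in block_address() are not taken from the real data.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPoolKey:
    """Hypothetical key layout; the real PoolKey format may differ."""

    model: str
    tp_rank: int
    pp_rank: int
    group_id: int
    chunk_hash: str

    def to_string(self) -> str:
        return (f"{self.model}@tp{self.tp_rank}@pp{self.pp_rank}"
                f"@g{self.group_id}@{self.chunk_hash}")


def chunk_keys(model, tp_rank, pp_rank, group_id, chunk_hashes):
    """One store key per chunk hash, in prefix order."""
    return [ToyPoolKey(model, tp_rank, pp_rank, group_id, h).to_string()
            for h in chunk_hashes]


def block_address(base_addr, block_id, block_bytes):
    """Byte address of a block inside one contiguous registered buffer."""
    return base_addr + block_id * block_bytes


keys = chunk_keys("toy-model", 0, 0, 1, ["aa", "bb"])
addr = block_address(base_addr=4096, block_id=3, block_bytes=1024)
```

Including rank and group ID in the key is what lets each worker and each KV group store its own slice of the same token chunk without collisions.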

7. External prefix lookup

Lookup begins in the core scheduler, but it is serviced by a worker-local ZMQ server on rank 0. The scheduler asks, "how many prefix tokens already exist in Mooncake Store for this request's block hashes?" The worker answers with a scheduler-safe prefix length.

Each step below names the component that acts: core scheduler, MooncakeStoreScheduler, lookup server / worker, or Mooncake store.

  1. Core scheduler: get_num_new_matched_tokens().
     MooncakeStoreScheduler: Rounds lookup length to scheduler block boundary when partial chunks are discarded.
  2. MooncakeStoreScheduler: LookupKeyClient.lookup(token_len, block_hashes).
     Lookup server / worker: ZMQ LookupKeyServer receives token length and hashes.
  3. Lookup server / worker: MooncakeStoreWorker.lookup() expands candidate keys across TP and PP ranks.
     Mooncake store: batch_is_exist(candidate_keys).
  4. Lookup server / worker: MooncakeStoreCoordinator.find_longest_cache_hit() converges HMA groups and masks.
  5. Core scheduler: Uses returned token count as external cached tokens.
     MooncakeStoreScheduler: Stores LoadSpec if remote load is needed.

Why lookup is worker-backed: The scheduler process does not own Mooncake store handles or the registered GPU address layout. The rank-0 worker has the store instance and can translate request hashes into concrete Mooncake keys.
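
The "scheduler-safe prefix length" idea reduces to a contiguous-prefix scan followed by rounding down. The function below is a sketch, not the coordinator's logic: longest_external_hit and its parameters are hypothetical, it handles a single KV group, and the exists list stands in for the result of a batched existence query.

```python
def longest_external_hit(exists, chunk_tokens, scheduler_block_size,
                         max_tokens):
    """Return a prefix length (in tokens) that is safe to report as cached.

    exists: per-chunk existence flags in prefix order.
    """
    hit_chunks = 0
    for present in exists:
        if not present:
            break                      # a prefix hit must be contiguous
        hit_chunks += 1
    hit_tokens = min(hit_chunks * chunk_tokens, max_tokens)
    # Round down so the scheduler only sees whole scheduler blocks.
    return (hit_tokens // scheduler_block_size) * scheduler_block_size


partial = longest_external_hit([True, True, True, False, True],
                               chunk_tokens=16, scheduler_block_size=32,
                               max_tokens=1000)
capped = longest_external_hit([True, True, True, True],
                              chunk_tokens=16, scheduler_block_size=32,
                              max_tokens=40)
```

In the first call three chunks (48 tokens) are present before the gap, but only 32 tokens are reported because 48 is not a multiple of the 32-token scheduler block. The chunk after the gap is ignored even though it exists.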

8. Scheduler-to-worker metadata

MooncakeStoreConnectorMetadata is not just a list of current scheduled requests. It also carries unfinished request IDs and preemptions, and each ReqMeta tells the worker whether to load, save, both, or only poll completion.

MooncakeStoreConnectorMetadata
  unfinished_request_ids: set[str]
  preempted_req_ids: set[str]
  requests: list[ReqMeta]

ReqMeta
  req_id
  token_len_chunk
  block_ids: tuple[list[int], ...]      # one list per KV cache group
  block_hashes
  can_save
  load_spec
  is_last_chunk
  current_event
  token_ids
  original_block_size

Fresh or resumed prefill

The scheduler creates a RequestTracker from the request's prefill token range and allocated block IDs. It builds ReqMeta with optional LoadSpec, block hashes, and a save watermark.

Decode / chunked continuation

The scheduler extends the existing tracker with new block IDs, advances the token length, and emits more save metadata only while the request is still inside the tracked prefill range.
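
A save watermark that stops at the prefill boundary can be sketched as follows. ToyTracker and build_req_meta are hypothetical, the chunk alignment rule is an assumption, and the emitted dict borrows only a few ReqMeta field names from the listing above; the real RequestTracker carries more state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Hypothetical per-request tracker (not the real RequestTracker)."""

    req_id: str
    prefill_len: int                 # tokens that are worth saving
    token_len: int = 0               # tokens scheduled so far
    num_saved_tokens: int = 0        # save watermark
    block_ids: list = field(default_factory=list)

    def update(self, new_tokens, new_block_ids):
        self.token_len += new_tokens
        self.block_ids.extend(new_block_ids)


def build_req_meta(tracker, chunk_tokens):
    """Return a save command, or None when there is nothing new to save."""
    saveable = min(tracker.token_len, tracker.prefill_len)
    aligned = (saveable // chunk_tokens) * chunk_tokens   # whole chunks only
    if aligned <= tracker.num_saved_tokens:
        return None
    tracker.num_saved_tokens = aligned
    return {
        "req_id": tracker.req_id,
        "token_len_chunk": aligned,
        "block_ids": list(tracker.block_ids),
        "can_save": True,
        "is_last_chunk": tracker.token_len >= tracker.prefill_len,
    }


tracker = ToyTracker("r1", prefill_len=48)
tracker.update(32, [0, 1])           # first prefill chunk
first = build_req_meta(tracker, chunk_tokens=16)
tracker.update(16, [2])              # second prefill chunk
second = build_req_meta(tracker, chunk_tokens=16)
tracker.update(1, [])                # one decode token
third = build_req_meta(tracker, chunk_tokens=16)
```

The decode step produces no save command because the watermark already covers the whole tracked prefill range.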

9. Worker I/O path

Mooncake Store intentionally centralizes I/O issue and completion polling in worker-side get_finished(). The method does three things in one pass: enqueue loads, enqueue stores, and poll completions from prior work.

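Each phase below names the component that acts: mixin / connector facade, MooncakeStoreWorker, recv thread, or send thread.

bind
  Mixin / connector facade: bind_connector_metadata(meta). Metadata is stored on connector base.
start
  Mixin / connector facade: start_load_kv(...). No-op for Mooncake Store.
forward
  Model forward executes.
finish
  Mixin / connector facade: get_finished(finished_req_ids).
  MooncakeStoreWorker: Reads metadata; enqueues load and store jobs.
  Recv thread: add_request(load ReqMeta).
  Send thread: add_request(save ReqMeta).
I/O
  Recv thread: batch_get_into_multi_buffers().
  Send thread: batch_put_from_multi_buffers().
poll
  Mixin / connector facade: Returns KVConnectorOutput.
  MooncakeStoreWorker: Collects done sets.
  Recv thread: done_recving.
  Send thread: done_sending.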

Load path

  1. get_finished() finds each request with load_spec.can_load.
  2. It sets load_spec.token_len to the external cached length.
  3. It enqueues the request on KVCacheStoreRecvingThread.
  4. The recv thread uses coord.load_mask() to skip chunks the local KV spec would not populate.
  5. It turns each chunk into Mooncake keys and GPU address/size lists.
  6. It calls batch_get_into_multi_buffers() and marks the request finished.

Store path

  1. get_finished() records a CUDA event when any request can save.
  2. Each save request receives that event and is enqueued on KVCacheStoreSendingThread.
  3. The send thread aligns token length to coord.lcm_block_size.
  4. It uses coord.store_mask() to avoid storing chunks that no future hit would need.
  5. It deduplicates with batch_is_exist().
  6. It synchronizes the CUDA event before reading GPU memory.
  7. It calls batch_put_from_multi_buffers() and decrements outstanding store jobs.
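
The enqueue-then-poll shape of get_finished() can be sketched with two queue-driven threads. This is a toy under stated assumptions: ToyTransferThread and ToyStoreWorker are hypothetical, the lambdas stand in for the batch get/put calls, and CUDA events, masks, and deduplication are left out.

```python
import queue
import threading


class ToyTransferThread(threading.Thread):
    """Hypothetical background thread that drains a job queue."""

    def __init__(self, transfer_fn):
        super().__init__(daemon=True)
        self.jobs = queue.Queue()
        self.transfer_fn = transfer_fn
        self.done = set()
        self.lock = threading.Lock()

    def add_request(self, req_id):
        self.jobs.put(req_id)

    def run(self):
        while True:
            req_id = self.jobs.get()
            self.transfer_fn(req_id)     # stands in for a batch get/put call
            with self.lock:
                self.done.add(req_id)
            self.jobs.task_done()

    def take_done(self):
        # Return and reset the completed set so each ID is reported once.
        with self.lock:
            done, self.done = self.done, set()
        return done


class ToyStoreWorker:
    def __init__(self):
        self.log = []
        self.recv = ToyTransferThread(lambda r: self.log.append(("get", r)))
        self.send = ToyTransferThread(lambda r: self.log.append(("put", r)))
        self.recv.start()
        self.send.start()

    def get_finished(self, metadata):
        for req in metadata:             # 1) enqueue work from this step
            if req.get("can_load"):
                self.recv.add_request(req["req_id"])
            if req.get("can_save"):
                self.send.add_request(req["req_id"])
        # 2) poll completions of work issued now or in earlier steps
        return self.send.take_done(), self.recv.take_done()


worker = ToyStoreWorker()
first = worker.get_finished([{"req_id": "r1", "can_load": True},
                             {"req_id": "r2", "can_save": True}])
worker.recv.jobs.join()                  # demo only: wait for the threads
worker.send.jobs.join()
second = worker.get_finished([])
done_sending = first[0] | second[0]
done_recving = first[1] | second[1]
```

A completion may surface in the same call that enqueued the job or in a later one, which is why the demo merges both results and why the runner must keep calling get_finished() even on steps with no forward pass.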

10. Completion feedback

Completion has two meanings: finished_recving means remote KV is now loaded into local GPU blocks, while finished_sending means blocks whose free was delayed for store can now be released.

  1. Worker: get_finished() returns (done_sending, done_recving).
  2. Model output: ModelRunnerOutput.kv_connector_output carries the two sets to the scheduler.
  3. Scheduler: _update_from_kv_xfer_finished() handles the send/recv sets.
  4. Queues: recv completion moves the waiting request back toward scheduling; send completion frees delayed blocks.

Worker signal | Scheduler action | Code pointer
finished_recving | If request is WAITING_FOR_REMOTE_KVS, add request ID to finished_recving_kv_req_ids. A later scheduler pass promotes it and caches blocks. | scheduler.py:2142-2148, 2060-2109
finished_sending | Free blocks and remove the request row, because the connector has completed any delayed save/send responsibility. | scheduler.py:2151-2154
kv_cache_events | Mooncake connector facade aggregates worker KV events and exposes them through take_events(). | connector.py:196-216
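
How the two signals change scheduler state can be sketched with plain dictionaries. ToyCompletionScheduler, its status strings other than WAITING_FOR_REMOTE_KVS, and promote_ready_requests() are hypothetical; the real scheduler works on Request objects and its own queues.

```python
class ToyCompletionScheduler:
    """Hypothetical scheduler state reacting to the two completion sets."""

    def __init__(self):
        self.status = {}                       # req_id -> status string
        self.finished_recving_kv_req_ids = set()
        self.delayed_free = {}                 # req_id -> pinned block IDs
        self.free_blocks = []

    def update_from_kv_xfer_finished(self, finished_recving, finished_sending):
        for req_id in finished_recving:
            # Only requests parked on a remote load are recorded.
            if self.status.get(req_id) == "WAITING_FOR_REMOTE_KVS":
                self.finished_recving_kv_req_ids.add(req_id)
        for req_id in finished_sending:
            # The store has read the blocks, so they can be released.
            self.free_blocks.extend(self.delayed_free.pop(req_id, []))
            self.status.pop(req_id, None)

    def promote_ready_requests(self):
        # A later scheduling pass makes loaded requests schedulable again.
        for req_id in sorted(self.finished_recving_kv_req_ids):
            self.status[req_id] = "WAITING"
        self.finished_recving_kv_req_ids.clear()


sched = ToyCompletionScheduler()
sched.status = {"load-1": "WAITING_FOR_REMOTE_KVS", "save-1": "FINISHED"}
sched.delayed_free = {"save-1": [7, 8]}
sched.update_from_kv_xfer_finished({"load-1"}, {"save-1"})
sched.promote_ready_requests()
```

The recv signal is recorded first and acted on in a later pass, while the send signal releases blocks immediately; the sketch keeps that asymmetry.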

11. HMA and block-size frames

Mooncake Store with HMA is easiest to reason about if you keep three token frames separate. Mixing these frames is the usual source of wrong conclusions.

Frame | Meaning | Where used
native group block size | Each KV cache group's own physical/spec block size. | ChunkedTokenDatabase per group, coordinator masks, address mapping.
hash_block_size | Granularity of Request.block_hashes. | Request block hasher, BlockHashListWithBlockSize, key generation.
scheduler_block_size | Safe prefix length alignment visible to the scheduler. | Scheduler num_computed_tokens, Mooncake hit length, HMA convergence.

Resolver rule: A single group returns cache_config.block_size * dcp * pcp for both scheduler and hash block sizes. Multiple groups use the LCM of group block sizes for scheduler alignment, and either an explicit hash_block_size or the GCD of group block sizes for hash granularity.
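
The resolver rule is small enough to write out. The function below is a sketch of the rule as stated, not the engine's resolver: resolve_block_sizes and its parameter names are hypothetical, and the multi-argument math.lcm and math.gcd calls need Python 3.9 or newer.

```python
import math


def resolve_block_sizes(group_block_sizes, block_size, dcp=1, pcp=1,
                        hash_block_size=None):
    """Return (scheduler_block_size, hash_block_size) for the KV groups."""
    if len(group_block_sizes) == 1:
        # Single group: one size serves both purposes.
        size = block_size * dcp * pcp
        return size, size
    # Multiple groups: align the scheduler to a length every group accepts.
    scheduler_size = math.lcm(*group_block_sizes)
    if hash_block_size is None:
        # Hash at a granularity that divides every group's block size.
        hash_block_size = math.gcd(*group_block_sizes)
    return scheduler_size, hash_block_size


single = resolve_block_sizes([16], block_size=16, dcp=2)
hybrid = resolve_block_sizes([16, 48], block_size=16)
explicit = resolve_block_sizes([16, 24], block_size=16, hash_block_size=8)
```

For group sizes 16 and 48 the scheduler aligns to 48 tokens while hashes stay at 16 tokens, so one scheduler block spans three hash blocks. Keeping those two numbers apart is the reason for the frame table above.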

The coordinator is the HMA bridge. It mirrors vLLM cache-manager hit logic over an ExternalCachedBlockPool, not over local allocated blocks. Lookup feeds it an existence set from Mooncake keys. Load and store paths ask it for masks so each group only transfers chunks that the corresponding KV spec would actually use.

MooncakeStoreCoordinator
  find_longest_cache_hit(block_hashes, max_length, ExternalCachedBlockPool)
    -> (load_mask_per_group, hit_length)

  load_mask(block_hashes, token_len)
    -> which chunks to load locally

  store_mask(aligned_token_len)
    -> which chunks are worth storing for future hits

12. Suggested read order

  1. Start with vllm/v1/engine/core.py:141-156 and 202-211 to see how scheduler/hash block sizes and request hashing are wired into v1.
  2. Read the scheduler connector calls in vllm/v1/core/sched/scheduler.py:597-603, 746-751, 887-922, and 2127-2154.
  3. Read vllm/v1/worker/kv_connector_model_runner_mixin.py:85-119 to understand exactly when worker-side connector methods are invoked.
  4. Read mooncake/store/connector.py:80-147 to see how one connector class splits into scheduler and worker role objects.
  5. Read mooncake/store/scheduler.py:84-127 for lookup and 163-366 for metadata construction.
  6. Read mooncake/store/data.py:70-149 for key/address translation, then 195-290 for metadata payload shape.
  7. Read mooncake/store/worker.py:1090-1186 first, then jump back to 393-623 and 626-759 for send/recv thread internals.
  8. Finish with mooncake/store/coordinator.py:55-284; it explains why HMA lookup/load/store are not just flat block-list operations.

Practical debugging tip: When a request appears stuck, ask which boundary failed: scheduler never emitted load metadata, worker never enqueued the recv, recv completed but finished_recving did not reach scheduler, or scheduler received it but did not promote WAITING_FOR_REMOTE_KVS.
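
The checklist above can be written as an ordered walk over the four boundaries. This helper is hypothetical and not part of vLLM; the four boolean flags are assumed to come from whatever logs or breakpoints you have at each boundary.

```python
def first_failed_boundary(meta_emitted, recv_enqueued, recv_reported,
                          promoted):
    """Name the first boundary that did not fire for a stuck request."""
    checks = [
        (meta_emitted, "scheduler never emitted load metadata"),
        (recv_enqueued, "worker never enqueued the recv"),
        (recv_reported, "finished_recving did not reach the scheduler"),
        (promoted, "scheduler did not promote WAITING_FOR_REMOTE_KVS"),
    ]
    # Boundaries are checked in data-flow order; the first miss is the lead.
    for observed, failure in checks:
        if not observed:
            return failure
    return "all boundaries fired"


diagnosis = first_failed_boundary(meta_emitted=True, recv_enqueued=True,
                                  recv_reported=False, promoted=False)
```

Checking in data-flow order matters: a later boundary cannot fire if an earlier one did not, so only the first miss is informative.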