Focus: pure vLLM v1 scheduler path, non-V2 GPUModelRunner with
KVConnectorModelRunnerMixin, and MooncakeStoreConnector.
Code pointers are from /home/zhewen/repos/vllm at commit
1c78f76c29.
Think of the KV connector as a two-sided plugin. The scheduler side decides
what prefix is already external, which GPU blocks were allocated, and what
metadata the worker should see. The worker side receives that metadata in
KVConnectorModelRunnerMixin, then returns a
KVConnectorOutput containing completed sends/receives. For
Mooncake Store specifically, start_load_kv() and
wait_for_save() are intentionally no-ops: its real load/store
work is issued from worker-side get_finished() to overlap with
compute.
The scheduler owns Request states, local KV allocation,
preemption, and when a request is safe to run again. A connector may
delay frees or mark a request as waiting for remote KV, but the scheduler
still owns self.requests and queue transitions.
MooncakeStoreScheduler does not move bytes. It performs
external lookup, tracks per-request save/load intent, and emits
MooncakeStoreConnectorMetadata for the worker.
In non-V2 GPU execution, GPUModelRunner inherits
KVConnectorModelRunnerMixin. The mixin binds connector
metadata before forward and polls get_finished() after
forward, even when no model tokens are run.
MooncakeStoreWorker owns Mooncake store setup, lookup
server, one ChunkedTokenDatabase per KV group, and background
send/recv threads that call Mooncake batch put/get APIs.
| Area | File / lines | Why it matters |
|---|---|---|
| Engine setup | vllm/v1/engine/core.py:141-156, 202-211 | Resolves scheduler/hash block sizes, constructs scheduler, initializes KV output aggregation, and creates request block hasher. |
| Core scheduler | vllm/v1/core/sched/scheduler.py:597-603, 746-751, 887-922, 2127-2154 | Calls connector scheduler hooks, packages SchedulerOutput, then consumes KVConnectorOutput. |
| V1 model runner | vllm/v1/worker/gpu_model_runner.py:415-417, 4205-4236, 4499-4520 | Non-V2 GPU runner inherits the mixin and wraps model forward with connector lifecycle. |
| KV mixin | vllm/v1/worker/kv_connector_model_runner_mixin.py:85-119 | Calls worker-side bind_connector_metadata(), start_load_kv(), wait_for_save(), and get_finished(). |
| Mooncake connector facade | .../mooncake/store/connector.py:80-147, 153-194, 222-264 | Splits scheduler role from worker role and forwards abstract connector calls to concrete store components. |
| Mooncake scheduler | .../mooncake/store/scheduler.py:47-84, 84-127, 163-366, 368-388 | Tracks lookup/load/save intent and builds worker metadata. |
| Mooncake data model | .../mooncake/store/data.py:26-149, 152-290 | Defines PoolKey, ChunkedTokenDatabase, LoadSpec, RequestTracker, ReqMeta, and connector metadata. |
| Mooncake worker | .../mooncake/store/worker.py:767-1075, 1090-1186, 1188-1236 | Registers GPU memory, starts threads, issues I/O in get_finished(), and handles lookup. |
| Coordinator | .../mooncake/store/coordinator.py:55-143, 145-191, 193-284 | Reuses vLLM manager semantics to produce store/load masks and longest external hit length across hybrid KV groups. |
The engine loop is the backbone. It asks the scheduler for a
SchedulerOutput, sends that object to workers, gets a
ModelRunnerOutput, then lets the scheduler update request state.
scheduler.schedule(): SchedulerOutput with connector metadata.
execute_model(scheduler_output): GPUModelRunner.execute_model().
update_from_output(): KVConnectorOutput.
EngineCore.step()
    scheduler_output = scheduler.schedule()
    future = model_executor.execute_model(scheduler_output, non_block=True)
    model_output = future.result() or model_executor.sample_tokens(...)
    scheduler.update_from_output(scheduler_output, model_output)
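The loop above can be mimicked with stand-in objects to see how connector metadata rides along on the scheduler output. This is only a toy sketch: every `Toy*` class and the `step()` helper are hypothetical names invented here, not vLLM classes, and the "model" just returns the length of each request ID as its sampled token.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchedulerOutput:
    # Stand-in for SchedulerOutput; carries only what this sketch needs.
    scheduled_req_ids: list
    kv_connector_metadata: dict = field(default_factory=dict)


@dataclass
class ToyModelRunnerOutput:
    # Stand-in for ModelRunnerOutput with an embedded connector result.
    sampled: dict
    kv_connector_output: dict = field(default_factory=dict)


class ToyScheduler:
    def __init__(self, req_ids):
        self.waiting = list(req_ids)
        self.tokens = {r: [] for r in req_ids}

    def schedule(self):
        # Scheduler-side connector hooks would run here and attach metadata.
        meta = {"requests": list(self.waiting)}
        return ToySchedulerOutput(list(self.waiting), meta)

    def update_from_output(self, sched_out, model_out):
        # Consume sampled tokens; connector output would be consumed here too.
        for req_id, tok in model_out.sampled.items():
            self.tokens[req_id].append(tok)


class ToyExecutor:
    def execute_model(self, sched_out):
        # Worker side: read metadata, run forward, report connector output.
        sampled = {r: len(r) for r in sched_out.scheduled_req_ids}
        return ToyModelRunnerOutput(sampled)


def step(scheduler, executor):
    sched_out = scheduler.schedule()
    model_out = executor.execute_model(sched_out)
    scheduler.update_from_output(sched_out, model_out)
    return sched_out, model_out
```

Calling `step(ToyScheduler(["a", "bb"]), ToyExecutor())` runs one iteration; the real loop additionally handles non-blocking execution and sampling, which this sketch leaves out.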
In pure v1, the core scheduler calls three connector scheduler-side methods
during scheduling, and one method when a request finishes. It later calls
update_connector_output() when worker output comes back.
connector.get_num_new_matched_tokens(request, local_hit)
    Calls LookupKeyClient.lookup(); records LoadSpec when external tokens should be loaded.
connector.update_state_after_alloc(request, blocks, external_tokens)
    Records (Request, block_ids) in _unfinished_requests; flips LoadSpec.can_load after GPU blocks exist.
connector.build_connector_meta(scheduler_output)
    Builds RequestTracker and ReqMeta entries for loads, saves, chunked prefills, resumed requests, and pending loads. The result is attached as scheduler_output.kv_connector_metadata = meta.
connector.request_finished_all_groups(request, block_ids)
    Returns delay_free_blocks=True when saved tokens exist and blocks should remain pinned until send completion.
_update_from_kv_xfer_finished(kv_connector_output)
    connector.update_connector_output() currently aggregates KV events for Mooncake Store. finished_recving and finished_sending come from MooncakeStoreWorker.get_finished().
If use_v2_model_runner is false, gpu_worker.py
instantiates vllm.v1.worker.gpu_model_runner.GPUModelRunner.
That class inherits KVConnectorModelRunnerMixin. The mixin is the
bridge between SchedulerOutput and worker-side connector logic.
1. GPUModelRunner enters maybe_get_kv_connector_output().
2. The mixin binds scheduler_output.kv_connector_metadata.
3. It calls start_load_kv(get_forward_context()).
4. The model forward runs.
5. It calls wait_for_save() and get_finished(...).
6. It returns KVConnectorOutput.
Even when total_num_scheduled_tokens == 0, the V1 runner
calls kv_connector_no_forward() if a KV connector exists.
That still binds metadata and calls get_finished(), which is
necessary for background Mooncake loads/saves to make progress.
Pointer: gpu_model_runner.py:4004-4021 and
kv_connector_model_runner_mixin.py:38-55.
KVConnectorModelRunnerMixin._get_kv_connector_output(...)
    output = KVConnectorOutput()
    kv_connector = get_kv_transfer_group()
    kv_connector.bind_connector_metadata(scheduler_output.kv_connector_metadata)
    kv_connector.start_load_kv(get_forward_context())
    try:
        yield output  # model forward happens here
    finally:
        kv_connector.wait_for_save()
        output.finished_sending, output.finished_recving = (
            kv_connector.get_finished(scheduler_output.finished_req_ids))
        output.kv_cache_events = kv_connector.get_kv_connector_kv_cache_events()
        kv_connector.clear_connector_metadata()
The generic lifecycle calls start_load_kv() and
wait_for_save(), but Mooncake Store implements both as no-ops.
Its loads and stores are issued from get_finished() instead.
This is why understanding Mooncake Store requires reading the worker
get_finished() method, not just the generic connector lifecycle.
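A small runnable sketch can make the call order concrete. It assumes a hypothetical `ToyStoreConnector` whose `start_load_kv()` and `wait_for_save()` do nothing, and a hypothetical `connector_lifecycle()` context manager that mirrors the shape of the mixin's lifecycle; neither name exists in vLLM, and the connector here simply reports every bound request as received.

```python
from contextlib import contextmanager


class ToyStoreConnector:
    """Hypothetical connector whose transfer hooks are no-ops."""

    def __init__(self):
        self.calls = []
        self.metadata = None

    def bind_connector_metadata(self, meta):
        self.calls.append("bind")
        self.metadata = meta

    def start_load_kv(self):
        self.calls.append("start_load_kv")  # intentionally does nothing else

    def wait_for_save(self):
        self.calls.append("wait_for_save")  # intentionally does nothing else

    def get_finished(self, finished_req_ids):
        # In this sketch, all I/O issue and completion polling would live here.
        self.calls.append("get_finished")
        return set(), set(self.metadata or ())

    def clear_connector_metadata(self):
        self.calls.append("clear")
        self.metadata = None


@contextmanager
def connector_lifecycle(connector, metadata, finished_req_ids):
    output = {}
    connector.bind_connector_metadata(metadata)
    connector.start_load_kv()
    try:
        yield output  # the model forward runs inside the with-block
    finally:
        connector.wait_for_save()
        sending, recving = connector.get_finished(finished_req_ids)
        output["finished_sending"] = sending
        output["finished_recving"] = recving
        connector.clear_connector_metadata()


connector = ToyStoreConnector()
with connector_lifecycle(connector, ["req-1"], set()) as out:
    connector.calls.append("forward")
```

After the `with` block, `connector.calls` lists the hooks in the order they ran, and `out` holds the two completion sets. The point of the sketch is that `get_finished()` runs on every pass, so a connector can do all of its real work there.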
MooncakeStoreConnector subclasses
KVConnectorBase_V1 and SupportsHMA. Depending on
role, it creates either MooncakeStoreScheduler or
MooncakeStoreWorker.
MooncakeStoreScheduler owns load_specs,
_request_trackers, _unfinished_requests, and
_preempted_req_ids. It packages ReqMeta records
into metadata.
MooncakeStoreWorker sets up the Mooncake store, registers KV
cache GPU buffers, starts a lookup server on rank 0, creates token DBs,
and starts send/recv transfer threads.
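The role split can be pictured as a thin facade that owns exactly one of two helper objects. The sketch below is an illustration under assumptions: `Role`, `ToyStoreScheduler`, `ToyStoreWorker`, and `ToyStoreConnector` are hypothetical stand-ins, and the lookup is stubbed to always report zero external tokens.

```python
from enum import Enum


class Role(Enum):  # hypothetical stand-in for the connector role enum
    SCHEDULER = "scheduler"
    WORKER = "worker"


class ToyStoreScheduler:
    def __init__(self):
        self.load_specs = {}

    def get_num_new_matched_tokens(self, req_id, local_hit):
        external_hit = 0  # a real external lookup would go here
        return max(external_hit - local_hit, 0)


class ToyStoreWorker:
    def get_finished(self, finished_req_ids):
        return set(), set()  # (done_sending, done_recving)


class ToyStoreConnector:
    """Facade: exactly one of the two role objects exists per instance."""

    def __init__(self, role):
        self.scheduler = ToyStoreScheduler() if role is Role.SCHEDULER else None
        self.worker = ToyStoreWorker() if role is Role.WORKER else None

    def get_num_new_matched_tokens(self, req_id, local_hit):
        assert self.scheduler is not None, "scheduler-side call on worker role"
        return self.scheduler.get_num_new_matched_tokens(req_id, local_hit)

    def get_finished(self, finished_req_ids):
        assert self.worker is not None, "worker-side call on scheduler role"
        return self.worker.get_finished(finished_req_ids)
```

Keeping the two helpers separate means scheduler-side bookkeeping and worker-side I/O state never share an instance, which matches the description above of one side tracking intent and the other moving bytes.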
| Object | Side | Role |
|---|---|---|
| LoadSpec | Scheduler -> worker | Records local cached tokens, external cached tokens, and whether a load is now legal. |
| RequestTracker | Scheduler internal | Tracks token length, allocated blocks, saved-token watermark, token IDs, and prefill boundary. |
| ReqMeta | Metadata payload | Per-request command: token length to save/load, block IDs, block hashes, load spec, save flag, CUDA event slot. |
| PoolKey | Worker/store | String key containing model, TP/PCP/DCP/PP ranks, group ID, and chunk hash. |
| ChunkedTokenDatabase | Worker/store | Maps token chunks to PoolKey and maps block IDs to GPU addresses/sizes. |
| MooncakeStoreCoordinator | Worker lookup/load/store | Computes longest external hit and per-group masks using vLLM KV cache manager semantics. |
Lookup begins in the core scheduler, but it is serviced by a worker-local ZMQ server on rank 0. The scheduler asks, "how many prefix tokens already exist in Mooncake Store for this request's block hashes?" The worker answers with a scheduler-safe prefix length.
get_num_new_matched_tokens()
-> LookupKeyClient.lookup(token_len, block_hashes)
-> LookupKeyServer receives token length and hashes.
-> MooncakeStoreWorker.lookup() expands candidate keys across TP and PP ranks.
-> batch_is_exist(candidate_keys)
-> MooncakeStoreCoordinator.find_longest_cache_hit() converges HMA groups and masks.
-> LoadSpec if remote load is needed.
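The idea of a "scheduler-safe prefix length" can be sketched for the simplest case: one KV group, no TP/PP key expansion, and an in-memory set standing in for the store's existence check. Both function names and all parameters below are hypothetical; the real path goes through the coordinator and handles multiple groups.

```python
def longest_external_prefix(block_hashes, existing, hash_block_size,
                            scheduler_block_size, token_len):
    """Return the longest prefix length (in tokens) found in `existing`.

    The result is rounded down to a multiple of scheduler_block_size.
    """
    hit_blocks = 0
    for block_hash in block_hashes:
        if block_hash not in existing:
            break  # a prefix hit stops at the first missing chunk
        hit_blocks += 1
    hit_tokens = min(hit_blocks * hash_block_size, token_len)
    return (hit_tokens // scheduler_block_size) * scheduler_block_size


def num_new_matched_tokens(external_hit, local_hit):
    # Only tokens beyond the local prefix-cache hit need a remote load.
    return max(external_hit - local_hit, 0)


hashes = ["h0", "h1", "h2", "h3"]
stored = {"h0", "h1", "h2"}
hit = longest_external_prefix(hashes, stored, hash_block_size=16,
                              scheduler_block_size=32, token_len=64)
new_tokens = num_new_matched_tokens(hit, local_hit=16)
```

In this example three 16-token chunks exist (48 tokens), but the hit is rounded down to 32 so it lands on a scheduler block boundary; with 16 tokens already cached locally, 16 more would be loaded remotely.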
MooncakeStoreConnectorMetadata is not just a list of current
scheduled requests. It also carries unfinished request IDs and preemptions,
and each ReqMeta tells the worker whether to load, save, both, or
only poll completion.
MooncakeStoreConnectorMetadata
    unfinished_request_ids: set[str]
    preempted_req_ids: set[str]
    requests: list[ReqMeta]

ReqMeta
    req_id
    token_len_chunk
    block_ids: tuple[list[int], ...]  # one list per KV cache group
    block_hashes
    can_save
    load_spec
    is_last_chunk
    current_event
    token_ids
    original_block_size
The scheduler creates a RequestTracker from the request's
prefill token range and allocated block IDs. It builds ReqMeta
with optional LoadSpec, block hashes, and a save watermark.
The scheduler extends the existing tracker with new block IDs, advances the token length, and emits more save metadata only while the request is still inside the tracked prefill range.
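The tracker behaviour described above (extend blocks, advance the token length, stop emitting saves past the prefill range) can be sketched as follows. `ToyTracker` and its fields are hypothetical, and the block-aligned save rule is an assumption made for this illustration rather than something taken from the Mooncake code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    req_id: str
    prefill_end: int                # tokens beyond this are not saved here
    token_len: int = 0
    block_ids: list = field(default_factory=list)
    num_saved_tokens: int = 0       # save watermark

    def update(self, new_block_ids, num_new_tokens):
        # Extend the tracker as the scheduler allocates more blocks.
        self.block_ids.extend(new_block_ids)
        self.token_len += num_new_tokens

    def next_save_len(self, block_size):
        """Return the block-aligned length to save, or None if nothing new."""
        limit = min(self.token_len, self.prefill_end)
        aligned = (limit // block_size) * block_size
        if aligned <= self.num_saved_tokens:
            return None
        self.num_saved_tokens = aligned
        return aligned


tracker = ToyTracker("req-1", prefill_end=40)
tracker.update([0, 1], 32)
first = tracker.next_save_len(16)    # first chunk of the prefill
tracker.update([2], 16)
second = tracker.next_save_len(16)   # past the prefill boundary
```

The watermark makes repeated calls idempotent: a chunk is offered for saving once, and later steps only report tokens beyond what was already saved.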
Mooncake Store intentionally centralizes I/O issue and completion polling in
worker-side get_finished(). The method does three things in one
pass: enqueue loads, enqueue stores, and poll completions from prior work.
Call flow:
bind_connector_metadata(meta) -> start_load_kv(...) -> get_finished(finished_req_ids)
add_request(load ReqMeta) -> batch_get_into_multi_buffers() -> done_recving
add_request(save ReqMeta) -> batch_put_from_multi_buffers() -> done_sending
KVConnectorOutput.

Load path:
get_finished() finds each request with load_spec.can_load.
load_spec.token_len to the external cached length.
KVCacheStoreRecvingThread.
coord.load_mask() to skip chunks the local KV spec would not populate.
batch_get_into_multi_buffers() and marks the request finished.

Store path:
get_finished() records a CUDA event when any request can save.
KVCacheStoreSendingThread.
coord.lcm_block_size.
coord.store_mask() to avoid storing chunks that no future hit would need.
batch_is_exist().
batch_put_from_multi_buffers() and decrements outstanding store jobs.
Completion has two meanings:
finished_recving means remote KV is now loaded into local GPU
blocks, while finished_sending means blocks whose free was delayed
for store can now be released.
get_finished() -> (done_sending, done_recving) -> ModelRunnerOutput.kv_connector_output -> _update_from_kv_xfer_finished()

| Worker signal | Scheduler action | Code pointer |
|---|---|---|
| finished_recving | If request is WAITING_FOR_REMOTE_KVS, add request ID to finished_recving_kv_req_ids. A later scheduler pass promotes it and caches blocks. | scheduler.py:2142-2148, 2060-2109 |
| finished_sending | Free blocks and remove the request row, because the connector has completed any delayed save/send responsibility. | scheduler.py:2151-2154 |
| kv_cache_events | Mooncake connector facade aggregates worker KV events and exposes them through take_events(). | connector.py:196-216 |
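The two completion signals in the table lead to different scheduler actions, which a toy state machine can show. `Status` and `ToyScheduler` are hypothetical names; the sketch only models the bookkeeping in the table, not block caching or real frees.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING_FOR_REMOTE_KVS = auto()
    WAITING = auto()
    FINISHED = auto()


class ToyScheduler:
    def __init__(self):
        self.requests = {}                        # req_id -> Status
        self.finished_recving_kv_req_ids = set()
        self.freed = []

    def update_from_kv_xfer_finished(self, finished_recving, finished_sending):
        for req_id in finished_recving:
            if self.requests.get(req_id) is Status.WAITING_FOR_REMOTE_KVS:
                # Only recorded here; promotion happens in a later pass.
                self.finished_recving_kv_req_ids.add(req_id)
        for req_id in finished_sending:
            # Delayed free: the save completed, so blocks can be released.
            self.requests.pop(req_id, None)
            self.freed.append(req_id)

    def promote_ready(self):
        # A later scheduling pass moves loaded requests back to runnable.
        for req_id in sorted(self.finished_recving_kv_req_ids):
            self.requests[req_id] = Status.WAITING
        self.finished_recving_kv_req_ids.clear()


sched = ToyScheduler()
sched.requests = {"a": Status.WAITING_FOR_REMOTE_KVS, "b": Status.FINISHED}
sched.update_from_kv_xfer_finished({"a"}, {"b"})
status_before_promote = sched.requests["a"]
sched.promote_ready()
```

Note the two-step handling of `finished_recving`: the request stays in `WAITING_FOR_REMOTE_KVS` until the next scheduling pass, which is where a stuck request would show up if either step were skipped.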
Mooncake Store with HMA is easiest to reason about if you keep three token frames separate. Mixing these frames is the usual source of wrong conclusions.
| Frame | Meaning | Where used |
|---|---|---|
| native group block size | Each KV cache group's own physical/spec block size. | ChunkedTokenDatabase per group, coordinator masks, address mapping. |
| hash_block_size | Granularity of Request.block_hashes. | Request block hasher, BlockHashListWithBlockSize, key generation. |
| scheduler_block_size | Safe prefix length alignment visible to the scheduler. | Scheduler num_computed_tokens, Mooncake hit length, HMA convergence. |
A single KV cache group uses cache_config.block_size * dcp * pcp for both
scheduler and hash block sizes. Multiple groups use the LCM of group block sizes
for scheduler alignment, and either an explicit hash_block_size or the
GCD of group block sizes for hash granularity.
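The block-size rule can be written down as a few lines of arithmetic. `resolve_block_sizes()` is a hypothetical helper for illustration only; it applies the dcp/pcp factor in the single-group case as stated above and, as a simplifying assumption, ignores those factors when there are multiple groups.

```python
import math


def resolve_block_sizes(group_block_sizes, dcp=1, pcp=1, hash_block_size=None):
    """Return (scheduler_block_size, hash_block_size) for the given groups."""
    if len(group_block_sizes) == 1:
        size = group_block_sizes[0] * dcp * pcp
        return size, size
    # Multiple groups: align the scheduler on the least common multiple.
    scheduler_block_size = math.lcm(*group_block_sizes)
    if hash_block_size is None:
        # No explicit value: hash at the greatest common divisor.
        hash_block_size = math.gcd(*group_block_sizes)
    return scheduler_block_size, hash_block_size


single = resolve_block_sizes([16], dcp=2)
hybrid = resolve_block_sizes([16, 48])
explicit = resolve_block_sizes([16, 32], hash_block_size=8)
```

With groups of 16 and 48 tokens, a prefix is only scheduler-safe at multiples of 48, while hashes are taken every 16 tokens so that every group boundary falls on a hash boundary.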
The coordinator is the HMA bridge. It mirrors vLLM cache-manager hit logic
over an ExternalCachedBlockPool, not over local allocated blocks.
Lookup feeds it an existence set from Mooncake keys. Load and store paths ask
it for masks so each group only transfers chunks that the corresponding KV
spec would actually use.
MooncakeStoreCoordinator
    find_longest_cache_hit(block_hashes, max_length, ExternalCachedBlockPool)
        -> (load_mask_per_group, hit_length)
    load_mask(block_hashes, token_len)
        -> which chunks to load locally
    store_mask(aligned_token_len)
        -> which chunks are worth storing for future hits
Read vllm/v1/engine/core.py:141-156 and 202-211 to see how scheduler/hash block sizes and request hashing are wired into v1.
Read vllm/v1/core/sched/scheduler.py:597-603, 746-751, 887-922, and 2127-2154.
Read vllm/v1/worker/kv_connector_model_runner_mixin.py:85-119 to understand exactly when worker-side connector methods are invoked.
Read mooncake/store/connector.py:80-147 to see how one connector class splits into scheduler and worker role objects.
Read mooncake/store/scheduler.py:84-127 for lookup and 163-366 for metadata construction.
Read mooncake/store/data.py:70-149 for key/address translation, then 195-290 for metadata payload shape.
Read mooncake/store/worker.py:1090-1186 first, then jump back to 393-623 and 626-759 for send/recv thread internals.
Read mooncake/store/coordinator.py:55-284; it explains why HMA lookup/load/store are not just flat block-list operations.
finished_recving did not reach scheduler, or scheduler received
it but did not promote WAITING_FOR_REMOTE_KVS.