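vLLM's kv_cache_manager: allocate_slots()

In the `schedule()` method covered earlier, before a request is preempted or admitted, the kv_cache_manager first tries to allocate new GPU memory blocks based on the request's `num_new_tokens`. This section looks at what the kv_cache_manager does during that call.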
```python
new_blocks = self.kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_lookahead_tokens=self.num_lookahead_tokens,
)
```
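Inside the method, a long comment explains how the blocks are laid out. One of the regions is `ext_comp` (`num_external_computed_tokens`), which is only nonzero when a KV connector is in use, as in a P/D (Prefill/Decode) disaggregated deployment. It involves transferring KV cache across instances; for now it is enough to know the concept exists.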
```
----------------------------------------------------------------------
| < comp > | < new_comp > | < ext_comp >  | < new >  | < lookahead > |
----------------------------------------------------------------------
                                          |    < to be computed >    |
----------------------------------------------------------------------
                          |           < to be allocated >            |
----------------------------------------------------------------------
                          | < to be cached (roughly, |
                          | details below)>          |
----------------------------------------------------------------------
| Prefix-cached tokens from either vLLM   |
| or connector. Can be safely removed if  |
| they are outside sliding window.        |
----------------------------------------------------------------------
|   < cached by vLLM >    | not cached by |
                          | vLLM, but     |
| ref_cnt  | ref_cnt not  | cached by     |
| increased| increased yet| connector     |
----------------------------------------------------------------------
```

Abbreviations:

```
comp      = request.num_computed_tokens
new_comp  = num_new_computed_tokens
          = len(new_computed_blocks) * block_size
ext_comp  = num_external_computed_tokens, cached by the connector
new       = num_new_tokens, including unverified draft tokens
lookahead = num_lookahead_tokens
```
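Next, the method computes the number of tokens that are already cached locally, `num_local_computed_tokens`. It has two parts: `num_computed_tokens` and `num_new_computed_tokens`.

> What is `num_new_computed_tokens`?

Looking at how the scheduler calls this method, there are four cases:

1. Handling a running request. The argument keeps its default value of 0: a running request needs no prefix-cache lookup, since everything it has already computed is counted in `request.num_computed_tokens`.
2. Handling a waiting request (a new request or one that was previously preempted). The scheduler contains the call below. The `if` check shows that this branch handles new requests and previously preempted ones. Because blocks are evicted lazily, the KV cache of a preempted request may still be resident in memory, so the scheduler first looks up how many of its tokens are still cached.

   ```python
   if request.num_computed_tokens == 0:
       # Get locally-cached tokens.
       new_computed_blocks, num_new_local_computed_tokens = (
           self.kv_cache_manager.get_computed_blocks(request)
       )
   ```

3. Prefix cache miss, or `enable_prefix_cache == False`. Here both `num_new_computed_tokens` and `num_computed_tokens` are clearly 0.
4. P/D disaggregation. This is the `else` branch of the same check: the request is in the waiting queue but already has `num_computed_tokens > 0`, so the local lookup is skipped.

   ```python
   else:
       new_computed_blocks = self.kv_cache_manager.empty_kv_cache_blocks
       num_new_local_computed_tokens = 0
       num_computed_tokens = request.num_computed_tokens
   ```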
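To make the lazy-eviction point in case 2 concrete, here is a minimal toy sketch. `ToyBlockPool` and its methods are hypothetical names invented for illustration and are not vLLM's classes; the only idea it models is that freeing a block keeps its content hash, and the hash is dropped only when the block is actually handed out again.

```python
from collections import OrderedDict
from typing import Optional


class ToyBlockPool:
    """Toy block pool with lazy eviction (illustrative, not vLLM's class)."""

    def __init__(self, num_blocks: int) -> None:
        # Free block IDs in eviction order (oldest freed first).
        self.free_queue = OrderedDict((i, None) for i in range(num_blocks))
        self.hash_to_block = {}  # content hash -> block ID
        self.block_to_hash = {}  # block ID -> content hash

    def allocate(self, content_hash: int) -> int:
        """Take the oldest free block; its stale hash is evicted only now."""
        block_id, _ = self.free_queue.popitem(last=False)
        stale_hash = self.block_to_hash.pop(block_id, None)
        if stale_hash is not None:
            del self.hash_to_block[stale_hash]
        self.hash_to_block[content_hash] = block_id
        self.block_to_hash[block_id] = content_hash
        return block_id

    def free(self, block_id: int) -> None:
        """Return a block to the free queue but keep its hash (lazy)."""
        self.free_queue[block_id] = None

    def lookup(self, content_hash: int) -> Optional[int]:
        """Return the block cached for this hash, reclaiming it if it was free."""
        block_id = self.hash_to_block.get(content_hash)
        if block_id is not None:
            self.free_queue.pop(block_id, None)
        return block_id


# A request computes one block, is preempted, then comes back.
pool = ToyBlockPool(num_blocks=2)
block = pool.allocate(content_hash=111)
pool.free(block)                         # preemption: block is freed, hash kept
hit_after_preemption = pool.lookup(111)  # the same block is found again

# If other requests consume every free block first, the hit is lost.
pool2 = ToyBlockPool(num_blocks=2)
pool2.free(pool2.allocate(content_hash=111))
pool2.allocate(content_hash=222)
pool2.allocate(content_hash=333)         # reuses the freed block, evicting 111
miss_after_reuse = pool2.lookup(111)
```

In this toy model a preempted request gets its tokens back for free as long as nobody else has claimed its blocks, which is why the scheduler bothers to look up cached tokens before allocating.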
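Thought: the current P/D disaggregation flow seems to work like this:

1. Request → Prefill instance (max_tokens=1, prefill only)
2. Prefill finishes → returns KV metadata (block addresses, etc.)
3. Request + KV metadata → Decode instance
4. Decode allocates blocks and receives the KV transfer
5. Decoding starts

Total latency = prefill + block allocation + KV transfer + decode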
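Would an optimization like the following be feasible?

1. The request is sent to Prefill and Decode at the same time
2. Decode immediately allocates blocks as placeholders (this is fast)
3. Prefill computes the KV in parallel
4. When Prefill finishes, it transfers the KV directly into the blocks that Decode pre-allocated
5. Decoding starts

Total latency = max(prefill, block allocation) + KV transfer + decode

Difficulty: the Prefill instance has to know the target Decode instance and its block IDs in advance (and those are only available once block allocation has finished), so a centralized scheduler is required.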
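The two latency formulas can be compared directly. The sketch below is plain arithmetic on the formulas as written; the millisecond values are made-up placeholders, not measurements.

```python
def sequential_latency(prefill: int, alloc: int, transfer: int, decode: int) -> int:
    """Current flow: prefill, then block allocation, then KV transfer, then decode."""
    return prefill + alloc + transfer + decode


def overlapped_latency(prefill: int, alloc: int, transfer: int, decode: int) -> int:
    """Proposed flow: block allocation overlaps with prefill."""
    return max(prefill, alloc) + transfer + decode


# Hypothetical timings in milliseconds (placeholders, not benchmark data).
timings = dict(prefill=200, alloc=1, transfer=30, decode=500)

before = sequential_latency(**timings)
after = overlapped_latency(**timings)
saved = before - after
```

Since `a + b - max(a, b) == min(a, b)`, the saving is always the smaller of the prefill time and the allocation time. If allocation really is fast, as step 2 assumes, the gain is bounded by that small allocation cost, which has to be weighed against the added complexity of a centralized scheduler.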
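Back to the code. The remaining variables are self-explanatory. The call to `remove_skipped_blocks` further down exists because some models use sliding-window attention, so cache entries outside the window can be removed; most models, though, presumably use full attention.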
```python
num_local_computed_tokens = (
    request.num_computed_tokens + num_new_computed_tokens
)
total_computed_tokens = min(
    num_local_computed_tokens + num_external_computed_tokens,
    self.max_model_len,
)
num_tokens_main_model = total_computed_tokens + num_new_tokens
num_tokens_need_slot = min(
    num_tokens_main_model + num_lookahead_tokens,
    self.max_model_len,
)

self.coordinator.remove_skipped_blocks(
    request.request_id, total_computed_tokens
)
```
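A worked example may help with these counters. The sketch below copies the four formulas into a standalone function and then converts tokens into blocks. `BLOCK_SIZE = 16`, the input numbers, and the "blocks covering the request minus blocks already held" step are assumptions made for illustration; the real per-group block accounting lives in the coordinator and is not reproduced here.

```python
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


@dataclass
class SlotCounts:
    num_local_computed_tokens: int
    total_computed_tokens: int
    num_tokens_main_model: int
    num_tokens_need_slot: int


def compute_slot_counts(
    num_computed_tokens: int,
    num_new_computed_tokens: int,
    num_external_computed_tokens: int,
    num_new_tokens: int,
    num_lookahead_tokens: int,
    max_model_len: int,
) -> SlotCounts:
    """Same arithmetic as the excerpt above, without any vLLM objects."""
    local = num_computed_tokens + num_new_computed_tokens
    total = min(local + num_external_computed_tokens, max_model_len)
    main = total + num_new_tokens
    need = min(main + num_lookahead_tokens, max_model_len)
    return SlotCounts(local, total, main, need)


BLOCK_SIZE = 16  # hypothetical block size

# A new request: 32 prompt tokens hit the local prefix cache, 50 remain.
counts = compute_slot_counts(
    num_computed_tokens=0,
    num_new_computed_tokens=32,
    num_external_computed_tokens=0,
    num_new_tokens=50,
    num_lookahead_tokens=0,
    max_model_len=4096,
)
blocks_covering_request = cdiv(counts.num_tokens_need_slot, BLOCK_SIZE)
blocks_already_held = counts.total_computed_tokens // BLOCK_SIZE
new_blocks_needed = blocks_covering_request - blocks_already_held

# Near the context limit, lookahead slots are clamped by max_model_len.
clamped = compute_slot_counts(4090, 0, 0, 4, 8, 4096)
```

With these inputs the request needs slots for 82 tokens, which is 6 blocks of 16, of which 2 are already covered by the prefix-cache hit.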
Next, the method uses the values computed above to create and allocate the block objects:
```python
if (
    new_computed_block_list is not self.empty_kv_cache_blocks.blocks
    or num_external_computed_tokens > 0
):
    self.coordinator.allocate_new_computed_blocks(
        request_id=request.request_id,
        new_computed_blocks=new_computed_block_list,
        num_local_computed_tokens=num_local_computed_tokens,
        num_external_computed_tokens=num_external_computed_tokens,
    )

new_blocks = self.coordinator.allocate_new_blocks(
    request.request_id,
    num_tokens_need_slot,
    num_tokens_main_model,
    num_encoder_tokens,
)

if not self.enable_caching or delay_cache_blocks:
    return self.create_kv_cache_blocks(new_blocks)

num_tokens_to_cache = min(
    total_computed_tokens + num_new_tokens,
    request.num_tokens,
)
self.coordinator.cache_blocks(request, num_tokens_to_cache)

return self.create_kv_cache_blocks(new_blocks)
```
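The last few lines decide how many tokens get registered in the prefix cache. Below is a minimal sketch of that decision under two assumptions that the excerpt itself does not state: that only completely filled blocks are cached, and that `request.num_tokens` counts only verified tokens, so the `min` trims unverified draft tokens. The helper name and the numbers are hypothetical.

```python
def num_full_blocks_to_cache(
    total_computed_tokens: int,
    num_new_tokens: int,
    request_num_tokens: int,
    block_size: int,
    enable_caching: bool = True,
    delay_cache_blocks: bool = False,
) -> int:
    """Return how many full blocks would be cached in this call (toy model)."""
    # Mirrors the early return: nothing is cached in either of these cases.
    if not enable_caching or delay_cache_blocks:
        return 0
    # Mirrors the min(): never cache past the tokens the request really has.
    num_tokens_to_cache = min(
        total_computed_tokens + num_new_tokens,
        request_num_tokens,
    )
    # Assumption: only completely filled blocks are eligible for caching.
    return num_tokens_to_cache // block_size


# Hypothetical decode step with speculative decoding:
# 96 tokens computed, 4 new tokens scheduled (1 verified + 3 drafts),
# and the request currently holds 97 verified tokens.
cached_now = num_full_blocks_to_cache(96, 4, 97, block_size=16)

# With delay_cache_blocks=True (remote KV not received yet) nothing is cached.
cached_when_delayed = num_full_blocks_to_cache(
    96, 4, 97, block_size=16, delay_cache_blocks=True
)
```

Here `min(100, 97)` gives 97 tokens, so 6 full blocks are cached and the partially filled seventh block is left alone.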
The sequence below comes from DeepWiki; the connector itself will get its own post later.
```
Step N:
  allocate_slots(delay_cache_blocks=True)
    → allocate blocks as placeholders, do not cache them
  connector.update_state_after_alloc()
    → tell the connector "write the KV into these block IDs"
  request.status = WAITING_FOR_REMOTE_KVS

[background: the connector's background thread receives the KV data and writes it into the blocks]

Step N+M (at the start of some scheduling step):
  connector reports that the transfer is complete
    → finished_recving_kv_req_ids.add(request_id)
  _update_waiting_for_remote_kv(request)
    → kv_cache_manager.cache_blocks()   ← the blocks are only cached here
    → request.status = WAITING

Step N+M+1:
  the request re-enters the waiting queue and is scheduled
  num_computed_tokens > 0, so get_computed_blocks() is skipped
  the remaining blocks are allocated normally, and the request moves to the running queue and starts decoding
```
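The timeline can be read as a small state machine. The toy sketch below models only the status transitions; `ToyRequest` and `ToyScheduler` are hypothetical classes written for this post, not vLLM's scheduler, and the sketch assumes every fresh waiting request needs remote KV.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    WAITING_FOR_REMOTE_KVS = auto()
    RUNNING = auto()


class ToyRequest:
    def __init__(self, request_id: str, num_remote_tokens: int) -> None:
        self.request_id = request_id
        self.num_remote_tokens = num_remote_tokens  # KV expected from prefill
        self.num_computed_tokens = 0
        self.status = Status.WAITING
        self.blocks_cached = False


class ToyScheduler:
    def __init__(self) -> None:
        self.finished_recving_kv_req_ids = set()

    def on_transfer_done(self, request_id: str) -> None:
        """Stand-in for the connector reporting a finished transfer."""
        self.finished_recving_kv_req_ids.add(request_id)

    def step(self, req: ToyRequest) -> None:
        if req.status is Status.WAITING_FOR_REMOTE_KVS:
            # Step N+M: only act once the transfer has been reported.
            if req.request_id in self.finished_recving_kv_req_ids:
                self.finished_recving_kv_req_ids.remove(req.request_id)
                req.blocks_cached = True  # caching happens only now
                req.num_computed_tokens = req.num_remote_tokens
                req.status = Status.WAITING
            return
        if req.status is Status.WAITING:
            if req.num_computed_tokens == 0:
                # Step N: blocks are reserved but not cached yet.
                req.status = Status.WAITING_FOR_REMOTE_KVS
            else:
                # Step N+M+1: local lookup is skipped, request starts running.
                req.status = Status.RUNNING


sched = ToyScheduler()
req = ToyRequest("req-0", num_remote_tokens=128)
history = []

sched.step(req)                         # Step N
history.append(req.status.name)
sched.step(req)                         # transfer still in flight
history.append(req.status.name)
sched.on_transfer_done(req.request_id)
sched.step(req)                         # Step N+M
history.append(req.status.name)
sched.step(req)                         # Step N+M+1
history.append(req.status.name)
```

The point the toy makes is the ordering: the request stays parked while the transfer is in flight, its blocks are marked cached only after the connector reports completion, and it reaches the running state one step later with a nonzero `num_computed_tokens`.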