
[Serving] Prefix Cache #2295

Merged
merged 12 commits into from
May 21, 2024

Conversation

cyx-6
Copy link
Contributor

@cyx-6 cyx-6 commented May 7, 2024

This PR introduces the prefix cache into the serving engine, to manage prefixes and accelerate the prefill process.

cc: @tqchen @MasterJH5574 @Ubospica

/*!
 * \brief Whether the request data is pinned in the KVCache. Used for the system prompt cache.
 */
bool pinned = false;
Contributor

move this to generation config

Contributor Author

moved, and added a test for the pinned system prompt.

tqchen
tqchen previously requested changes May 14, 2024
cpp/serve/request_state.h Show resolved Hide resolved
/*!
* \brief The matched result from prefix cache.
*/
struct MatchedResult {
Contributor

PrefixCacheMatchedResult, given this namespace is serve and MatchedResult is a bit generic

Contributor Author

renamed

cpp/serve/radix_tree.cc Show resolved Hide resolved
cpp/serve/radix_tree.cc Show resolved Hide resolved
cpp/serve/radix_tree.cc Outdated Show resolved Hide resolved
cpp/serve/radix_tree.cc Outdated Show resolved Hide resolved
@@ -115,6 +121,9 @@ class EagleNewRequestPrefillActionObj : public BatchPrefillBaseActionObj {
// Speculative models shift left the input tokens by 1 when base model has committed tokens.
// Note: for n > 1 cases Eagle doesn't work because parent entry doesn't shift input tokens.
for (int j = 0; j < static_cast<int>(input_data.size()); ++j) {
if (!model_id) {
Contributor

model_id != 0 ?

Contributor Author

Actually, it does not matter here whether model_id == 0. The prefix cache only cares about the base model (model_id == 0), so we only update when model_id == 0. However, this will change soon when the input data logic is refactored to use an offset instead.

@@ -33,6 +33,9 @@ class Request(Object):
The JSON string of the default generation config.
When a field in the input generation_config is not defined,
we use the value in the default generation config.

pinned : bool
Contributor

let us avoid exposing pinned for now to the user side

@@ -97,6 +97,10 @@ class EngineImpl : public Engine {
return TResult::Error(engine_config_res.UnwrapErr());
}
EngineConfig engine_config = engine_config_res.Unwrap();
n->estate_->prefix_cache = PrefixCache::Init(
engine_config->max_total_sequence_length / engine_config->kv_cache_page_size,
engine_config->kv_cache_page_size * 16, engine_config->max_num_sequence,
Member

comment where 16 comes from

Contributor

@tqchen tqchen May 18, 2024

Agree, we should avoid magic numbers; use a named constant.

* tokens. If the request state entry is not added to KVCache yet, this method will add/fork the
* request in the KVCache, depending on the matching result from prefix cache.
* \param estate The engine state.
* \param[out] input The prefill input to be matched and updated.
Member

Suggested change
* \param[out] input The prefill input to be matched and updated.
* \param[in,out] input The prefill input to be matched and updated.

there are also a few other places that need updates

@@ -50,8 +64,25 @@ void ProcessFinishedRequestStateEntries(std::vector<RequestStateEntry> finished_
// So we mark the parent entry as finished.
rstate->entries[parent_idx]->status = RequestStateStatus::kFinished;
// Remove the request state entry from all the models.
RemoveRequestFromModel(estate, rstate->entries[parent_idx]->mstates[0]->internal_id, models);
estate->id_manager.RecycleId(rstate->entries[parent_idx]->mstates[0]->internal_id);
if (estate->prefix_cache->HasSequence(rstate->entries[parent_idx]->mstates[0]->internal_id)) {
Member

This is very similar to the changes in line 32; it's better to extract it into a function or lambda

Contributor

making it a named function

@kripper
Contributor

kripper commented May 16, 2024

Are these ideas already considered?
See: #2353

@tqchen tqchen dismissed their stale review May 16, 2024 21:57

comments addressed

@cyx-6
Contributor Author

cyx-6 commented May 17, 2024

@kripper Thanks for the suggestions, but I think these are not part of this PR. Although this PR does aim to improve multi-round chat, it does so in a different way from AttentionStore.

@kripper
Contributor

kripper commented May 17, 2024

@kripper Thanks for the suggestions, but I think these are not part of this PR. Although this PR does aim to improve multi-round chat, it does so in a different way from AttentionStore.

Ok. I will leave the feature request there.

* \param lazy The flag for whether the sequence should be removed lazily or immediately.
* \throw Error if the given sequence id is not valid.
*/
virtual void RecycleSequence(int64_t seq_id, PackedFunc callback, bool lazy = true) = 0;
Contributor

add a signature to the callback; in this case TypedPackedFunc is better

* \param sliding_window_size The sliding window size, -1 for disabled sliding window.
* \param attention_sink_size The attention sink position for sliding window.
*/
static PrefixCache Init(size_t num_pages, size_t page_size, size_t num_seqs,
Contributor

Init => Create. In this case, perhaps we can change it to a constructor

Contributor

naming:

  • num_seqs => max_num_seqs
  • num_pages => max_num_pages

}),
/*lazy=*/true);
}
// If the request is pinned, do nothing over the prefix cache and KVCache. Let the data be
Contributor

We should not make the data an orphan; instead, add a new kind of state (besides active etc.), e.g. SystemKeepAlive

@@ -148,6 +148,7 @@ class ChatCompletionRequest {
std::optional<std::string> user = std::nullopt;
bool ignore_eos = false;
// RequestResponseFormat response_format; //TODO: implement this
bool pinned = false;
Contributor

Let us not expose pinned for now in the JSON FFI as it is not necessary

@@ -84,7 +84,7 @@ bool JSONFFIEngine::AddRequest(std::string request_json_str, std::string request
request.top_logprobs, request.logit_bias, request.seed,
request.ignore_eos, request.max_tokens, std::move(stop_strs),
conv_template_.stop_token_ids, /*response_format=*/std::nullopt,
this->default_generation_cfg_json_str_);
request.pinned, this->default_generation_cfg_json_str_);
Contributor

The JSON FFI should not expose pinned for now

@@ -50,6 +50,7 @@ class GenerationConfigNode : public Object {
std::vector<int> stop_token_ids;

ResponseFormat response_format;
bool pinned = false;
Contributor

Let us add a sub-structure, DebugConfig, which contains debug-related options that should not be exposed to the endpoint.

Debug config should include two fields:

  • ignore_eos
  • pin_system_prompt

@@ -93,6 +93,11 @@ class GenerationConfig: # pylint: disable=too-many-instance-attributes

response_format : ResponseFormat
The response format of the generation output.

pinned : bool
Contributor

add a sub-dataclass debug_config here to include:

  • pin_system_prompt
  • ignore_eos

/*!
* \brief The core data structure radix tree.
*/
PagedRadixTree radix_tree;
Contributor

Code style: class members must end with an underscore, i.e. PagedRadixTree radix_tree_; this is for better readability

NVTXScopedRange nvtx_scope("EngineAction postproc");
std::vector<RequestStateEntry> finished_rsentries;
finished_rsentries.reserve(requests.size());

Array<RequestStreamOutput> callback_delta_outputs;
callback_delta_outputs.reserve(requests.size());

for (Request request : requests) {
RequestState rstate = estate->GetRequestState(request);
for (const RequestStateEntry& rsentry : rstate->entries) {
Contributor

lift this into a sub-function

* \param estate The engine state.
* \param[out] input The prefill input to be matched and updated.
*/
void MatchPrefixCache(EngineState estate, PrefillInput& input) final {
Contributor

pass by *

class PrefixCacheObj : public Object {
public:
/*!
* \brief Insert a new tokenized sequence into Prefix Cache.
Contributor

more comments

Contributor

This function updates the PrefixCache state; users should create a MatchAndReusePrefixCache function that takes the result and performs the related updates

* \brief The parent sequence ID to fork in KVCache. The default value is -1, which means no
* forking operation is needed.
*/
int64_t parent_seq_id = -1;
Contributor

fork_from_seq_id

*/
int64_t parent_seq_id = -1;
/*!
* \brief The matched prefix offset, which should be skipped when prefilling.
Contributor

The matched prefix offset, which can be used to guide how to fork the parent seq

* \param estate The engine state.
* \param[out] input The prefill input to be matched and updated.
*/
void MatchPrefixCache(EngineState estate, PrefillInput& input) final {
Contributor

MatchAndReusePrefixCache

* \param sliding_window_size The sliding window size, -1 for disabled sliding window.
* \param attention_sink_size The attention sink position for sliding window.
*/
explicit PrefixCacheImpl(size_t num_pages, size_t page_size, size_t num_seqs,
Contributor

prefix_cache_max_num_seqs

cyx-6 added 3 commits May 18, 2024 21:43
This PR introduces the prefix cache into the serving engine, to manage prefixes and accelerate the prefill process.
@cyx-6 cyx-6 force-pushed the prefix-cache-2 branch 2 times, most recently from 80f5d03 to 74fcee5 Compare May 18, 2024 23:08
@@ -78,13 +78,19 @@ bool JSONFFIEngine::AddRequest(std::string request_json_str, std::string request
}
}

bool pinned_system_prompt = false;
Contributor

we should pass debug_config into serve::DebugConfig directly, instead of passing each field by value

@@ -50,6 +55,7 @@ class GenerationConfigNode : public Object {
std::vector<int> stop_token_ids;

ResponseFormat response_format;
DebugConfig debug_config = {false};
Contributor

Make DebugConfig optional

@@ -30,6 +30,11 @@ struct ResponseFormat {
Optional<String> schema = NullOpt;
};

/*! \brief The debug configuration of a request. */
struct DebugConfig {
bool pinned_system_prompt = false;
Contributor

Make DebugConfig an Object

/*! \brief The sequence ID node pool. */
SequenceIDNode* raw_pool_;
/*! \brief The size of each node pool block. */
static constexpr size_t NODE_BLOCK_SIZE_ = 64;
Contributor

kNodeBlockSize

@tqchen tqchen merged commit 5444fd5 into mlc-ai:main May 21, 2024
1 of 2 checks passed