Implement H2O for long context inference on summarization tasks #411
base: main
Conversation
Hi @Kyriection! Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Thanks a lot @Kyriection for the PR!! Just added some quick initial thoughts on the PR; will go deeper on the second round.
The following example runs inference of Llama-2-7b on the XSUM summarization task. We use `--enable_h2o_generation` to enable the H2O algorithm, which keeps only the heavy-hitter and local KV pairs. Use `--num_heavy_hitter_tokens` to set the number of heavy-hitter KV pairs and `--num_window_length` for the total KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Also, use `--enable_position_rolling` to enable position rolling, which assigns positions based on slots in the KV cache instead of positions in the original sequence. Enabling position rolling is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.
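As a rough illustration of how the two flags partition the fixed KV-cache budget (a sketch with hypothetical values; variable names mirror the flags, but this is not the script's actual code):

```
# Sketch: how the H2O flags partition the KV-cache budget.
num_window_length = 256        # total KV-cache budget
num_heavy_hitter_tokens = 128  # slots reserved for heavy-hitter KV pairs
num_local = num_window_length - num_heavy_hitter_tokens  # recent tokens kept

# Every cached KV pair is either a heavy hitter or a recent (local) token.
assert num_heavy_hitter_tokens + num_local == num_window_length
```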
Running the code runs into this error; it seems one arg is missing:
File "/data/users/hamidnazeri/fbsource/llama-recipe-new/llama-recipes/research/long-context-llama/H2O/utils/llama.py", line 290, in enable_h2ocache_forward
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
TypeError: LlamaModel._update_causal_mask() takes 3 positional arguments but 4 were given
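For what it's worth, this looks like a transformers version mismatch: some releases of `LlamaModel._update_causal_mask` accept a `cache_position` argument while others do not. One possible compatibility guard, sketched here as a hypothetical helper (not part of the PR):

```
import inspect

def call_update_causal_mask(model, attention_mask, inputs_embeds, cache_position):
    # Pass cache_position only if the installed transformers version accepts it.
    params = inspect.signature(model._update_causal_mask).parameters
    if "cache_position" in params:
        return model._update_causal_mask(attention_mask, inputs_embeds, cache_position)
    return model._update_causal_mask(attention_mask, inputs_embeds)
```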
Can we also start adding numbers/visuals showing how it looks and what it improves?
Can you please try moving the folder under `recipes/experimental/long-context/H2O`?
@@ -0,0 +1,11 @@
## Run Llama with H2O for long context inference
Can you please add a link to the paper and blog post (with your name as the contributor, and anyone else you'd like to add), and also a brief summary of:
- what H2O is
- how it works
- what its advantages are
- which models it supports (Llama 7B, 13B, 70B?); it seems we need to work with ecosystem projects like vLLM etc. to get it working for multi-GPU?
- potential limitations
Hi @HamidShojanazeri, I have updated the PR by adding the implementation of a Needle in a Haystack analysis, an example of inference with growing sequence length, and results on Llama-2/3. Please check the updated PR. Thanks!
For more details, please refer to the paper: https://arxiv.org/pdf/2306.14048 and the blog: https://allenz.work/?p=11.
### Environments:
We can make it a "NOTE" about installing the right version, instead of an "Environments" section, since it seems only one package's version is important.
##### **Results**
Expected results on XSUM (Rouge-2 score) from the above scripts on Llama-2/3 models. The input sequence length is ~2k, so a KV cache size larger than 2048 represents full-cache performance.
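For reference, Rouge-2 measures bigram overlap between a generated summary and the reference (higher is better). A minimal scoring sketch using the `rouge-score` package (illustrative; not necessarily the evaluation code used by the scripts):

```
from rouge_score import rouge_scorer

# Rouge-2 compares bigrams of the reference and the generated summary.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",          # reference summary
    "the cat was sitting on the mat",  # generated summary
)
print(scores["rouge2"].fmeasure)  # bigram-overlap F1, in [0, 1]
```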
@Kyriection can you please add a bit more explanation of what the baseline is and how someone should interpret Rouge? What does each KV cache size mean here?
Also, it seems that with a KV cache of 64, Llama-2-7B is significantly worse.
Let's highlight that this brings throughput benefits at inference time.
### Evaluation on "Needle in a Haystack" Analysis
The following example runs inference of Llama-3-8b-instruct on the "Needle in a Haystack" test, modified from [LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack); please follow the original repository to install the necessary packages. The H2O flags behave as described above: `--enable_h2o_generation` keeps only the heavy-hitter and local KV pairs, `--num_heavy_hitter_tokens` and `--num_window_length` set the budget split, and `--enable_position_rolling` is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.
Can we please add some numbers here vs. the baseline to see how it's working?
I wonder, for the data (especially the Paul Graham essays), whether it makes sense to use a script to download them, or whether it's easier this way? Especially if some postprocessing is involved, it might be easier to have it as is.
| Model | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | 0.0439 | 0.1127 | 0.1148 | 0.1182 | 0.1170 | 0.1164 | 0.1164 | 0.1164 |
| Llama-2-13B | 0.1180 | 0.1217 | 0.1243 | 0.1291 | 0.1302 | 0.1332 | 0.1332 | 0.1332 |
| Llama-3-8B | 0.1107 | 0.1189 | 0.1200 | 0.1347 | 0.1290 | 0.1311 | 0.1311 | 0.1311 |
It would also be great to highlight the idea of streaming LLMs here, to showcase the benefits at inference time.
```
python utils/needle_test/vis.py \
    --input-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096_eval
```
Can we please add some explanation to clarify, especially regarding the tested length of 1M?
### Overview:
Heavy-Hitter Oracle (H2O) is an efficient inference framework for LLMs. During the generative inference of transformers, the KV cache grows linearly with the sequence length (prompt length + generation length), and during long-context generation its size is usually significantly larger than the model parameters, constraining inference throughput. H2O identifies the critical KV pairs and evicts the unnecessary ones, maintaining a small cache size and thus improving throughput.
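A minimal sketch of the eviction rule, assuming the policy described in the paper (keep the most recent tokens plus the tokens with the largest accumulated attention scores); the PR's actual implementation in `utils/llama.py` may differ:

```
import torch

def h2o_keep_mask(attn_weights: torch.Tensor, num_heavy: int, num_recent: int) -> torch.Tensor:
    """Select which KV pairs to keep for one attention head.

    attn_weights: (num_queries, seq_len) attention probabilities observed so far.
    Returns a boolean mask over seq_len: True = keep this KV pair.
    """
    seq_len = attn_weights.shape[-1]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[-num_recent:] = True                    # local window: most recent tokens

    scores = attn_weights.sum(dim=0)             # accumulated attention per KV pair
    scores[-num_recent:] = float("-inf")         # local window is already kept
    keep[scores.topk(num_heavy).indices] = True  # heavy hitters: top accumulated mass
    return keep
```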
Let's clearly state the difference between this work and other long-context work, which is based on training and publishes a checkpoint. Here, the main idea is a KV cache policy.
This adds an implementation of the H2O algorithm for efficient long-context inference with Llama models. The current implementation is based on Hugging Face transformers and is tested on summarization tasks, including XSUM and CNN/DailyMail.