Implement H2O for long context inference on summarization tasks #411

@@ -0,0 +1,90 @@

## Run Llama with H2O for long context inference

### Overview:

Heavy-Hitter Oracle (H2O) is an efficient inference framework for LLMs. During generative inference with transformers, the KV cache grows linearly with the sequence length (prompt length + generation length). For long contexts, the KV cache often becomes significantly larger than the model parameters themselves, which constrains inference throughput. H2O identifies the critical KV pairs and evicts the unnecessary ones, keeping the cache small and thereby improving throughput.
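
To make the cache-size claim concrete, here is a rough back-of-the-envelope estimate (a sketch assuming Llama-2-7B's published shapes: 32 layers, 32 KV heads, head dimension 128, fp16 weights and cache; the numbers are illustrative, not measured):

```
# Rough KV-cache size estimate (assumed Llama-2-7B shapes: 32 layers,
# 32 KV heads, head_dim 128, fp16 -> 2 bytes per value).
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2

# Each token stores one K and one V vector per layer and per KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 2**20)  # ~0.5 MiB of cache per token

seq_len, batch_size = 4096, 16
print(kv_bytes_per_token * seq_len * batch_size / 2**30)  # ~32 GiB of KV cache,
# versus roughly 13 GB for the fp16 model weights (~6.7B params * 2 bytes),
# which is why the cache, not the weights, often bounds throughput.
```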

In addition, LLMs usually generalize poorly to long sequences at inference time. H2O handles this by keeping only the heavy-hitter tokens and the most recent tokens. Combined with a positional rolling strategy (each KV pair is assigned the position of its slot in the KV cache rather than its position in the original sequence), H2O can process sequences much longer than the pretrained context window.
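
The eviction rule itself is simple to sketch. The toy function below is a minimal illustration based on the paper, not the code added in this PR (the function name, the single-head layout, and the greedy top-k selection are assumptions): it keeps the positions with the largest accumulated attention scores as heavy hitters plus the most recent tokens, and position rolling then just numbers tokens by their slot in the bounded cache.

```
import torch

def evict_kv(keys, values, acc_attn, num_window_length, num_heavy_hitter_tokens):
    """Toy H2O-style eviction for a single attention head.

    keys/values: [seq_len, head_dim]; acc_attn: [seq_len] attention mass
    accumulated over decoding steps (the heavy-hitter statistic).
    Keeps num_heavy_hitter_tokens heavy hitters plus the most recent
    (num_window_length - num_heavy_hitter_tokens) local tokens.
    """
    seq_len = keys.shape[0]
    if seq_len <= num_window_length:
        return keys, values, acc_attn

    num_local = num_window_length - num_heavy_hitter_tokens
    local_idx = torch.arange(seq_len - num_local, seq_len)

    # Heavy hitters are chosen among the older (non-local) tokens only.
    heavy_idx = torch.topk(acc_attn[: seq_len - num_local],
                           k=num_heavy_hitter_tokens).indices

    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], acc_attn[keep]

def rolled_position_ids(cache_len):
    # Position rolling: positions follow the slot in the bounded cache,
    # so they never exceed num_window_length even for very long inputs.
    return torch.arange(cache_len)
```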

The current implementation supports Llama-1/2/3, from 7B to 70B. Since H2O keeps only the most important KV pairs, it may miss some information from the middle of the context on knowledge-intensive tasks.

For more details, please refer to the paper (https://arxiv.org/pdf/2306.14048) and the blog post (https://allenz.work/?p=11).

### Environments:

Review comment: We can make this a "NOTE" about installing the right version instead of an "Environments" section, as it seems only one package's version is important.

transformers == 4.39.0
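
For example, install it with pip:

```
pip install transformers==4.39.0
```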

### Evaluation on Summarization Tasks

The following example runs inference of Llama-2-7b on the XSUM summarization task. The `--enable_h2o_generation` flag enables the H2O algorithm, which keeps only the heavy-hitter and local KV pairs. Use `--num_heavy_hitter_tokens` to set the number of heavy-hitter KV pairs and `--num_window_length` to set the total KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Use `--enable_position_rolling` to assign each KV pair the position of its slot in the cache instead of its position in the original sequence; enabling position rolling is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.

```
python run_summarization.py \
--input-path data/summarization/xsum.jsonl \
--output-path summarization_output/xsum_h2o.jsonl \
--model-name meta-llama/Llama-2-7b-hf \
--enable_h2o_generation
```
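
To set the cache budget explicitly, the same script accepts the flags described above; for instance (illustrative values, not a tuned configuration):

```
python run_summarization.py \
--input-path data/summarization/xsum.jsonl \
--output-path summarization_output/xsum_h2o.jsonl \
--model-name meta-llama/Llama-2-7b-hf \
--enable_h2o_generation \
--num_heavy_hitter_tokens 128 \
--num_window_length 256 \
--enable_position_rolling
# a 256-token cache = 128 heavy-hitter + 128 local KV pairs
```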

##### Results

Expected results on XSUM (ROUGE-2 score) from the above script for Llama-2/3 models. Input sequences are ~2k tokens, so KV cache sizes of 2048 and above correspond to full-cache performance.

Review comment: @Kyriection can you please add a bit more explanation of what the baseline is, how someone should read ROUGE, and what each KV cache size means here? Also, it seems that with a KV cache of 64, Llama-2-7B is significantly worse.

Review comment: Let's highlight that this brings throughput benefits at inference time.

| KV Cache Size | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
| ------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| Llama-2-7B | 0.0439 | 0.1127 | 0.1148 | 0.1182 | 0.1170 | 0.1164 | 0.1164 | 0.1164 |
| Llama-2-13B | 0.1180 | 0.1217 | 0.1243 | 0.1291 | 0.1302 | 0.1332 | 0.1332 | 0.1332 |
| Llama-3-8B | 0.1107 | 0.1189 | 0.1200 | 0.1347 | 0.1290 | 0.1311 | 0.1311 | 0.1311 |

Review comment: It would also be great to highlight the idea of streaming LLMs here, to showcase the benefits at inference time.

### Evaluation on "Needle in a Haystack" Analysis

The following example runs inference of Llama-3-8b-instruct on the "Needle in a Haystack" test. The test is adapted from [LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack); please follow the original repository to install the necessary packages. As above, `--enable_h2o_generation` enables the H2O algorithm that keeps only the heavy-hitter and local KV pairs, `--num_heavy_hitter_tokens` sets the number of heavy-hitter KV pairs, and `--num_window_length` sets the KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Use `--enable_position_rolling` to assign positions from slots in the KV cache instead of the original sequence, which is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.

Review comment: Can we please add some numbers here vs. the baseline to see how it's working?

Review comment: I wonder whether, for the data (especially the Paul Graham essays), it makes sense to use a script to download them, or whether it's easier this way? Especially if some postprocessing is involved, it might be easier to have them as is.

```
# step 1: obtain prompts for evaluation
python utils/needle_test/prompt.py --model_name meta-llama/Meta-Llama-3-8B-Instruct


# step 2: generate predictions for each prompt
# full model
python run_needle_haystack_test.py \
--input-path data/needle_test/Huggingface \
--output-path needle_test_results/huggingface/llama-3-8b-instruct/ \
--model-name meta-llama/Meta-Llama-3-8B-Instruct

# H2O with a 4096-token KV cache (2048 heavy-hitter + 2048 local tokens)
python run_needle_haystack_test.py \
--input-path data/needle_test/Huggingface \
--output-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096/ \
--model-name meta-llama/Meta-Llama-3-8B-Instruct \
--enable_h2o_generation \
--num_window_length 4096 \
--num_heavy_hitter_tokens 2048


# step 3: scoring with GPT-4
# --input-path is the directory of prediction results from step 2
export OPENAI_API_KEY=YOUR_API_KEY
python utils/needle_test/eval.py \
--input-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096 \
--output-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096_eval


# step 4: visualization
python utils/needle_test/vis.py \
--input-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096_eval
```

Review comment: Can we please add some explanation here, especially to clarify the tested length of 1M.

### One Demo on Streaming to "Infinite" Context Length

The following example demonstrates generation over an "infinite" sequence length. We use MT-Bench data and generate the context sample by sample. The KV cache keeps the KV pairs from the previous samples while maintaining a fixed size.

```
# run with full cache
# expected results: 1) normal generation at the early stage; 2) performance collapse and slower generation at the middle stage, because the sequence length exceeds the context window and the I/O cost of the KV cache constrains throughput; 3) OOM errors, then the run stops.
bash src/streaming.sh full

# run with h2o
# expected results: normal generation at all stages.
# adjust the number of heavy-hitter tokens with --num_heavy_hitter_tokens and the KV cache size with --num_window_length in src/streaming.sh
bash src/streaming.sh h2o
```
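
To see why the bounded cache keeps this demo from running out of memory, here is a small, self-contained simulation (a toy sketch, not the actual `streaming.sh` pipeline; the random scores merely stand in for the attention statistics a real model would produce). It streams tokens sample by sample and applies the same heavy-hitter-plus-local eviction, so the cache never grows past the window size:

```
import torch

def evict(keys, values, scores, window, heavy):
    """Keep `heavy` highest-score older tokens plus the most recent (window - heavy)."""
    seq_len = keys.shape[0]
    if seq_len <= window:
        return keys, values, scores
    num_local = window - heavy
    local_idx = torch.arange(seq_len - num_local, seq_len)
    heavy_idx = torch.topk(scores[: seq_len - num_local], k=heavy).indices
    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], scores[keep]

def stream_samples(sample_lengths, window=8, heavy=4, head_dim=4):
    keys, values = torch.empty(0, head_dim), torch.empty(0, head_dim)
    scores = torch.empty(0)
    for n_tokens in sample_lengths:
        for _ in range(n_tokens):
            # Append one new token's K/V, then evict back down to the window size.
            keys = torch.cat([keys, torch.randn(1, head_dim)])
            values = torch.cat([values, torch.randn(1, head_dim)])
            scores = torch.cat([scores, torch.rand(1)])  # stand-in attention mass
            keys, values, scores = evict(keys, values, scores, window, heavy)
        print(f"after a {n_tokens}-token sample the cache holds {keys.shape[0]} KV pairs")

stream_samples([5, 20, 100])  # the cache size stays capped at `window`
```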

@@ -0,0 +1,116 @@

July 2010

What hard liquor, cigarettes, heroin, and crack have in common is that they're all more concentrated forms of less addictive predecessors. Most if not all the things we describe as addictive are. And the scary thing is, the process that created them is accelerating.

We wouldn't want to stop it. It's the same process that cures diseases: technological progress. Technological progress means making things do more of what we want. When the thing we want is something we want to want, we consider technological progress good. If some new technique makes solar cells x% more efficient, that seems strictly better. When progress concentrates something we don't want to want—when it transforms opium into heroin—it seems bad. But it's the same process at work. [1]

No one doubts this process is accelerating, which means increasing numbers of things we like will be transformed into things we like too much. [2]

As far as I know there's no word for something we like too much. The closest is the colloquial sense of "addictive." That usage has become increasingly common during my lifetime. And it's clear why: there are an increasing number of things we need it for. At the extreme end of the spectrum are crack and meth. Food has been transformed by a combination of factory farming and innovations in food processing into something with way more immediate bang for the buck, and you can see the results in any town in America. Checkers and solitaire have been replaced by World of Warcraft and FarmVille. TV has become much more engaging, and even so it can't compete with Facebook.

The world is more addictive than it was 40 years ago. And unless the forms of technological progress that produced these things are subject to different laws than technological progress in general, the world will get more addictive in the next 40 years than it did in the last 40.

The next 40 years will bring us some wonderful things. I don't mean to imply they're all to be avoided. Alcohol is a dangerous drug, but I'd rather live in a world with wine than one without. Most people can coexist with alcohol; but you have to be careful. More things we like will mean more things we have to be careful about.

Most people won't, unfortunately. Which means that as the world becomes more addictive, the two senses in which one can live a normal life will be driven ever further apart. One sense of "normal" is statistically normal: what everyone else does. The other is the sense we mean when we talk about the normal operating range of a piece of machinery: what works best.

These two senses are already quite far apart. Already someone trying to live well would seem eccentrically abstemious in most of the US. That phenomenon is only going to become more pronounced. You can probably take it as a rule of thumb from now on that if people don't think you're weird, you're living badly.

Societies eventually develop antibodies to addictive new things. I've seen that happen with cigarettes. When cigarettes first appeared, they spread the way an infectious disease spreads through a previously isolated population. Smoking rapidly became a (statistically) normal thing. There were ashtrays everywhere. We had ashtrays in our house when I was a kid, even though neither of my parents smoked. You had to for guests.

As knowledge spread about the dangers of smoking, customs changed. In the last 20 years, smoking has been transformed from something that seemed totally normal into a rather seedy habit: from something movie stars did in publicity shots to something small huddles of addicts do outside the doors of office buildings. A lot of the change was due to legislation, of course, but the legislation couldn't have happened if customs hadn't already changed.

It took a while though—on the order of 100 years. And unless the rate at which social antibodies evolve can increase to match the accelerating rate at which technological progress throws off new addictions, we'll be increasingly unable to rely on customs to protect us. [3]

Unless we want to be canaries in the coal mine of each new addiction—the people whose sad example becomes a lesson to future generations—we'll have to figure out for ourselves what to avoid and how. It will actually become a reasonable strategy (or a more reasonable strategy) to suspect everything new.

In fact, even that won't be enough. We'll have to worry not just about new things, but also about existing things becoming more addictive. That's what bit me. I've avoided most addictions, but the Internet got me because it became addictive while I was using it. [4]

Most people I know have problems with Internet addiction. We're all trying to figure out our own customs for getting free of it. That's why I don't have an iPhone, for example; the last thing I want is for the Internet to follow me out into the world. [5]

My latest trick is taking long hikes. I used to think running was a better form of exercise than hiking because it took less time. Now the slowness of hiking seems an advantage, because the longer I spend on the trail, the longer I have to think without interruption.

Sounds pretty eccentric, doesn't it? It always will when you're trying to solve problems where there are no customs yet to guide you. Maybe I can't plead Occam's razor; maybe I'm simply eccentric. But if I'm right about the acceleration of addictiveness, then this kind of lonely squirming to avoid it will increasingly be the fate of anyone who wants to get things done. We'll increasingly be defined by what we say no to.

Notes

[1] Could you restrict technological progress to areas where you wanted it? Only in a limited way, without becoming a police state. And even then your restrictions would have undesirable side effects. "Good" and "bad" technological progress aren't sharply differentiated, so you'd find you couldn't slow the latter without also slowing the former. And in any case, as Prohibition and the "war on drugs" show, bans often do more harm than good.

[2] Technology has always been accelerating. By Paleolithic standards, technology evolved at a blistering pace in the Neolithic period.

[3] Unless we mass produce social customs. I suspect the recent resurgence of evangelical Christianity in the US is partly a reaction to drugs. In desperation people reach for the sledgehammer; if their kids won't listen to them, maybe they'll listen to God. But that solution has broader consequences than just getting kids to say no to drugs. You end up saying no to science as well.

I worry we may be heading for a future in which only a few people plot their own itinerary through no-land, while everyone else books a package tour. Or worse still, has one booked for them by the government.

[4] People commonly use the word "procrastination" to describe what they do on the Internet. It seems to me too mild to describe what's happening as merely not-doing-work. We don't call it procrastination when someone gets drunk instead of working.

[5] Several people have told me they like the iPad because it lets them bring the Internet into situations where a laptop would be too conspicuous. In other words, it's a hip flask. (This is true of the iPhone too, of course, but this advantage isn't as obvious because it reads as a phone, and everyone's used to those.)

Thanks to Sam Altman, Patrick Collison, Jessica Livingston, and Robert Morris for reading drafts of this.

Review comment: Let's clearly state the difference between this work and other long-context work that is based on training and publishes a checkpoint; here the main idea is a KV cache policy.