Implement H2O for long context inference on summarization tasks #411

@@ -0,0 +1,90 @@

## Run Llama with H2O for long context inference

### Overview:

Heavy-Hitter Oracle (H2O) is an efficient inference framework for LLMs. During generative inference with transformers, the KV cache grows linearly with the sequence length (prompt length + generation length). For long contexts, the KV cache often becomes significantly larger than the model parameters themselves, which constrains inference throughput. H2O identifies the critical KV pairs and evicts the unnecessary ones, keeping the cache small and thereby improving throughput.
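
To make the cache-size claim concrete, here is a rough back-of-the-envelope estimate (a sketch assuming Llama-2-7B's published shapes: 32 layers, 32 KV heads, head dimension 128, fp16 weights and cache; the numbers are illustrative, not measured):

```
# Rough KV-cache size estimate (assumed Llama-2-7B shapes: 32 layers,
# 32 KV heads, head_dim 128, fp16 -> 2 bytes per value).
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2

# Each token stores one K and one V vector per layer and per KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 2**20)  # ~0.5 MiB of cache per token

seq_len, batch_size = 4096, 16
print(kv_bytes_per_token * seq_len * batch_size / 2**30)  # ~32 GiB of KV cache,
# versus roughly 13 GB for the fp16 model weights (~6.7B params * 2 bytes),
# which is why the cache, not the weights, often bounds throughput.
```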

In addition, LLMs usually generalize poorly to long sequences at inference time. H2O handles this by keeping only the heavy-hitter tokens and the most recent tokens. Combined with a positional rolling strategy (each KV pair is assigned the position of its slot in the KV cache rather than its position in the original sequence), H2O can process sequences much longer than the pretrained context window.
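
The eviction rule itself is simple to sketch. The toy function below is a minimal illustration based on the paper, not the code added in this PR (the function name, the single-head layout, and the greedy top-k selection are assumptions): it keeps the positions with the largest accumulated attention scores as heavy hitters plus the most recent tokens, and position rolling then just numbers tokens by their slot in the bounded cache.

```
import torch

def evict_kv(keys, values, acc_attn, num_window_length, num_heavy_hitter_tokens):
    """Toy H2O-style eviction for a single attention head.

    keys/values: [seq_len, head_dim]; acc_attn: [seq_len] attention mass
    accumulated over decoding steps (the heavy-hitter statistic).
    Keeps num_heavy_hitter_tokens heavy hitters plus the most recent
    (num_window_length - num_heavy_hitter_tokens) local tokens.
    """
    seq_len = keys.shape[0]
    if seq_len <= num_window_length:
        return keys, values, acc_attn

    num_local = num_window_length - num_heavy_hitter_tokens
    local_idx = torch.arange(seq_len - num_local, seq_len)

    # Heavy hitters are chosen among the older (non-local) tokens only.
    heavy_idx = torch.topk(acc_attn[: seq_len - num_local],
                           k=num_heavy_hitter_tokens).indices

    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], acc_attn[keep]

def rolled_position_ids(cache_len):
    # Position rolling: positions follow the slot in the bounded cache,
    # so they never exceed num_window_length even for very long inputs.
    return torch.arange(cache_len)
```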

The current implementation supports Llama-1/2/3, from 7B to 70B. Since H2O keeps only the most important KV pairs, it may miss some information from the middle of the context on knowledge-intensive tasks.

For more details, please refer to the paper (https://arxiv.org/pdf/2306.14048) and the blog post (https://allenz.work/?p=11).

### Environments:

Review comment: We can make this a "NOTE" about installing the right version instead of an "Environments" section, as it seems only one package's version is important.

transformers == 4.39.0
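
For example, install it with pip:

```
pip install transformers==4.39.0
```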

### Evaluation on Summarization Tasks

The following example runs inference of Llama-2-7b on the XSUM summarization task. The `--enable_h2o_generation` flag enables the H2O algorithm, which keeps only the heavy-hitter and local KV pairs. Use `--num_heavy_hitter_tokens` to set the number of heavy-hitter KV pairs and `--num_window_length` to set the total KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Use `--enable_position_rolling` to assign each KV pair the position of its slot in the cache instead of its position in the original sequence; enabling position rolling is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.

```
python run_summarization.py \
--input-path data/summarization/xsum.jsonl \
--output-path summarization_output/xsum_h2o.jsonl \
--model-name meta-llama/Llama-2-7b-hf \
--enable_h2o_generation
```
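
To set the cache budget explicitly, the same script accepts the flags described above; for instance (illustrative values, not a tuned configuration):

```
python run_summarization.py \
--input-path data/summarization/xsum.jsonl \
--output-path summarization_output/xsum_h2o.jsonl \
--model-name meta-llama/Llama-2-7b-hf \
--enable_h2o_generation \
--num_heavy_hitter_tokens 128 \
--num_window_length 256 \
--enable_position_rolling
# a 256-token cache = 128 heavy-hitter + 128 local KV pairs
```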

##### Results

Expected results on XSUM (ROUGE-2 score) from the above script for Llama-2/3 models. Input sequences are ~2k tokens, so KV cache sizes of 2048 and above correspond to full-cache performance.

Review comment: @Kyriection can you please add a bit more explanation of what the baseline is, how someone should read ROUGE, and what each KV cache size means here? Also, it seems that with a KV cache of 64, Llama-2-7B is significantly worse.

Review comment: Let's highlight that this brings throughput benefits at inference time.

| KV Cache Size | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
| ------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| Llama-2-7B | 0.0439 | 0.1127 | 0.1148 | 0.1182 | 0.1170 | 0.1164 | 0.1164 | 0.1164 |
| Llama-2-13B | 0.1180 | 0.1217 | 0.1243 | 0.1291 | 0.1302 | 0.1332 | 0.1332 | 0.1332 |
| Llama-3-8B | 0.1107 | 0.1189 | 0.1200 | 0.1347 | 0.1290 | 0.1311 | 0.1311 | 0.1311 |

Review comment: It would also be great to highlight the idea of streaming LLMs here, to showcase the benefits at inference time.

### Evaluation on "Needle in a Haystack" Analysis

The following example runs inference of Llama-3-8b-instruct on the "Needle in a Haystack" test. The test is adapted from [LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack); please follow the original repository to install the necessary packages. As above, `--enable_h2o_generation` enables the H2O algorithm that keeps only the heavy-hitter and local KV pairs, `--num_heavy_hitter_tokens` sets the number of heavy-hitter KV pairs, and `--num_window_length` sets the KV cache size; the number of local KV pairs equals `num_window_length - num_heavy_hitter_tokens`. Use `--enable_position_rolling` to assign positions from slots in the KV cache instead of the original sequence, which is important when the sequence length exceeds the pretrained context window, e.g., 4K in Llama-2.

Review comment: Can we please add some numbers here vs. the baseline to see how it's working?

Review comment: I wonder whether, for the data (especially the Paul Graham essays), it makes sense to use a script to download them, or whether it's easier this way? Especially if some postprocessing is involved, it might be easier to have them as is.

```
# step 1: obtain prompts for evaluation
python utils/needle_test/prompt.py --model_name meta-llama/Meta-Llama-3-8B-Instruct


# step 2: generate predictions for each prompt
# full model
python run_needle_haystack_test.py \
--input-path data/needle_test/Huggingface \
--output-path needle_test_results/huggingface/llama-3-8b-instruct/ \
--model-name meta-llama/Meta-Llama-3-8B-Instruct

# H2O with a 4096-token KV cache (2048 heavy-hitter + 2048 local tokens)
python run_needle_haystack_test.py \
--input-path data/needle_test/Huggingface \
--output-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096/ \
--model-name meta-llama/Meta-Llama-3-8B-Instruct \
--enable_h2o_generation \
--num_window_length 4096 \
--num_heavy_hitter_tokens 2048


# step 3: scoring with GPT-4
# --input-path is the directory of prediction results from step 2
export OPENAI_API_KEY=YOUR_API_KEY
python utils/needle_test/eval.py \
--input-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096 \
--output-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096_eval


# step 4: visualization
python utils/needle_test/vis.py \
--input-path needle_test_results/huggingface/llama-3-8b-instruct-h2o-4096_eval
```

Review comment: Can we please add some explanation here, especially to clarify the tested length of 1M.

### One Demo on Streaming to "Infinite" Context Length

The following example demonstrates generation over an "infinite" sequence length. We use MT-Bench data and generate the context sample by sample. The KV cache keeps the KV pairs from the previous samples while maintaining a fixed size.

```
# run with full cache
# expected results: 1) normal generation at the early stage; 2) performance collapse and slower generation at the middle stage, because the sequence length exceeds the context window and the I/O cost of the KV cache constrains throughput; 3) OOM errors, then the run stops.
bash src/streaming.sh full

# run with h2o
# expected results: normal generation at all stages.
# adjust the number of heavy-hitter tokens with --num_heavy_hitter_tokens and the KV cache size with --num_window_length in src/streaming.sh
bash src/streaming.sh h2o
```
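
To see why the bounded cache keeps this demo from running out of memory, here is a small, self-contained simulation (a toy sketch, not the actual `streaming.sh` pipeline; the random scores merely stand in for the attention statistics a real model would produce). It streams tokens sample by sample and applies the same heavy-hitter-plus-local eviction, so the cache never grows past the window size:

```
import torch

def evict(keys, values, scores, window, heavy):
    """Keep `heavy` highest-score older tokens plus the most recent (window - heavy)."""
    seq_len = keys.shape[0]
    if seq_len <= window:
        return keys, values, scores
    num_local = window - heavy
    local_idx = torch.arange(seq_len - num_local, seq_len)
    heavy_idx = torch.topk(scores[: seq_len - num_local], k=heavy).indices
    keep = torch.cat([heavy_idx.sort().values, local_idx])
    return keys[keep], values[keep], scores[keep]

def stream_samples(sample_lengths, window=8, heavy=4, head_dim=4):
    keys, values = torch.empty(0, head_dim), torch.empty(0, head_dim)
    scores = torch.empty(0)
    for n_tokens in sample_lengths:
        for _ in range(n_tokens):
            # Append one new token's K/V, then evict back down to the window size.
            keys = torch.cat([keys, torch.randn(1, head_dim)])
            values = torch.cat([values, torch.randn(1, head_dim)])
            scores = torch.cat([scores, torch.rand(1)])  # stand-in attention mass
            keys, values, scores = evict(keys, values, scores, window, heavy)
        print(f"after a {n_tokens}-token sample the cache holds {keys.shape[0]} KV pairs")

stream_samples([5, 20, 100])  # the cache size stays capped at `window`
```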

@@ -0,0 +1,116 @@

July 2010

What hard liquor, cigarettes, heroin, and crack have in common is that they're all more concentrated forms of less addictive predecessors. Most if not all the things we describe as addictive are. And the scary thing is, the process that created them is accelerating.

We wouldn't want to stop it. It's the same process that cures diseases: technological progress. Technological progress means making things do more of what we want. When the thing we want is something we want to want, we consider technological progress good. If some new technique makes solar cells x% more efficient, that seems strictly better. When progress concentrates something we don't want to want—when it transforms opium into heroin—it seems bad. But it's the same process at work. [1]

No one doubts this process is accelerating, which means increasing numbers of things we like will be transformed into things we like too much. [2]

As far as I know there's no word for something we like too much. The closest is the colloquial sense of "addictive." That usage has become increasingly common during my lifetime. And it's clear why: there are an increasing number of things we need it for. At the extreme end of the spectrum are crack and meth. Food has been transformed by a combination of factory farming and innovations in food processing into something with way more immediate bang for the buck, and you can see the results in any town in America. Checkers and solitaire have been replaced by World of Warcraft and FarmVille. TV has become much more engaging, and even so it can't compete with Facebook.

The world is more addictive than it was 40 years ago. And unless the forms of technological progress that produced these things are subject to different laws than technological progress in general, the world will get more addictive in the next 40 years than it did in the last 40.

The next 40 years will bring us some wonderful things. I don't mean to imply they're all to be avoided. Alcohol is a dangerous drug, but I'd rather live in a world with wine than one without. Most people can coexist with alcohol; but you have to be careful. More things we like will mean more things we have to be careful about.

Most people won't, unfortunately. Which means that as the world becomes more addictive, the two senses in which one can live a normal life will be driven ever further apart. One sense of "normal" is statistically normal: what everyone else does. The other is the sense we mean when we talk about the normal operating range of a piece of machinery: what works best.

These two senses are already quite far apart. Already someone trying to live well would seem eccentrically abstemious in most of the US. That phenomenon is only going to become more pronounced. You can probably take it as a rule of thumb from now on that if people don't think you're weird, you're living badly.

Societies eventually develop antibodies to addictive new things. I've seen that happen with cigarettes. When cigarettes first appeared, they spread the way an infectious disease spreads through a previously isolated population. Smoking rapidly became a (statistically) normal thing. There were ashtrays everywhere. We had ashtrays in our house when I was a kid, even though neither of my parents smoked. You had to for guests.

As knowledge spread about the dangers of smoking, customs changed. In the last 20 years, smoking has been transformed from something that seemed totally normal into a rather seedy habit: from something movie stars did in publicity shots to something small huddles of addicts do outside the doors of office buildings. A lot of the change was due to legislation, of course, but the legislation couldn't have happened if customs hadn't already changed.

It took a while though—on the order of 100 years. And unless the rate at which social antibodies evolve can increase to match the accelerating rate at which technological progress throws off new addictions, we'll be increasingly unable to rely on customs to protect us. [3]

Unless we want to be canaries in the coal mine of each new addiction—the people whose sad example becomes a lesson to future generations—we'll have to figure out for ourselves what to avoid and how. It will actually become a reasonable strategy (or a more reasonable strategy) to suspect everything new.

In fact, even that won't be enough. We'll have to worry not just about new things, but also about existing things becoming more addictive. That's what bit me. I've avoided most addictions, but the Internet got me because it became addictive while I was using it. [4]

Most people I know have problems with Internet addiction. We're all trying to figure out our own customs for getting free of it. That's why I don't have an iPhone, for example; the last thing I want is for the Internet to follow me out into the world. [5]

My latest trick is taking long hikes. I used to think running was a better form of exercise than hiking because it took less time. Now the slowness of hiking seems an advantage, because the longer I spend on the trail, the longer I have to think without interruption.

Sounds pretty eccentric, doesn't it? It always will when you're trying to solve problems where there are no customs yet to guide you. Maybe I can't plead Occam's razor; maybe I'm simply eccentric. But if I'm right about the acceleration of addictiveness, then this kind of lonely squirming to avoid it will increasingly be the fate of anyone who wants to get things done. We'll increasingly be defined by what we say no to.

Notes

[1] Could you restrict technological progress to areas where you wanted it? Only in a limited way, without becoming a police state. And even then your restrictions would have undesirable side effects. "Good" and "bad" technological progress aren't sharply differentiated, so you'd find you couldn't slow the latter without also slowing the former. And in any case, as Prohibition and the "war on drugs" show, bans often do more harm than good.

[2] Technology has always been accelerating. By Paleolithic standards, technology evolved at a blistering pace in the Neolithic period.

[3] Unless we mass produce social customs. I suspect the recent resurgence of evangelical Christianity in the US is partly a reaction to drugs. In desperation people reach for the sledgehammer; if their kids won't listen to them, maybe they'll listen to God. But that solution has broader consequences than just getting kids to say no to drugs. You end up saying no to science as well.

I worry we may be heading for a future in which only a few people plot their own itinerary through no-land, while everyone else books a package tour. Or worse still, has one booked for them by the government.

[4] People commonly use the word "procrastination" to describe what they do on the Internet. It seems to me too mild to describe what's happening as merely not-doing-work. We don't call it procrastination when someone gets drunk instead of working.

[5] Several people have told me they like the iPad because it lets them bring the Internet into situations where a laptop would be too conspicuous. In other words, it's a hip flask. (This is true of the iPhone too, of course, but this advantage isn't as obvious because it reads as a phone, and everyone's used to those.)

Thanks to Sam Altman, Patrick Collison, Jessica Livingston, and Robert Morris for reading drafts of this.

Review comment: Let's clearly state the difference between this work and other long-context work that is based on training and publishes a checkpoint; here the main idea is a KV cache policy.