
Need Help Model Formatting For HelloLlamaLocal #388

Open

nsubordin81 opened this issue Mar 6, 2024 · 8 comments

nsubordin81 commented Mar 6, 2024

I have been running into some errors early on in the HelloLlamaLocal (llama.cpp) example. I moved to this one because it was more accessible, but I'm ultimately looking to work through the RAG chatbot example. I've been trying to execute the steps from the notebook one at a time in a separate repo to get through each step of the recipe. Here is the sequence I've attempted so far:

Hardware: Apple M3 Max MacBook, 128 GiB memory

1. Downloaded Llama2-70b from Meta.
2. Ran the 'convert_llama_weights_to_hf.py' script, since llama-recipes advises that Hugging Face checkpoints are needed for all demos and examples in the repo. The conversion was successful.
3. In HelloLlamaLocal.ipynb, noticed that the 4th cell instructs using llama.cpp to convert the model to gguf format, so I attempted to do this with the model I'd already converted to Hugging Face checkpoints, and I have been getting errors. Here is the cell from that notebook:
```
# Set up the Llama 2 model.
#
# Replace <path-to-llama-gguf-file> with the path either to your downloaded quantized model file here,
# or to the ggml-model-q4_0.gguf file built with the following commands:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python3 -m pip install -r requirements.txt
python convert.py <path_to_your_downloaded_llama-2-13b_model>
./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0

# For more info see https://python.langchain.com/docs/integrations/llms/llamacpp
```

I cloned llama.cpp, installed the requirements, and ran the following from the llama.cpp root:

```
python convert.py --outfile llama2-70b.gguf --vocab-type hfft ../huggingface/llama2_70
```

and I received the following logs and stack trace:

```
Loading model file ../huggingface/llama2_70/pytorch_model-00001-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00001-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00002-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00003-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00004-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00005-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00006-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00007-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00008-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00009-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00010-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00011-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00012-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00013-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00014-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00015-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00016-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00017-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00018-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00019-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00020-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00021-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00022-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00023-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00024-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00025-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00026-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00027-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00028-of-00029.bin
Loading model file ../huggingface/llama2_70/pytorch_model-00029-of-00029.bin
params = Params(n_vocab=32000, n_embd=8192, n_layer=80, n_ctx=2048, n_ff=28672, n_head=64, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('../huggingface/llama2_70'))
Found vocab files: {'spm': PosixPath('../huggingface/llama2_70/tokenizer.model'), 'bpe': None, 'hfft': PosixPath('../huggingface/llama2_70/tokenizer.json')}
Loading vocab file PosixPath('../huggingface/llama2_70/tokenizer.json'), type 'hfft'
fname_tokenizer: ../huggingface/llama2_70
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Vocab info: <HfVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0}, add special tokens {'bos': True, 'eos': False}>

result = futures.pop(0).result()
         ^^^^^^^^^^^^^^^^^^^^^^^

File "/Users//.pyenv/versions/3.12.2/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Users//.pyenv/versions/3.12.2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users//.pyenv/versions/3.12.2/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 1101, in do_item
tensor = lazy_tensor.load().to_ggml()
^^^^^^^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 649, in load
ret = self._load()
^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 659, in load
return self.load().astype(data_type)
^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 649, in load
ret = self._load()
^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 737, in load
return lazy_tensor.load().permute(n_head, n_head_kv)
^^^^^^^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 649, in load
ret = self._load()
^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 809, in load
return UnquantizedTensor(storage.load(storage_offset, elm_count).reshape(size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users//code/llms/llama.cpp/convert.py", line 793, in load
fp = self.zip_file.open(info)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users//.pyenv/versions/3.12.2/lib/python3.12/zipfile/init.py", line 1643, in open
raise BadZipFile(f"Overlapped entries: {zinfo.orig_filename!r} (possible zip bomb)")
zipfile.BadZipFile: Overlapped entries: 'pytorch_model-00001-of-00029/data/11' (possible zip bomb)

I'm not trying to attack my own machine, and I got these files 100% from Meta and converted them with code from this repo. My guess is that llama.cpp's convert.py intended for me to use the model downloaded from Meta directly, with the original checkpoints, but I wasn't sure because the overall README.md for this repo suggests converting the downloaded Llama2 model to HF checkpoints as a first step.

If this is a documentation interpretation issue, that would be good to know, and I'd probably raise it as a separate issue. If it's a problem with the way I'm attempting to convert this model to gguf format, then I would really appreciate some assistance.

Thanks for making these open source and giving people like me a chance to get familiar with them. Love the work you are doing!

@nsubordin81 (Author)

I may have also arbitrarily selected a Hugging Face vocab even though Llama2 may not use one. I was grasping at straws a bit to avoid having to download another gigantic model. If the right thing to do is start with the vanilla Meta Llama2 download for this example, I'm all for it; I'm just wondering if someone can help me figure out the right path to save me some trial and error.

@nsubordin81 (Author)

OK, I'm going to close this because I think I'm asking in the wrong repo. I don't think it was necessarily to do with my model checkpoints so much as my Python version. I was using 3.12.2, given I'm on a new machine, and I think llama.cpp's convert script specifically might have an issue with that, because the zipfile utility in Python may have new protections in it. That, or the model file really is malformed and the newer CPython is correctly protecting me, which would make sense if I weren't getting this file from a reputable source.

Downgrading to 3.10.13 resolved my issue.
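
For anyone who hits the same BadZipFile error, here is a rough sketch (the shard path is a placeholder for wherever your converted model lives) of checking whether the zipfile module in your current Python rejects a PyTorch shard before running the full convert.py:

```python
import sys
import zipfile

# Path to one of the PyTorch shards produced by the HF conversion
# (placeholder -- adjust to your own output directory).
SHARD = "../huggingface/llama2_70/pytorch_model-00001-of-00029.bin"

print(f"Python {sys.version.split()[0]}")

try:
    # torch.save() writes a zip archive, so the shard opens with zipfile.
    with zipfile.ZipFile(SHARD) as zf:
        for info in zf.infolist():
            # Opening each member is what triggers the stricter
            # "Overlapped entries" check in newer CPython releases.
            with zf.open(info):
                pass
    print("zipfile accepted every entry in this shard")
except zipfile.BadZipFile as exc:
    print(f"zipfile rejected the shard: {exc}")
```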

@nsubordin81 (Author)

Actually, I might still have issues, since I kept the Meta checkpoints in place. That part of the docs could be clearer, so I'm keeping this open just for that ambiguity to be addressed in this cell (a quick format-check sketch follows the cell):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python3 -m pip install -r requirements.txt
python convert.py <path_to_your_downloaded_llama-2-13b_model> # is this model using HF checkpoints as described in the overall README.md?
./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0
```
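
In case it helps with that ambiguity, here is a rough sketch (the file names are just what the Meta download and the HF conversion produced on my machine, and the path is a placeholder) of telling which format a local model directory is in before pointing convert.py at it:

```python
from pathlib import Path


def checkpoint_format(model_dir: str) -> str:
    """Best-effort guess at whether a directory holds the original Meta
    checkpoints or a Hugging Face conversion."""
    d = Path(model_dir)
    # Original Meta download: consolidated.*.pth shards plus params.json.
    if list(d.glob("consolidated.*.pth")) and (d / "params.json").exists():
        return "meta"
    # HF conversion output: config.json plus pytorch_model-*.bin
    # or *.safetensors shards.
    if (d / "config.json").exists() and (
        list(d.glob("pytorch_model*.bin")) or list(d.glob("*.safetensors"))
    ):
        return "huggingface"
    return "unknown"


# Placeholder path -- point this at your own download or conversion output.
print(checkpoint_format("../huggingface/llama2_70"))
```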

nsubordin81 reopened this Mar 6, 2024

nsubordin81 commented Mar 8, 2024

To test my theory that I needed HF checkpoints, I converted the Llama 13b model to .gguf format and attempted to use it with this notebook. This model had not been converted to Hugging Face checkpoints. I was able to instantiate the 'llm' object with the LlamaCpp runner, and it worked successfully (after one failed attempt where I received a kernel failure from using the partially written .gguf model left over from my attempt with 3.12.2). I am also trying the 70b parameter model that was converted to HF checkpoints; inference has been incredibly slow, but for this test I also haven't been plugged into a power supply.

I'm probably going to re-close this issue, since it looks like my only blocker was using the latest Python version rather than the choice of checkpoints. If there are advantages to doing things one way or the other for this particular notebook, then more prescriptive instructions would be good, but that's up to the authors.

@nsubordin81 (Author)

I ran HelloLlamaLocal with llama2-70b, downloaded from Meta, converted to HF checkpoints, converted to gguf with llama.cpp, and then ran inference with it. Here is the generated output followed by the inference metrics on an M3 Max Mac with 128 GiB:

The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, generally referred to as The Innovator's Dilemma, first published in 1997, is the most well-known work of the Harvard professor and businessman Clayton Christensen. The book focuses on disruptive innovation and why industry-leading businesses can fail "by doing everything right." It is one of the most well-known business books ever written.
In his book, Christensen examines how successful businesses can lose their market leadership when confronted with disruptive technologies. He argues that good management and operational efficiency are not enough to guarantee success. Instead, companies must be willing to embrace new technologies and business models, even if they threaten their existing business.
Christensen's theory of disruptive innovation has been widely influential and has been applied to a variety of industries. It has also been criticized by some for being too simplistic and for not taking into account other factors that can contribute to a company's success or failure.
Despite its critics, The Innovator's Dilemma remains one of

```
llama_print_timings: load time = 107660.07 ms
llama_print_timings: sample time = 31.55 ms / 256 runs ( 0.12 ms per token, 8113.59 tokens per second)
llama_print_timings: prompt eval time = 227701.84 ms / 15 tokens (15180.12 ms per token, 0.07 tokens per second)
llama_print_timings: eval time = 36132243.40 ms / 255 runs (141695.07 ms per token, 0.01 tokens per second)
llama_print_timings: total time = 36361330.68 ms / 270 tokens
```

This was very slow. I'm wondering if the issue has to do with the way I set up the model and all of those conversions, or with running it using the basic LlamaCpp defaults, such that I'm not taking advantage of all my hardware. I did some of the inference on battery and some on the power supply. I was wondering if there are ways to do more detailed diagnostics on this. I was also thinking of trying Ollama or other options and wondering if those would be successful. Thanks for any help!
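
In case it's just the defaults, here is a rough sketch of pointing the LangChain LlamaCpp wrapper at the gguf file with GPU offload turned on, per the llamacpp integration docs linked above. This assumes the langchain_community package (the notebook may import LlamaCpp from langchain.llms instead) and a llama-cpp-python build with Metal enabled; the path and parameter values are guesses for an M3 Max, not tested settings:

```python
from langchain_community.llms import LlamaCpp

# Placeholder path to the quantized model produced by the convert/quantize steps above.
llm = LlamaCpp(
    model_path="../huggingface/llama2_70/ggml-model-q4_0.gguf",
    n_gpu_layers=-1,  # offload all layers to Metal (needs llama-cpp-python built with Metal support)
    n_batch=512,      # tokens processed in parallel during prompt evaluation
    n_ctx=2048,       # context window, matching the convert.py defaults above
    f16_kv=True,      # half-precision key/value cache
    verbose=True,     # print llama.cpp timings so the effect of offloading is visible
)

print(llm.invoke("Summarize The Innovator's Dilemma in two sentences."))
```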

@nsubordin81 (Author)

As a test, I used Ollama instead to pull the model from its registry and perform inference, and that produced results more in line with my hardware, so I suspect I took a wrong turn somewhere between getting the model from Meta, converting its checkpoints and format, and using this notebook to initialize it with LlamaCpp. I'm still curious about where I might have gone wrong, but I also have a workable solution now, so I'm happy to close the issue as a one-off mistake on my end if others don't experience similar issues.

@jeffxtang (Contributor)

@nsubordin81 Thanks for all the updates. Running Llama2-70b locally with llama.cpp used to be slow, at least (I haven't checked out their latest update). How fast were you able to run Llama 2 7b using Ollama?

@nsubordin81 (Author)

So with Ollama + LangChain I was able to get the following with a llama2 model, using a chain that retrieved embeddings from a vector store holding a single epub book (I just called time.time() before and after the call to .invoke, did split() on the result string, and manually divided; a rough sketch of that timing code is below the numbers):

```
num tokens in output = 395
total_time = 78.57125115394592
Tokens/Second = 5.027284079084729
```
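
Here is roughly what that measurement looked like, as a simplified sketch: the real chain also did the vector store retrieval, and the model name, prompt, and import path (langchain_community) are assumptions rather than the exact code I ran:

```python
import time

from langchain_community.llms import Ollama

# Assumes a local Ollama server with the llama2 model already pulled;
# the real setup also retrieved embeddings from a vector store first.
llm = Ollama(model="llama2")

prompt = "What is the key argument of The Innovator's Dilemma?"  # placeholder prompt

start = time.time()
result = llm.invoke(prompt)
total_time = time.time() - start

# Crude token count: whitespace-split words, not real tokenizer tokens.
num_tokens = len(result.split())

print(f"num tokens in output = {num_tokens}")
print(f"total_time = {total_time}")
print(f"Tokens/Second = {num_tokens / total_time}")
```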

By contrast, the setup I had with llama.cpp when following the local Llama notebook, as you can see above, was the kind of thing where I eventually had to get up and leave my machine because it was taking so long. I'd be curious to know the core issue, though it may be hard to figure out from my metrics above without additional input.

I'm going back through the steps, trying to figure out where I could have diverged, or whether my hardware and architecture (the M3 Max) require special configuration that isn't called out in the docs, or that I missed in any case. It definitely feels like I wasn't taking advantage of the machine's capabilities, but it did run, so that just leaves me guessing for the most part.
