Requires locally run scripts #52
Hi @Georgepitt, thanks for your interest in our work. We also faced this issue during our model development; the PEFT library pushed a fix 2 months ago, so the latest version should support offline loading. Here is what I have tried on my end:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp")
# this command will return the local path; let's call it <MNTP_LOCAL_PATH>
snapshot_download(repo_id="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse")
# this command will return the local path; let's call it <SIMCSE_LOCAL_PATH>
HF_HUB_OFFLINE=1 python
import torch
from llm2vec import LLM2Vec
l2v = LLM2Vec.from_pretrained(
"<MNTP_LOCAL_PATH>",
peft_model_name_or_path="<SIMCSE_LOCAL_PATH>",
device_map="cuda" if torch.cuda.is_available() else "cpu",
torch_dtype=torch.bfloat16,
)
Here are details of the relevant library versions in my environment:
huggingface-hub 0.22.2
peft 0.10.0
transformers 4.40.1
Let me know if you have any further questions.
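In case it helps, here is a quick way to print the same versions on your side (a minimal sketch; the package names are the distribution names pip installs):
import importlib.metadata as metadata
# Print the versions of the three libraries relevant to offline loading.
for pkg in ("huggingface-hub", "peft", "transformers"):
    print(pkg, metadata.version(pkg))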
Hello! I'm in a similar boat. I tried running your script, but for the llama-3-8b model, and am having issues as well. I run the following (without an internet connection):
import torch
from llm2vec import LLM2Vec
# https://github.com/McGill-NLP/llm2vec/issues/52
l2v = LLM2Vec.from_pretrained(
"<LOCAL PATH to McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp>",
peft_model_name_or_path="<LOCAL PATH to https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse>",
device_map="cuda" if torch.cuda.is_available() else "cpu",
torch_dtype=torch.bfloat16,
)
However, I still get the following:
I understand that I still need the underlying llama-3 model from Meta (I do have access to it), but I don't know how to point to where I have that model stored locally. Is there a simple fix? Thank you!!
Thank you for your advice, @vaibhavad! I've followed it, but it still doesn't work. I don't know what went wrong. Could you give me some advice? Thank you!
Downloading the models:
Running the script:
And then it returns these errors:
I believe you also need to download the base model, which you can do with
HF_HOME=<CACHE_DIR> python -c "import transformers; transformers.AutoModel.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')"
Make sure to specify CACHE_DIR as a directory which is accessible to you in offline mode. After this, if you launch Python with
HF_HOME=<CACHE_DIR> HF_HUB_OFFLINE=1 python
then all model loading should work as expected. Let me know if these steps fix your issue.
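If it is more convenient to do the first step from inside Python, here is a minimal sketch of the same idea (the <CACHE_DIR> placeholder is whatever directory your offline nodes can reach; the environment variables must be set before transformers is imported):
import os

# Step 1 (with internet access): cache the base model into <CACHE_DIR>.
os.environ["HF_HOME"] = "<CACHE_DIR>"
import transformers
transformers.AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Step 2 (offline, in a fresh process): point at the same cache and force offline mode,
# again before importing transformers / llm2vec.
# os.environ["HF_HOME"] = "<CACHE_DIR>"
# os.environ["HF_HUB_OFFLINE"] = "1"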
Thank you very much for your help, @vaibhavad! I successfully ran the project locally. This is my setup:
One unusual thing is that I can only run with a single GPU specified; otherwise the model gets loaded repeatedly in the l2v.encode section.
@Georgepitt, glad to know the issue is resolved.
I did not fully understand this. Can you provide more details? By default, encode tries to use all the GPUs available.
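In the meantime, if you want to force single-GPU behaviour, a minimal sketch (device index 0 is only an example) is to hide the other GPUs before torch is imported:
import os
# Expose only one GPU to the process; encode() will then see a single device.
# This must run before torch / llm2vec are imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"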
It's possible that I have the same issue. In order to run your code on my system, I have to comment out lines 341-361 in llm2vec.py. If I don't, the following output occurs. I know this output is information overload, so if you could direct me to the specific outputs/logs you need to better understand the issue, I will follow up with those details.
nvidia-smi for the CUDA:0 device (all other devices on my machine are still unused):
Terminal outputs:
Hi @SouLeo, can you share your script?
As a reference, our multi-GPU encoding implementation is similar to the sentence-transformers library implementation.
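For context, that pattern looks roughly like this (a sketch from memory, not LLM2Vec code; see the sentence-transformers documentation for the exact API):
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # One worker process is started per visible GPU.
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(["an example sentence"], pool)
    model.stop_multi_process_pool(pool)
    print(embeddings.shape)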
Sure thing! Here it is:
from llm2vec import LLM2Vec
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from mteb import MTEB
MODEL_NAME = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
# Loading base Mistral model, along with custom code that enables bidirectional connections in decoder-only LLMs. MNTP LoRA weights are merged into the base model.
tokenizer = AutoTokenizer.from_pretrained(
MODEL_NAME
)
config = AutoConfig.from_pretrained(
MODEL_NAME, trust_remote_code=True
)
model = AutoModel.from_pretrained(
MODEL_NAME,
trust_remote_code=True,
config=config,
torch_dtype=torch.float16,
device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(
model,
MODEL_NAME,
)
model = model.merge_and_unload() # This can take several minutes on cpu
# Loading unsupervised SimCSE model. This loads the trained LoRA weights on top of MNTP model. Hence the final weights are -- Base model + MNTP (LoRA) + SimCSE (LoRA).
model = PeftModel.from_pretrained(
model, "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised"
)
# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
# model_name = "llama3"
# evaluation = MTEB(tasks=["Banking77Classification"])
# results = evaluation.run(l2v, output_folder=f"results/{model_name}")
# Encoding queries using instructions
instruction = (
"Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
[instruction, "how much protein should a female eat"],
[instruction, "summit define"],
]
q_reps = l2v.encode(queries)
# Encoding documents. Instruction are not required for documents
documents = [
"As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
"Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)
# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)
"""
tensor([[0.6470, 0.1619],
[0.0786, 0.5844]])
""" CPU RAM: (base) slwanna@lepp:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:           1510           9        1338           0         163        1493
Swap:             1           1           0
I have 8 NVIDIA RTX A5000 GPUs on my node. I will also look into the implementation you linked.
Hi all, I have moved my code to an internet-connected server with 8x H100s. I'm having similar issues with your multi-GPU .encode() function; see below. I'm still investigating this and don't want to see this issue get stale, but I just wanted to double check that you had tested your encode function on multi-GPU systems. I ran the following sentence-transformers model as a test and had no issues:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("hkunlp/instructor-xl")
# Our sentences we like to encode
sentences = [
"This framework generates embeddings for each input sentence",
"Sentences are passed as a list of string.",
"The quick brown fox jumps over the lazy dog.",
]
# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)
# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
print("Sentence:", sentence)
print("Embedding:", embedding)
print("") However, when I run import datasets
import torch
from llm2vec import LLM2Vec
# from beir import util
# from beir.datasets.data_loader import GenericDataLoader as BeirDataLoader
import os
from typing import Dict, List
# from beir.retrieval.evaluation import EvaluateRetrieval
dataset_name = "mteb/scidocs"
instruction = "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper: "
print("Loading dataset...")
queries = datasets.load_dataset(dataset_name, "queries")
corpus = datasets.load_dataset(dataset_name, "corpus")
batch_size = 2
print("Loading model...")
model = LLM2Vec.from_pretrained(
"McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
device_map="cuda" if torch.cuda.is_available() else "cpu",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
)
def append_instruction(instruction, sentences):
new_sentences = []
for s in sentences:
new_sentences.append([instruction, s, 0])
return new_sentences
def cos_sim(a: torch.Tensor, b: torch.Tensor):
if not isinstance(a, torch.Tensor):
a = torch.tensor(a)
if not isinstance(b, torch.Tensor):
b = torch.tensor(b)
if len(a.shape) == 1:
a = a.unsqueeze(0)
if len(b.shape) == 1:
b = b.unsqueeze(0)
a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
return torch.mm(a_norm, b_norm.transpose(0, 1))
def encode_queries(queries: List[str], batch_size: int, **kwargs):
new_sentences = append_instruction(instruction, queries)
kwargs['show_progress_bar'] = False
return model.encode(new_sentences, batch_size=batch_size, **kwargs)
def encode_corpus(corpus: List[Dict[str, str]], batch_size: int, **kwargs):
if type(corpus) is dict:
sentences = [
(corpus["title"][i] + ' ' + corpus["text"][i]).strip()
if "title" in corpus
else corpus["text"][i].strip()
for i in range(len(corpus["text"]))
]
else:
sentences = [
(doc["title"] + ' ' + doc["text"]).strip() if "title" in doc else doc["text"].strip()
for doc in corpus
]
new_sentences = append_instruction("", sentences)
return model.encode(new_sentences, batch_size=batch_size, **kwargs)
print("Encoding Queries...")
query_ids = list(queries.keys())
results = {qid: {} for qid in query_ids}
queries = [queries[qid] for qid in queries]
query_embeddings = encode_queries(queries[0]['text'][:2], batch_size=batch_size, show_progress_bar=True, convert_to_tensor=True)
I again get errors:
Hi @SouLeo,
Apologies for the delay in the response. Regarding running multi-GPU with LLM2Vec, the code needs to be shielded with if __name__ == "__main__": (as in the modified script below). I have modified your script and verified that it runs on an 8x H100 server.
import datasets
import torch
from llm2vec import LLM2Vec
# from beir import util
# from beir.datasets.data_loader import GenericDataLoader as BeirDataLoader
import os
from typing import Dict, List
# from beir.retrieval.evaluation import EvaluateRetrieval
def append_instruction(instruction, sentences):
new_sentences = []
for s in sentences:
new_sentences.append([instruction, s, 0])
return new_sentences
def cos_sim(a: torch.Tensor, b: torch.Tensor):
if not isinstance(a, torch.Tensor):
a = torch.tensor(a)
if not isinstance(b, torch.Tensor):
b = torch.tensor(b)
if len(a.shape) == 1:
a = a.unsqueeze(0)
if len(b.shape) == 1:
b = b.unsqueeze(0)
a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
return torch.mm(a_norm, b_norm.transpose(0, 1))
def encode_queries(queries: List[str], batch_size: int, **kwargs):
new_sentences = append_instruction(instruction, queries)
kwargs['show_progress_bar'] = False
return model.encode(new_sentences, batch_size=batch_size, **kwargs)
def encode_corpus(corpus: List[Dict[str, str]], batch_size: int, **kwargs):
if type(corpus) is dict:
sentences = [
(corpus["title"][i] + ' ' + corpus["text"][i]).strip()
if "title" in corpus
else corpus["text"][i].strip()
for i in range(len(corpus["text"]))
]
else:
sentences = [
(doc["title"] + ' ' + doc["text"]).strip() if "title" in doc else doc["text"].strip()
for doc in corpus
]
new_sentences = append_instruction("", sentences)
return model.encode(new_sentences, batch_size=batch_size, **kwargs)
if __name__ == "__main__":
dataset_name = "mteb/scidocs"
instruction = "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper: "
print("Loading dataset...")
queries = datasets.load_dataset(dataset_name, "queries")
corpus = datasets.load_dataset(dataset_name, "corpus")
batch_size = 2
print("Loading model...")
model = LLM2Vec.from_pretrained(
"McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
device_map="cuda" if torch.cuda.is_available() else "cpu",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
)
print("Encoding Queries...")
query_ids = list(queries.keys())
results = {qid: {} for qid in query_ids}
queries = [queries[qid] for qid in queries]
query_embeddings = encode_queries(queries[0]['text'][:2], batch_size=batch_size, show_progress_bar=True, convert_to_tensor=True)
Please check if this script is working on your end, and feel free to ask any other questions.
@vaibhavad Here is a little snippet showing how, when I specify two GPUs, the model gets loaded repeatedly. Logs:
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
Start loading model
Hello @Georgepitt, can you share the script that you are running?
Of course! Here is emxample.sh
test_example.py
Hi @Georgepitt, please refer to my response above
You'll need to modify your script in the same way, i.e. shield the model loading and encoding with if __name__ == "__main__": (see the sketch below).
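Since I don't know exactly what test_example.py contains, this is only a sketch of the change (the model names are the ones used earlier in this thread): everything that loads the model and calls encode moves under the guard, so the worker processes spawned by encode do not re-execute it when they re-import the file.
import torch
from llm2vec import LLM2Vec

def main():
    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )
    print(l2v.encode(["a small test sentence"]))

if __name__ == "__main__":
    # Worker processes re-import this module; the guard keeps them from
    # loading the model again at import time.
    main()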
Hello, the computing cluster provided by the lab has to run offline, but the code in the usage example needs network access. I have changed the code to an offline version, but it still gives errors. Can you give me some help, please?
usage code:
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModel
from peft import PeftModel

# Loading base Mistral model, along with custom code that enables bidirectional connections in decoder-only LLMs.
tokenizer = AutoTokenizer.from_pretrained(
"McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
)
config = AutoConfig.from_pretrained(
"McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
"McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
trust_remote_code=True,
config=config,
torch_dtype=torch.bfloat16,
device_map="cuda" if torch.cuda.is_available() else "cpu",
)
# Loading MNTP (Masked Next Token Prediction) model.
model = PeftModel.from_pretrained(
model,
"McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
)
Modified code:
local_base_model_path = "/home/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
tokenizer = AutoTokenizer.from_pretrained(local_base_model_path)
config = AutoConfig.from_pretrained(local_base_model_path)
model = AutoModel.from_pretrained(local_base_model_path, config=config,torch_dtype=torch.bfloat16, local_files_only=True)
print(4)
errors:
Traceback (most recent call last):
File "/share/home/chenyuxuan/Llama3_8b_s.py", line 59, in
model = AutoModel.from_pretrained(local_model_path, config=config,torch_dtype=torch.bfloat16, local_files_only=True)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3385, in from_pretrained
if has_file(pretrained_model_name_or_path, TF2_WEIGHTS_NAME, **has_file_kwargs):
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/transformers/utils/hub.py", line 627, in has_file
r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=10)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/requests/api.py", line 100, in head
return request("head", url, **kwargs)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
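For reference, a sketch combining the modified code above with the offline-mode advice given earlier in this thread: HF_HUB_OFFLINE is set before transformers is imported, trust_remote_code=True from the original usage code is kept, and the local path is the one used above.
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # force transformers / huggingface_hub to resolve everything locally

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModel

local_base_model_path = "/home/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
tokenizer = AutoTokenizer.from_pretrained(local_base_model_path)
config = AutoConfig.from_pretrained(local_base_model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    local_base_model_path,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)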