
Investigate problems with ctranslate2 #75

Open
SebastianBodza opened this issue Aug 17, 2023 · 7 comments
Labels
stuck Issue is stuck on something

Comments

@SebastianBodza

SebastianBodza commented Aug 17, 2023

Is it possible to investigate the problems with ctranslate2 in more detail? The library is one of the fastest and supports token streaming. Unfortunately, with beam search no token streaming is possible, and performance there is quite bad :/

Is there any way to run the interview locally?

P.S.: in the README, cformers2 should be ctranslate2.

@SebastianBodza
Author

Shouldn't the parameter for the beam size be beam_size instead of num_hypotheses?
def generate(self, prompt, params):
According to the following
https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html#ctranslate2.Generator.generate_batch
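Per the linked docs, generate_batch takes both parameters: beam_size is the number of beams explored during search, while num_hypotheses is how many finished hypotheses are returned (it must not exceed beam_size). A hedged sketch of how the interview's params could be mapped onto those keyword arguments; the map_beam_params helper is illustrative, not part of ctranslate2:

```python
# Illustrative sketch: mapping generic sampler params onto the
# keyword arguments of ctranslate2.Generator.generate_batch.
# map_beam_params is a hypothetical helper, not a ctranslate2 API.

def map_beam_params(params):
    """Translate generic params into CTranslate2-style kwargs.

    beam_size      - number of beams explored during the search
    num_hypotheses - number of finished hypotheses returned (<= beam_size)
    """
    beam_size = params.get("beam_size", 1)
    num_hypotheses = params.get("num_hypotheses", 1)
    if num_hypotheses > beam_size:
        raise ValueError("num_hypotheses cannot exceed beam_size")
    return {
        "beam_size": beam_size,
        "num_hypotheses": num_hypotheses,
        "max_length": params.get("max_new_tokens", 512),
    }

kwargs = map_beam_params({"beam_size": 4, "num_hypotheses": 2})
# These kwargs would then be forwarded as, roughly:
# generator.generate_batch([tokens], **kwargs)
```

So passing the beam count as num_hypotheses alone would change how many results come back, not how wide the search is.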

@the-crypt-keeper
Owner

the-crypt-keeper commented Aug 20, 2023

@SebastianBodza I really need to do a better job in the README, yes you can run everything locally.

Create the prompts with prepare.py --template prompts/Wizard-Coder.txt, which should print Expanded 28 Wizard-Coder prompts to results/prepare_junior-v2_python-javascript_Wizard-Coder.ndjson

Now you can run the ctranslate2 (why does my brain refuse to remember this correctly ugh) interview with:

./interview_cuda.py --runtime ctranslate2 --model_name michaelfeil/ct2fast-WizardCoder-15B-V1.0 --params params/wizardcoder.json --input results/prepare_junior-v2_python-javascript_Wizard-Coder.ndjson

This will download the model from HF if it's not already cached. My initial observations when implementing this runtime in #62 were that if you try params/precise.json instead of params/wizardcoder.json, the results were very different from what every other runtime produced with those settings (and not very good).

As to your second point: that's an interesting thought. There should be two parameters to beam searching: one for the number of beams to consider and another for the size or length of those beams. When I first went through the docs I left with the impression that the beam_size parameter is the beam length, while num_hypotheses is the number of beams, but now I'm not so sure and it's possible I got them backwards. Like you mentioned, beam searching is slow (because each beam is effectively an inference stream), so I tend to stick to simpler sampling for evaluations just because of resource constraints.
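The "each beam is effectively an inference stream" point is why cost grows roughly linearly with beam count: every step, each surviving beam is extended by every candidate token and only the top beam_size candidates are kept. A toy sketch of one expansion step (not CTranslate2 code):

```python
import heapq

def beam_step(beams, next_scores, beam_size):
    """One beam-search expansion step over a toy vocabulary.

    beams: list of (cumulative_log_prob, token_list)
    next_scores: dict token -> log prob (same for every beam, for simplicity)
    Returns the beam_size best extended beams.
    """
    candidates = []
    for score, tokens in beams:
        for tok, lp in next_scores.items():
            candidates.append((score + lp, tokens + [tok]))
    # Keep only the highest-scoring beam_size candidates.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

beams = [(0.0, [])]
scores = {"a": -0.1, "b": -0.5, "c": -2.0}
beams = beam_step(beams, scores, beam_size=2)
# Two beams survive, led by ["a"], then ["b"]
```

With beam_size=1 this degenerates to greedy decoding, which is why greedy/sampling runs are so much cheaper for batch evaluations.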

@SebastianBodza
Author

SebastianBodza commented Aug 21, 2023

Thanks for the clarification! I ran some tests locally.
I suspect it is rather related to the repetition penalty. Without repetition_penalty and repeat_last_n:

Python Passed 85 of 91
JavaScript Passed 75 of 91

However, it also seems to be a bit unstable. Another run with the same settings:

Python Passed 88 of 91
JavaScript Passed 82 of 91

For the beam_size I think you are right; num_hypotheses should be correct.

@the-crypt-keeper
Owner

@SebastianBodza Yes, something seems to be wrong with the implementation of repeat penalty in this runtime, but I haven't yet dived into the code to see what's up. This isn't normally a complex operation.

If you want to try it on something with repeat penalty that should be otherwise stable, that's the goal of params/greedy.json.
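For reference on why this "isn't normally a complex operation": the common GPT-2-style convention (which I assume CTranslate2 follows, though the thread doesn't confirm it) divides positive logits and multiplies negative logits of already-seen tokens by the penalty. A minimal illustrative sketch:

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """GPT-2-style repetition penalty (illustrative sketch, assumed
    convention; not taken from the CTranslate2 source).

    logits: list of floats indexed by token id
    seen_tokens: set of token ids already generated
    penalty: > 1.0 discourages repeats; 1.0 is a no-op
    """
    out = list(logits)
    for t in seen_tokens:
        if out[t] > 0:
            out[t] /= penalty   # shrink positive logits toward 0
        else:
            out[t] *= penalty   # push negative logits further down
    return out

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], {0, 1}, 2.0)
# -> [1.0, -2.0, 0.5]; the unseen token (id 2) is untouched
```

If a runtime's results diverge badly with penalty > 1, either the convention differs or the penalized token window (repeat_last_n) is being tracked incorrectly.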

@the-crypt-keeper
Owner

I've implemented batching and basic stop-seq support for this runtime, but batching seems to only make the instability problems here worse :/

I wonder if upstream issue #1425 is related and we have some unstable-sort issues happening here.

@guillaumekln

Hi,

The issue related to the callback in batch mode should be fixed in ctranslate2>=3.19.0. The returned batch_ids were mixed up.

However, I'm not sure what the issue with repetition penalty is. For now I suggest forcing this value to 1 for CTranslate2 if that works for you. In general, repetition penalty should not be needed when using a random sampler.
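The batch_ids mix-up described above matters because a streaming callback receives tokens from all prompts in a batch interleaved, and the consumer demultiplexes them by batch id. A sketch of that demux step (plain pairs here; the actual CTranslate2 callback object and its field names are not shown, so treat the shape as an assumption):

```python
from collections import defaultdict

def demux(step_results):
    """Group streamed tokens by batch id.

    step_results: iterable of (batch_id, token) pairs in arrival order.
    If the runtime reports wrong batch ids (the pre-3.19.0 bug),
    tokens land in the wrong per-prompt streams.
    """
    streams = defaultdict(list)
    for batch_id, token in step_results:
        streams[batch_id].append(token)
    return dict(streams)

# Correct ids keep each prompt's tokens together:
ok = demux([(0, "def"), (1, "function"), (0, " foo"), (1, " bar")])
# -> {0: ["def", " foo"], 1: ["function", " bar"]}
```

With mixed-up ids, the same code silently produces corrupted completions, which would look exactly like the "instability" reported earlier in the thread.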

@the-crypt-keeper
Owner

@guillaumekln I am having trouble with this runtime following the upgrade of my container to CUDA 12.1; it complains: RuntimeError: Library libcublas.so.11 is not found or cannot be loaded

Does ct2 only support CUDA 11 at this time?

@the-crypt-keeper the-crypt-keeper added the stuck Issue is stuck on something label Dec 31, 2023