[Question] mlc_llm serve fails with --speculative-mode, does it require certain hardware? #2350

Closed
0xDEADFED5 opened this issue May 16, 2024 · 2 comments
Labels
question Question about the usage

Comments

0xDEADFED5 commented May 16, 2024

Using the nightly wheels. I can serve just fine with --speculative-mode disable, but all the other options give me this:

Exception in thread Thread-11 (_background_loop):
Traceback (most recent call last):
  File "C:\Users\ANON\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "C:\Users\ANON\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\ANON\repos\AI_Grotto\mlcvenv\Lib\site-packages\mlc_llm\serve\engine_base.py", line 482, in _background_loop
    self._ffi["run_background_loop"]()
  File "C:\Users\ANON\repos\AI_Grotto\mlcvenv\Lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "C:\Users\ANON\repos\AI_Grotto\mlcvenv\Lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\mlc-llm\cpp\serve\engine.cc", line 145
InternalError: Check failed: n->models_.size() > 1U (1 vs. 1) :

Does speculative-mode have other requirements?
OS: Windows 11, HW: Intel Arc A770
Thanks for the great project, btw.

0xDEADFED5 added the question label May 16, 2024
MasterJH5574 (Collaborator) commented May 28, 2024

Hi @0xDEADFED5, sorry for the late reply. Speculative decoding works with two models, so only changing --speculative-mode to small_draft won't work. Thanks for bringing this up, and we'll improve the error message to avoid the confusion here.

Here's an example command you could use to enable speculative decoding; it uses the 4-bit-quantized Llama 3 8B model as the draft model for the unquantized 8B target model.

mlc_llm serve "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC" \
  --additional-models "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" \
  --speculative-mode "small_draft"
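
Once the server is up, requests go through the usual OpenAI-compatible endpoint regardless of the speculative mode, so a minimal sanity check from Python might look like the sketch below. The host/port (127.0.0.1:8000) and the model string are assumptions based on the default serve settings; adjust them to your own invocation.

# Minimal sketch, assuming the server runs on the default 127.0.0.1:8000;
# the model string should match the main model passed to `mlc_llm serve`.
import requests

payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC",
    "messages": [{"role": "user", "content": "What is speculative decoding?"}],
    "stream": False,
}
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=120
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])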

0xDEADFED5 (Author)

Interesting! Thanks for the reply.
