
feat: ggml: support more parameters from llama.cpp #3314

Open
dm4 opened this issue Apr 2, 2024 · 4 comments
Labels
enhancement

Comments

@dm4
Collaborator

dm4 commented Apr 2, 2024

Summary

We currently support some parameters from llama.cpp, such as n_gpu_layers, ctx-size, and threads, and we expect to support even more.

Details

Referring to gpt_params_find_arg() in llama.cpp/common/common.cpp, we plan to support the additional parameters listed in the appendix below.
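
Since the ggml plugin takes these options through a JSON metadata string, supporting a new one is largely a matter of parsing one more key. Below is a minimal, hypothetical sketch of that pattern; it assumes simdjson-style parsing, and the field names, defaults, and the parseMetadata helper are illustrative rather than the actual plugin API:

#include <simdjson.h>

#include <cstdint>
#include <string>

struct Graph {
    // Options already supported (defaults here are placeholders).
    int64_t NGPULayers = 0;
    uint64_t CtxSize = 512;
    uint64_t Threads = 4;
    // Candidate additions from the list in the appendix.
    uint64_t NPredict = 512;
    double Temp = 0.8;
    double RepeatPenalty = 1.1;
};

// Fill the optional fields from the metadata JSON; absent keys keep defaults.
bool parseMetadata(Graph &GraphRef, const std::string &Metadata) {
    simdjson::dom::parser Parser;
    simdjson::dom::element Doc;
    if (Parser.parse(Metadata).get(Doc)) {
        return false; // Invalid JSON.
    }
    int64_t NPredict;
    if (!Doc["n-predict"].get(NPredict)) {
        GraphRef.NPredict = static_cast<uint64_t>(NPredict);
    }
    double Temp;
    if (!Doc["temp"].get(Temp)) {
        GraphRef.Temp = Temp;
    }
    double RepeatPenalty;
    if (!Doc["repeat-penalty"].get(RepeatPenalty)) {
        GraphRef.RepeatPenalty = RepeatPenalty;
    }
    return true;
}

A guest application would then pass a metadata string such as {"n-predict": 256, "temp": 0.7, "repeat-penalty": 1.1} when loading the graph.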

Appendix

Full list of options:

  • --seed
  • --threads
  • --threads-batch
  • --threads-draft
  • --threads-batch-draft
  • --prompt
  • --escape
  • --prompt-cache
  • --prompt-cache-all
  • --prompt-cache-ro
  • --binary-file
  • --file
  • --n-predict
  • --top-k
  • --ctx-size
  • --grp-attn-n
  • --grp-attn-w
  • --rope-freq-base
  • --rope-freq-scale
  • --rope-scaling
  • --rope-scale
  • --yarn-orig-ctx
  • --yarn-ext-factor
  • --yarn-attn-factor
  • --yarn-beta-fast
  • --yarn-beta-slow
  • --pooling
  • --defrag-thold
  • --samplers
  • --sampling-seq
  • --top-p
  • --min-p
  • --temp
  • --tfs
  • --typical
  • --repeat-last-n
  • --repeat-penalty
  • --frequency-penalty
  • --presence-penalty
  • --dynatemp-range
  • --dynatemp-exp
  • --mirostat
  • --mirostat-lr
  • --mirostat-ent
  • --cfg-negative-prompt
  • --cfg-negative-prompt-file
  • --cfg-scale
  • --batch-size
  • --ubatch-size
  • --keep
  • --draft
  • --chunks
  • --parallel
  • --sequences
  • --p-split
  • --model
  • --model-draft
  • --alias
  • --model-url
  • --hf-repo
  • --hf-file
  • --lora
  • --lora-scaled
  • --lora-base
  • --control-vector
  • --control-vector-scaled
  • --control-vector-layer-range
  • --mmproj
  • --image
  • --interactive
  • --embedding
  • --interactive-first
  • --instruct
  • --chatml
  • --infill
  • --dump-kv-cache
  • --no-kv-offload
  • --cache-type-k
  • --cache-type-v
  • --multiline-input
  • --simple-io
  • --cont-batching
  • --color
  • --mlock
  • --gpu-layers / --n-gpu-layers
  • --gpu-layers-draft / --n-gpu-layers-draft
  • --main-gpu
  • --split-mode
  • --tensor-split
  • --no-mmap
  • --numa
  • --verbose-prompt
  • --no-display-prompt
  • --reverse-prompt
  • --logdir
  • --lookup-cache-static
  • --lookup-cache-dynamic
  • --save-all-logits / --kl-divergence-base
  • --perplexity / --all-logits
  • --ppl-stride
  • --print-token-count
  • --ppl-output-type
  • --hellaswag
  • --hellaswag-tasks
  • --winogrande
  • --winogrande-tasks
  • --multiple-choice
  • --multiple-choice-tasks
  • --kl-divergence
  • --ignore-eos
  • --no-penalize-nl
  • --logit-bias
  • --help
  • --version
  • --random-prompt
  • --in-prefix-bos
  • --in-prefix
  • --in-suffix
  • --grammar
  • --grammar-file
  • --override-kv
dm4 added the enhancement label Apr 2, 2024
@jaydee029
Contributor

Is this issue open for contributions? If yes, I would love to look into this.

@dm4
Collaborator Author

dm4 commented Apr 6, 2024

> Is this issue open for contributions? If yes, I would love to look into this.

Yes, this issue is open for contributions. We welcome your input and any code related to this issue.

@Fusaaaann
Contributor

Fusaaaann commented May 11, 2024

Some parameters, such as --parallel and --draft, are not directly used in the internal implementation of llama.cpp (see the search results for "n_parallel" in llama.cpp).
Only some parameters, such as those related to RoPE, affect the internal behavior of llama.cpp functions; for the rest, integrating the processing logic needed to support them could completely change the implementation of compute(), as in the example below:

Abstract of integrating `--parallel` and `--draft`, parsed as optional parameters in WasmEdge:

struct Graph {
    // ...
    // New fields filled from the optional parameters.
    uint64_t NParallel = 1;
    uint64_t NDraft = 1;
};

Expect<ErrNo> compute(WasiNNEnvironment &Env, uint32_t ContextId) noexcept {
    // ...
    // If --draft and --parallel are set, take the speculative decoding path;
    // otherwise fall back to the current implementation.
    ReturnCode = SpeculativeDecoding(GraphRef, CxtRef);
    // ...
}

ErrNo SpeculativeDecoding(Graph &GraphRef, Context &CxtRef) noexcept {
    // Implementation along the lines of
    // https://github.com/ggerganov/llama.cpp/blob/3292733f95d4632a956890a438af5192e7031c12/examples/speculative/speculative.cpp
}
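
A companion sketch of how the two new fields could be populated (hypothetical, not the plugin's actual API: the keys n-parallel and n-draft are assumed to mirror llama.cpp's `--parallel` and `--draft`, and Doc is the simdjson element produced by the plugin's existing metadata parsing):

// Hypothetical helper: read the two new options from the parsed metadata JSON.
// Absent keys leave the defaults (NParallel = 1, NDraft = 1) untouched.
void parseSpeculativeOptions(Graph &GraphRef, const simdjson::dom::element &Doc) {
    int64_t Val;
    if (!Doc["n-parallel"].get(Val)) {
        GraphRef.NParallel = static_cast<uint64_t>(Val);
    }
    if (!Doc["n-draft"].get(Val)) {
        GraphRef.NDraft = static_cast<uint64_t>(Val);
    }
}

compute() could then take the SpeculativeDecoding() path only when GraphRef.NDraft > 1.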

Detailed code: https://github.com/Fusaaaann/WasmEdge/blob/ae718df452658df555e2b4fe35e8c90e69c5c55f/plugins/wasi_nn/strategies/strategies.cpp#L234

What is WasmEdge's plan for supporting these parameters if the wasi-nn functions become too complex to fit in one ggml.cpp file?

@hydai
Member

hydai commented May 14, 2024

Hi @Fusaaaann
We don't have a firm timeline for supporting the above parameters. If an application requires these options, we will raise their priority. There are already two different code paths in our plugin for handling normal LLM and LLaVA applications, so we don't mind if the complexity increases after adding more parameters.
