Releases · NVIDIA/TensorRT-LLM

05 Jun 13:02

kaiyux

v0.10.0

9bd15f1

Latest

Hi,

We are very pleased to announce the 0.10.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Key Features and Enhancements

The Python high level API
- Added embedding parallel, embedding sharing, and fused MLP support.
- Enabled the usage of the executor API.
Added a weight-stripping feature with a new trtllm-refit command. For more information, refer to examples/sample_weight_stripping/README.md.
Added a weight-streaming feature. For more information, refer to docs/source/advanced/weight-streaming.md.
Enhanced the multiple profiles feature: the --multiple_profiles argument of the trtllm-build command now builds more optimization profiles for better performance.
Added FP8 quantization support for Mixtral.
Added support for pipeline parallelism for GPT.
Optimized applyBiasRopeUpdateKVCache kernel by avoiding re-computation.
Reduced overheads between enqueue calls of TensorRT engines.
Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
Added debug options (--visualize_network and --dry_run) to the trtllm-build command to visualize the TensorRT network before engine build.
Integrated the new NVIDIA Hopper XQA kernels for LLaMA 2 70B model.
Improved the performance of pipeline parallelism when enabling in-flight batching.
Supported quantization for Nemotron models.
Added LoRA support for Mixtral and Qwen.
Added in-flight batching support for ChatGLM models.
Added support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models.
Enhanced the custom AllReduce by adding a heuristic that falls back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
Optimized the performance of checkpoint conversion process for LLaMA.
Benchmark
- [BREAKING CHANGE] Moved the request rate generation arguments and logic from prepare dataset script to gptManagerBenchmark.
- Enabled streaming and added support for Time To First Token (TTFT) latency and Inter-Token Latency (ITL) metrics in gptManagerBenchmark.
- Added the --max_attention_window option to gptManagerBenchmark.
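
For readers unfamiliar with the two streaming metrics named above, they are commonly derived from per-token arrival timestamps. The following is a minimal sketch of that arithmetic, not the gptManagerBenchmark implementation; the function name `latency_metrics` and its inputs are hypothetical.

```python
from statistics import mean


def latency_metrics(request_start, token_times):
    """Return (ttft, mean_itl) in the same time unit as the inputs.

    request_start: time at which the request was submitted.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token is required")
    # TTFT: delay between submitting the request and receiving the first token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens, averaged over the request.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    return ttft, mean_itl


ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.3f}s mean ITL={itl:.3f}s")
```

Both metrics only make sense when tokens are streamed back as they are produced, which is presumably why they arrive together with the streaming option.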

API Changes

[BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance.
[BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
[BREAKING CHANGE] Renamed GptModelConfig to ModelConfig.
[BREAKING CHANGE] Added speculative decoding mode to the builder API.
[BREAKING CHANGE] Refactored scheduling configurations.
- Unified the SchedulerPolicy with the same name in batch_scheduler and executor, and renamed it to CapacitySchedulerPolicy.
- Expanded the existing configuration scheduling strategy from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy.
[BREAKING CHANGE] The input prompt was removed from the generation output in the generate() and generate_async() APIs. For example, when given a prompt as A B, the original generation result could be <s>A B C D E where only C D E is the actual output, and now the result is C D E.
[BREAKING CHANGE] Switched default add_special_token in the TensorRT-LLM backend to True.
Deprecated GptSession and TrtGptModelV1.
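
To make the generate() output change concrete, the sketch below emulates the difference on plain token lists. It does not call TensorRT-LLM; the helper `strip_prompt` is hypothetical and only mirrors the `A B` example given above.

```python
def strip_prompt(full_tokens, prompt_tokens, bos="<s>"):
    """Emulate the new behavior: drop an optional BOS token and the echoed prompt."""
    tokens = list(full_tokens)
    if tokens and tokens[0] == bos:
        tokens = tokens[1:]
    n = len(prompt_tokens)
    if tokens[:n] != list(prompt_tokens):
        raise ValueError("output does not start with the prompt")
    return tokens[n:]


old_result = ["<s>", "A", "B", "C", "D", "E"]  # shape of a pre-0.10.0 result
new_result = strip_prompt(old_result, ["A", "B"])
print(" ".join(new_result))
```

Application code that previously sliced the prompt off the result itself would need that slicing removed after upgrading, otherwise it would discard real output tokens.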

Model Updates

Support DBRX
Support Qwen2
Support CogVLM
Support ByT5
Support LLaMA 3
Support Arctic (w/ FP8)
Support Fuyu
Support Persimmon
Support Deplot
Support Phi-3-Mini with long RoPE
Support Neva
Support Kosmos-2
Support RecurrentGemma

Fixed Issues

Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
Fixed segmentation fault with pipeline parallelism and gather_all_token_logits. (#1284)
Removed the unnecessary check in XQA to fix Code Llama 70B Triton crashes. (#1256)
Fixed an unsupported ScalarType issue for BF16 LoRA. (triton-inference-server/tensorrtllm_backend#403)
Eliminated the load and save of prompt table in multimodal. (#1436)
Fixed an error when converting the model weights of Qwen 72B INT4-GPTQ. (#1344)
Fixed early stopping and failures on in-flight batching cases of Medusa. (#1449)
Added support for more NVLink versions for auto parallelism. (#1467)
Fixed the assert failure caused by default values of sampling config. (#1447)
Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (#1446)
Fixed MMHA relative position calculation error in gpt_attention_plugin for enc-dec models. (#1343)

Infrastructure changes

Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.03-py3.
Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.03-py3.
The dependent TensorRT version is updated to 10.0.1.
The dependent CUDA version is updated to 12.4.0.
The dependent PyTorch version is updated to 2.2.2.

Currently, there are two key branches in the project:

The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch is the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team


16 Apr 04:38

kaiyux

v0.9.0

250d9c2

TensorRT-LLM 0.9.0 Release

Hi,

We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Model Support
- Support distil-whisper, thanks to the contribution from @Bhuvanesh09 in PR #1061
- Support HuggingFace StarCoder2
- Support VILA
- Support Smaug-72B-v0.1
- Migrate BLIP-2 examples to examples/multimodal
Features
- [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
- [BREAKING CHANGE] Support embedding sharing for Gemma
- Add support to context chunking to work with KV cache reuse
- Enable different rewind tokens per sequence for Medusa
- BART LoRA support (limited to the Python runtime)
- Enable multi-LoRA for BART LoRA
- Support early_stopping=False in beam search for C++ Runtime
- Add logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional)
- Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147
- Support loading Gemma from HuggingFace
- Support auto parallelism planner for high-level API and unified builder workflow
- Support running GptSession without OpenMPI #1220
- Medusa IFB support
- [Experimental] Support FP8 FMHA. Note that the performance is not optimal yet, and we will keep optimizing it
- More head sizes support for LLaMA-like models
  - Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256] now.
- OOTB functionality support
  - T5
  - Mixtral 8x7B
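
As background for the TopP item in the list above, here is a minimal sketch of plain textbook top-p (nucleus) sampling. It is not the AIR TopP algorithm that this release enables, and the function names `top_p_filter` and `sample_top_p` are hypothetical.

```python
import math
import random


def top_p_filter(logits, top_p):
    """Return {token_id: renormalized probability} for the nucleus set."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_top_p(logits, top_p, rng=random):
    """Draw one token id from the nucleus set."""
    nucleus = top_p_filter(logits, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
nucleus = top_p_filter(logits, top_p=0.8)
token = sample_top_p(logits, top_p=0.8, rng=random.Random(0))
print(sorted(nucleus), token)
```

The sort over the full vocabulary is the expensive step in this naive form, which is the kind of cost that optimized GPU implementations aim to avoid.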
API
- C++ executor API
  - Add Python bindings, see documentation and examples in examples/bindings
  - Add advanced and multi-GPU examples for Python binding of executor C++ API, see examples/bindings/README.md
  - Add documents for C++ executor API, see docs/source/executor.md
- High-level API (refer to examples/high-level-api/README.md for guidance)
  - [BREAKING CHANGE] Reuse the QuantConfig used in trtllm-build tool, support broader quantization features
  - Support in LLM() API to accept engines built by trtllm-build command
  - Add support for TensorRT-LLM checkpoint as model input
  - Refine SamplingConfig used in LLM.generate or LLM.generate_async APIs, with the support of beam search, a variety of penalties, and more features
  - Add support for the StreamingLLM feature, enable it by setting LLM(streaming_llm=...)
  - Migrate Mixtral to high level API and unified builder workflow
- [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands
- [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
- [BREAKING CHANGE] Refactor GPT with unified building workflow, see examples/gpt/README.md for the latest commands
- [BREAKING CHANGE] Moved all LoRA-related flags from the convert_checkpoint.py script and the checkpoint content to the trtllm-build command, to generalize the feature to more models
- [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to generalize the feature to more models. Use trtllm-build --max_prompt_embedding_table_size instead.
- [BREAKING CHANGE] Changed the trtllm-build --world_size flag to the --auto_parallel flag; the option is used for the auto parallel planner only.
- [BREAKING CHANGE] AsyncLLMEngine is removed. The tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py
- [BREAKING CHANGE] examples/server is removed, see examples/app instead.
- [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
- [BREAKING CHANGE] Simplify Qwen convert checkpoint script
- [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
Bug fixes
- Fix a weight-only quant bug for Whisper to make sure that the encoder_input_len_range is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
- Fix the issue that log probabilities in Python runtime are not returned #983
- Multi-GPU fixes for multimodal examples #1003
- Fix wrong end_id issue for Qwen #987
- Fix a non-stopping generation issue #1118 #1123
- Fix wrong link in examples/mixtral/README.md #1181
- Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled #967
- Fix wrong head_size when importing Gemma model from HuggingFace Hub, thanks to the contribution from @mfuntowicz in #1148
- Fix ChatGLM2-6B building failure on INT8 #1239
- Fix wrong relative path in Baichuan documentation #1242
- Fix wrong SamplingConfig tensors in ModelRunnerCpp #1183
- Fix error when converting SmoothQuant LLaMA #1267
- Fix the issue that examples/run.py only loads one line from --input_file
- Fix the issue that ModelRunnerCpp does not transfer SamplingConfig tensor fields correctly #1183
Benchmark
- Add emulated static batching in gptManagerBenchmark
- Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
- Add percentile latency report to gptManagerBenchmark
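
A percentile latency report summarizes per-request latencies by rank rather than by average. The sketch below uses the nearest-rank definition on made-up sample values; gptManagerBenchmark may use a different percentile definition, and the `percentile` helper is hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative per-request latencies in milliseconds (not measured data).
latencies_ms = [120, 95, 210, 130, 101, 99, 480, 115, 108, 125]
report = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(report)
```

Tail percentiles such as P99 expose slow outliers (480 ms in this sample) that a mean would largely hide.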
Performance
- Optimize gptDecoderBatch to support batched sampling
- Enable FMHA for models in BART, Whisper and NMT family
- Remove router tensor parallelism to improve performance for MoE models, thanks to the contribution from @megha95 in #1091
- Improve custom all-reduce kernel
Infra
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
- The dependent TensorRT version is updated to 9.3
- The dependent PyTorch version is updated to 2.2
- The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)

Currently, there are two key branches in the project:

The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch is the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,

The TensorRT-LLM Engineering Team

Contributors

mfuntowicz, Bhuvanesh09, and 2 other contributors


29 Feb 09:54

kaiyux

v0.8.0

5955b8a

TensorRT-LLM 0.8.0 Release

Hi,

We are very pleased to announce the 0.8.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
  - The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LLaVA)
Features
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
  - The support is limited to Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the temperature parameter of sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
  - Include support in the Triton backend
- Support AWQ and GPTQ for QWEN
- Support ReduceScatter plugin
- Support for combining repetition_penalty and presence_penalty #274
- Support for frequency_penalty #275
- OOTB functionality support:
  - Baichuan
  - InternLM
  - Qwen
  - BART
- LLaMA
  - Support enabling INT4-AWQ along with FP8 KV Cache
  - Support BF16 for weight-only plugin
- Baichuan
  - P-tuning support
  - INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add masked_select and cumsum function for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
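
The three sampling penalties mentioned in the list above are easiest to compare side by side. The sketch below applies their conventional definitions (a multiplicative repetition penalty, plus additive presence and frequency penalties) to a small logit vector. It is an illustration under those assumed definitions, not the TensorRT-LLM kernel, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a penalized copy of logits for tokens that were already generated."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Repetition penalty: scale the logit toward "less likely".
        value = value / repetition_penalty if value > 0 else value * repetition_penalty
        # Presence penalty: flat cost once a token has appeared at all.
        value -= presence_penalty
        # Frequency penalty: cost grows with the number of occurrences.
        value -= frequency_penalty * count
        out[token_id] = value
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_penalties(
    logits,
    generated_ids=[0, 0, 1],
    repetition_penalty=2.0,
    presence_penalty=0.5,
    frequency_penalty=0.25,
)
print(penalized)
```

Tokens that have not been generated yet (ids 2 and 3 here) keep their original logits, and the defaults leave every logit unchanged.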
API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- [BREAKING CHANGES] Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
- [BREAKING CHANGES] Deprecate LayerNorm and RMSNorm plugins and remove the corresponding build parameters
- [BREAKING CHANGES] Remove optional parameter maxNumSequences for GPT manager
Bug fixes
- Fix the first token being abnormal issue when --gather_all_token_logits is enabled #639
- Fix LLaMA with LoRA enabled build failure #673
- Fix InternLM SmoothQuant build failure #705
- Fix Bloom int8_kv_cache functionality #741
- Fix crash in gptManagerBenchmark #649
- Fix Blip2 build error #695
- Add pickle support for InferenceRequest #701
- Fix Mixtral-8x7b build failure with custom_all_reduce #825
- Fix INT8 GEMM shape #935
- Minor bug fixes
Performance
- [BREAKING CHANGES] Increase default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput
- [BREAKING CHANGES] Disable enable_trt_overlap argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
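
To see why a larger freeGpuMemoryFraction can raise throughput, the back-of-the-envelope sketch below estimates how many tokens fit in the KV cache. It assumes the parameter is the fraction of free GPU memory given to the KV cache and that each token stores one key and one value vector per layer; the function name, model shape, and memory figure are all hypothetical.

```python
def kv_cache_token_capacity(free_bytes, fraction, num_layers, num_kv_heads,
                            head_size, bytes_per_elem=2):
    """Estimate how many tokens fit in the KV cache for a given memory fraction."""
    # Each token stores a key and a value vector (factor 2) in every layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * bytes_per_elem
    budget = int(free_bytes * fraction)
    return budget // bytes_per_token


# Hypothetical setup: 40 GiB free, 32 layers, 32 KV heads, head size 128, FP16.
free_bytes = 40 * 1024**3
old_capacity = kv_cache_token_capacity(free_bytes, 0.85, 32, 32, 128)
new_capacity = kv_cache_token_capacity(free_bytes, 0.90, 32, 32, 128)
print(old_capacity, new_capacity, new_capacity - old_capacity)
```

More cached tokens let the batch manager keep more sequences in flight, at the cost of leaving less headroom for other allocations.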
Documentation
- Batch manager arguments documentation updates
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)
- Update to the docs/source/new_workflow.md documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure #133
- Typo fix #739

Currently, there are two key branches in the project:

The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch is the dev branch. It is more experimental.

Thanks,

The TensorRT-LLM Engineering Team

Contributors

erenup and Eddie-Wang1120


27 Dec 01:59

kaiyux

v0.7.1

80bc075

TensorRT-LLM 0.7.1 Release

Hi,

We are very pleased to announce the 0.7.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
  - Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
Features
- [Preview] Speculative decoding
- Add Python binding for GptManager
- Add a Python class ModelRunnerCpp that wraps C++ gptSession
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and trtllm-build command (already applied to blip2 and OPT)
- Support StoppingCriteria and LogitsProcessor in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
- Minor bug fixes
Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
Documentation
- Add documentation for new builder workflow

Currently, there are two key branches in the project:

The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch is the dev branch. It is more experimental.

Thanks,

The TensorRT-LLM Engineering Team

Contributors

0xymoro and zhang-ge-hao


04 Dec 11:11

kaiyux

v0.6.1

9b3e12d

TensorRT-LLM 0.6.1 Release

Hi,

We are very pleased to announce the 0.6.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Models
- ChatGLM3
- InternLM (contributed by @wangruohui)
- Mistral 7B (developed in collaboration with Mistral.AI)
- MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
- Qwen (contributed by @Tlntin and @zhaohb)
- Replit Code V-1.5 3B (contributed by @bheilbrun)
- T5, mT5, Flan-T5 (Python runtime only, contributed by @mlmonk and @nqbao11)
Features
- Add runtime statistics related to active requests and KV cache utilization from the batch manager (see the batch manager documentation)
- Add sequence_length tensor to support proper lengths in beam-search (when beam-width > 1 - see tensorrt_llm/batch_manager/GptManager.h)
- BF16 support for encoder-decoder models (Python runtime - see examples/enc_dec)
- Improvements to memory utilization (CPU and GPU - including memory leaks)
- Improved error reporting and memory consumption
- Improved support for stop and bad words
- INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see examples/baichuan)
- INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only support for the GPT-J model (see examples/gptj)
- INT4 AWQ support for the Falcon models (see examples/falcon)
- LoRA support (functional preview only - limited to the Python runtime, only QKV support and not optimized in terms of runtime performance) for the GPT model (see the Run LoRA with the Nemo checkpoint in the GPT example)
- Multi-GPU support for encoder-decoder models (Python runtime - see examples/enc_dec)
- New heuristic for launching the Multi-block Masked MHA kernel (similar to FlashDecoding - see decoderMaskedMultiheadAttentionLaunch.h)
- Prompt-Tuning support for GPT and LLaMA models (see the Prompt-tuning Section in the GPT example)
- Performance optimizations in various CUDA kernels
- Possibility to exclude input tokens from the output (see excludeInputInOutput in GptManager)
- Python binding for the C++ runtime (GptSession - see pybind)
- Support for different micro batch sizes for context and generation phases with pipeline parallelism (see GptSession::Config::ctxMicroBatchSize and GptSession::Config::genMicroBatchSize in tensorrt_llm/runtime/gptSession.h)
- Support for "remove input padding" for encoder-decoder models (see examples/enc_dec)
- Support for context and generation logits (see mComputeContextLogits and mComputeGenerationLogits in tensorrt_llm/runtime/gptModelConfig.h)
- Support for logProbs and cumLogProbs (see "output_log_probs" and "cum_log_probs" in GptManager)
- Update to CUTLASS 3.x
Bug fixes
- Fix for ChatGLM2 #93 and #138
- Fix tensor names error "RuntimeError: Tensor names (host_max_kv_cache_length) in engine are not the same as expected in the main branch" #369
- Fix weights split issue in BLOOM when world_size = 2 ("array split does not result in an equal division") #374
- Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
- Fix a crash in GenerationSession if stream keyword argument is not None #202
- Fix a typo when calling the PyNVML API ([BUG] code bug #410)
- Fix bugs related to the improper management of the end_id for various models [C++ and Python]
- Fix memory leaks [C++ code and Python models]
- Fix the std::bad_alloc error when running gptManagerBenchmark (issue: gptManagerBenchmark std::bad_alloc error #66)
- Fix a bug in pipeline parallelism when beam-width > 1
- Fix a bug with Llama GPTQ due to improper support of GQA
- Fix issue #88
- Fix an issue with the Huggingface Transformers version #16
- Fix link jump in windows readme.md #30 - by @yuanlehome
- Fix typo in batchScheduler.h #56 - by @eltociear
- Fix typo #58 - by @RichardScottOZ
- Fix Multi-block MMHA: Difference between max_batch_size in the engine builder and max_num_sequences in TrtGptModelOptionalParams? #65
- Fix the log message to be more accurate on KV cache #224
- Fix Windows release wheel installation: Failed to install the release wheel for Windows using pip #261
- Fix missing torch dependencies: [BUG] The batch_manage.a choice error in --cpp-only when torch's cxx_abi version is different with gcc #151
- Fix linking error during compiling google-test & benchmarks #277
- Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by the lack of bfloat16 #335
- Minor bug fixes

Currently, there are two key branches in the project:

The rel branch contains what we'd call the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch contains what we'd call the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently. The exact frequencies depend on your feedback.

Thanks,

The TensorRT-LLM Engineering Team

Contributors

zhaohb, mlmonk, and 7 other contributors


19 Oct 13:14

juney-nvidia

v0.5.0

ffd5af3

The first release of TensorRT-LLM

revise the homepage (#14)

Co-authored-by: Shi Xiaowei <xiaoweis@nvidia.com>
