
v0.2.0

@dbort released this 29 Apr 22:39 · 478 commits to main since this release

Full Changelog: v0.1.0...v0.2.0

Foundational Improvements

Large generative AI model support

  • Support generative AI models like Meta Llama 3 8B and Llama 2 7B on Android and iOS phones
  • 4-bit group-wise weight quantization
  • XNNPACK Delegate and kernels for best performance on CPU (WIP on other backends)
  • KV cache support through PyTorch mutable buffers (see the sketch after this list)
  • Custom ops for SDPA, with KV cache and multi-query attention
  • ExecuTorch Runtime + tokenizer and sampler
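
A minimal sketch of the mutable-buffer KV cache pattern, using a toy module rather than the actual Llama implementation (names, shapes, and the `index_copy_` update are illustrative assumptions):

```python
# Sketch only: a KV cache held in registered buffers, whose in-place
# updates torch.export records as mutable program state.
import torch

class KVCacheStub(torch.nn.Module):
    def __init__(self, max_seq_len=128, n_heads=4, head_dim=16):
        super().__init__()
        # Mutable buffers: torch.export captures in-place writes to these.
        self.register_buffer("k_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))
        self.register_buffer("v_cache", torch.zeros(1, n_heads, max_seq_len, head_dim))

    def forward(self, k_new, v_new, pos):
        # pos: 1-D LongTensor with the write position; index_copy_ mutates the buffer.
        self.k_cache.index_copy_(2, pos, k_new)
        self.v_cache.index_copy_(2, pos, v_new)
        # Real attention would run SDPA over the cache; a sum keeps the stub tiny.
        return self.k_cache.sum() + self.v_cache.sum()

example_inputs = (
    torch.randn(1, 4, 1, 16),  # k_new for one new token
    torch.randn(1, 4, 1, 16),  # v_new for one new token
    torch.tensor([0]),         # write position
)
ep = torch.export.export(KVCacheStub(), example_inputs)
# The export graph signature lists which buffers the program mutates in place.
print(ep.graph_signature.buffers_to_mutate)
```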

Core ExecuTorch improvements

  • Simplified setup experience
  • Support for PyTorch mutable buffers
  • Support for multi-gigabyte models
  • Constant data moved to its own .pte segment for more efficient serialization
  • Better kernel coverage in the portable library and the XNNPACK, Arm, Core ML, MPS, and HTP delegates
  • SDK: better profiling and debugging within delegates
  • API improvements/simplification
  • Dozens of fixes to fuzzer-identified .pte file-parsing issues
  • Vulkan delegate for mobile GPU
  • Data-type based selective build for optimizing binary size
  • Compatibility with torchtune
  • More models supported across different backends
  • Python code now available as the "executorch" pip package on PyPI; a minimal export flow is sketched below
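
With the package on PyPI, the basic ahead-of-time flow runs in plain Python. A minimal sketch, assuming the documented `to_edge` API (exact module paths can vary between releases):

```python
# pip install executorch
import torch
from executorch.exir import to_edge

class Mul(torch.nn.Module):
    def forward(self, x, y):
        return x * y

# Capture with torch.export, convert to Edge dialect, then serialize to .pte.
ep = torch.export.export(Mul(), (torch.randn(3), torch.randn(3)))
et_program = to_edge(ep).to_executorch()
with open("mul.pte", "wb") as f:
    f.write(et_program.buffer)  # constant data is stored in its own segment
```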

Hardware Acceleration Improvements

Arm

  • Significant boost in operator test coverage through the use of the TOSA reference model, as well as improved CI coverage
  • Added support for quantization with the ArmQuantizer
  • Added support for MobileNet v2 TOSA generation
  • Working towards MobileNet v2 execution on Ethos-U
  • Added support for multiple new operators in the Ethos-U compiler
  • Added NCHW/NHWC conversion for Ethos-U targets until NHWC is supported by ExecuTorch
  • Arm backend example now works on macOS

Apple Core ML

  • [SDK] ExecuTorch SDK integration for a better debugging and profiling experience, built on the new MLComputePlan API released in iOS 17.4 and macOS 14.4
  • [SDK] A model lowered to the Core ML backend can be profiled using the ExecuTorch Inspector without additional setup
  • [SDK] Profiling surfaces Core ML-specific information for each operation in the model, including supported compute devices, the preferred compute device, and the estimated cost for each compute device
  • [SDK] The Core ML delegate backend also supports logging intermediate tensors for model debugging.
  • [Partitioner] Enables developers to lower a model even if Core ML doesn’t support all of its operations
  • [Partitioner] Developers can now specify the operations that the Core ML backend should skip when lowering a model (see the sketch after this list)
  • [Quantizer] Leverages PyTorch 2.0 export-based quantization APIs.
  • [Quantizer] Encodes specific quantization rules in order to optimize the model for execution on Apple silicon
  • [Quantizer] Integrated with ExecuTorch Core ML delegate conversion pipeline
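
A sketch of the skip-list flow, assuming the `CoreMLPartitioner` import path and the `skip_ops_for_coreml_delegation` parameter from the Core ML backend docs (both may differ across versions):

```python
import torch
from executorch.exir import to_edge
from executorch.backends.apple.coreml.partition.coreml_partitioner import (
    CoreMLPartitioner,  # import path assumed; check the backend README for your version
)

class Net(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + torch.relu(x)

ep = torch.export.export(Net(), (torch.randn(4),))
# Ops named here are kept out of the Core ML delegate and run in the portable runtime.
edge = to_edge(ep).to_backend(
    CoreMLPartitioner(skip_ops_for_coreml_delegation=["aten.sin.default"])
)
et_program = edge.to_executorch()
```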

Apple MPS

  • Support for over 100 ops (parity with PyTorch MPS backend supported ops)
  • Support for iOS/iPadOS 14.4+ and macOS 12.4+
  • Support for MPSPartitioner
  • Support for the following dtypes: fp16, fp32, bfloat16, int8, int16, int32, int64, uint8, bool
  • Support for profiling (ETRecord, ETDump) through the Inspector API (see the sketch after this list)
  • Full unit testing coverage for AOT and runtime for all supported operators
  • Enabled storiesllama (floating point) on MPS
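
Profiling via the Inspector follows the SDK pattern: pair an ETRecord saved at export time with an ETDump produced by the runtime. A sketch, assuming the v0.2 `executorch.sdk` module path (file names are placeholders):

```python
from executorch.sdk import Inspector  # module path per the v0.2 SDK docs

inspector = Inspector(
    etdump_path="etdump.etdp",   # produced by the runtime
    etrecord="etrecord.bin",     # saved during export/lowering
)
inspector.print_data_tabular()   # per-event timing, including delegate events
```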

Qualcomm

  • Added support for Snapdragon 8 Gen 3
  • Enabled on-device compilation (aka QNN online-prepare)
  • Enabled 4-bit and 16-bit quantization
  • Integrated Qualcomm AI Studio QNN profiling into the ExecuTorch flow
  • Enabled storiesllama on HTP in fp16 (thanks to Chen Lai from Meta, the main contributor to this effort)
  • Added support for more operators
  • Additional models validated since v0.1.0:
    • FbNet
    • W2l (Wav2LetterModel)
    • SSD300_VGG16
    • ViT
    • Quantized MobileBert (contributed before the v0.1.0 timeline but merged afterwards)

Cadence HiFi

  • Expanded operator support for Cadence HiFi targets
  • Added first small model (RNNT-emformer predictor) to the Cadence HiFi examples

Model Support

Validated with one or more delegates

Meta Llama 2 7B, Meta Llama 3 8B, Conformer, dcgan, Deeplab_v3, Edsr, Emformer_rnnt, functorch_dp_cifar10, Inception_v3, Inception_v4, LearningToPaint, lennard_jones, LSTM, maml_omniglot, mnasnet1_0, Mobilebert, Mobilenet_v2, Mobilenet_v3, phlippe_resnet, resnet18, resnet50, shufflenet_v2_x1_0, squeezenet1_1, SqueezeSAM, timm_efficientnet, Torchvision_vit, Wav2letter, Yolo v5

Tested with torch.export but not optimized for performance

Aquila 1 7B, Aquila 2 7B, Baichuan 1 7B, BioGPT, BLOOM 7B1, Chinese Alpaca 2 7B, Chinese LLaMA 2 7B, CodeShell, Deepseek, GPT Neo 1.3B, GPT NeoX 20B, GPT-2, GPT-J 6B, InternLM2 7B, Koala, MiniCPM 2B sft, Mistral 7B, Mixtral 8x7B MoE, Persimmon 8B chat, Phi 1, Phi 1.5, Phi 2, PLaMo 13B, Qwen 1.5 7B, Refact, RWKV 5 world 1B5, Stable LM 2 1.6B, Stable LM 3B, Starcoder, Starcoder 2, Vigogne (French), Yi 6B