[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
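As a hedged illustration of what visual instruction following looks like with LLaVA, here is a minimal sketch that queries a checkpoint through the community Hugging Face Transformers port; the `llava-hf/llava-1.5-7b-hf` model ID and the `USER: <image> ... ASSISTANT:` prompt template are assumptions based on that port, not the official repo's own CLI and serving scripts.

```python
# Minimal sketch: (image, instruction) -> text with a LLaVA checkpoint.
# Assumes the community Hugging Face port (llava-hf/llava-1.5-7b-hf).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("view.jpg")  # any local image
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```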
The official repo of Qwen-VL (通义千问-VL), the pretrained large vision-language model and its chat variant proposed by Alibaba Cloud.
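A sketch of Qwen-VL-Chat's documented chat interface follows; the `Qwen/Qwen-VL-Chat` model ID, `from_list_format`, and `model.chat` are taken from the repo's README but may change between releases.

```python
# Sketch of the Qwen-VL-Chat interface documented in the repo's README.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image reference and a text question into a single query.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # local path or URL to any image
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```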
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Collection of AWESOME vision-language models for vision tasks
DeepSeek-VL: Towards Real-World Vision-Language Understanding
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable, open-source multimodal chat model approaching GPT-4V performance.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA or CogVLM. 🔥
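To make the prompting pattern concrete, here is a minimal sketch of an image-grounded request through the OpenAI Python SDK (v1.x); the message schema follows OpenAI's documented vision API, and the image URL is a placeholder. This is not code from the linked repo; LLaVA or CogVLM backends would need their own adapters.

```python
# Sketch: image-grounded prompt via the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List every object in this image as a JSON array."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```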
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
Grounded Multimodal Large Language Model with Localized Visual Tokenization
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
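Conceptually, side-by-side evaluation reduces to running the same (image, question) pair through several models and comparing answers. The sketch below is purely illustrative and not the Arena's actual API; the `compare` helper, model names, and stub lambdas are all hypothetical placeholders you would wire to real backends.

```python
# Conceptual side-by-side comparison: one (image, question) pair, many
# vision-language models. Each callable is a hypothetical adapter you
# would implement per backend (HF pipeline, REST endpoint, etc.).
from typing import Callable, Dict

def compare(models: Dict[str, Callable[[str, str], str]],
            image_path: str, question: str) -> None:
    for name, ask in models.items():
        print(f"--- {name} ---")
        print(ask(image_path, question))

# Example wiring (each lambda would wrap a real model client):
models = {
    "llava-1.5": lambda img, q: "stub answer from LLaVA",
    "qwen-vl-chat": lambda img, q: "stub answer from Qwen-VL",
}
compare(models, "photo.jpg", "How many people are in the image?")
```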
Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. Includes VL-LLaMA and VL-Vicuna.
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox