[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
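As a hedged illustration of what visual instruction following looks like with LLaVA, here is a minimal sketch that queries a checkpoint through the community Hugging Face Transformers port; the `llava-hf/llava-1.5-7b-hf` model ID and the `USER: <image> ... ASSISTANT:` prompt template are assumptions based on that port, not the official repo's own CLI and serving scripts.

```python
# Minimal sketch: (image, instruction) -> text with a LLaVA checkpoint.
# Assumes the community Hugging Face port (llava-hf/llava-1.5-7b-hf).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("view.jpg")  # any local image
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```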
The official repo of Qwen-VL (通义千问-VL), the pretrained large vision-language model and its chat variant proposed by Alibaba Cloud.
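A sketch of Qwen-VL-Chat's documented chat interface follows; the `Qwen/Qwen-VL-Chat` model ID, `from_list_format`, and `model.chat` are taken from the repo's README but may change between releases.

```python
# Sketch of the Qwen-VL-Chat interface documented in the repo's README.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image reference and a text question into a single query.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # local path or URL to any image
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```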
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Collection of AWESOME vision-language models for vision tasks
DeepSeek-VL: Towards Real-World Vision-Language Understanding
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable, open-source multimodal chat model approaching GPT-4V performance.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA or CogVLM. 🔥
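To make the prompting pattern concrete, here is a minimal sketch of an image-grounded request through the OpenAI Python SDK (v1.x); the message schema follows OpenAI's documented vision API, and the image URL is a placeholder. This is not code from the linked repo; LLaVA or CogVLM backends would need their own adapters.

```python
# Sketch: image-grounded prompt via the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List every object in this image as a JSON array."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```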
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
Grounded Multimodal Large Language Model with Localized Visual Tokenization
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
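Conceptually, side-by-side evaluation reduces to running the same (image, question) pair through several models and comparing answers. The sketch below is purely illustrative and not the Arena's actual API; the `compare` helper, model names, and stub lambdas are all hypothetical placeholders you would wire to real backends.

```python
# Conceptual side-by-side comparison: one (image, question) pair, many
# vision-language models. Each callable is a hypothetical adapter you
# would implement per backend (HF pipeline, REST endpoint, etc.).
from typing import Callable, Dict

def compare(models: Dict[str, Callable[[str, str], str]],
            image_path: str, question: str) -> None:
    for name, ask in models.items():
        print(f"--- {name} ---")
        print(ask(image_path, question))

# Example wiring (each lambda would wrap a real model client):
models = {
    "llava-1.5": lambda img, q: "stub answer from LLaVA",
    "qwen-vl-chat": lambda img, q: "stub answer from Qwen-VL",
}
compare(models, "photo.jpg", "How many people are in the image?")
```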
Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. Includes VL-LLaMA and VL-Vicuna.
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox