blip-2-vision-language
$ npx mdskill add Orchestra-Research/AI-Research-SKILLs/blip-2-vision-language
Enables vision-language tasks using frozen image encoders and LLMs:
- Solves image captioning, VQA, and multimodal chat with zero-shot performance
- Relies on Hugging Face Transformers, PyTorch, and Pillow for processing
- Leverages Q-Former architecture to bridge vision and language models
- Delivers results via natural language outputs and image-text matching
---
name: blip-2-vision-language
description: Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Multimodal, Vision-Language, Image Captioning, VQA, Zero-Shot]
dependencies: [transformers>=4.30.0, torch>=1.10.0, Pillow]
---
# BLIP-2: Vision-Language Pre-training
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
## When to use BLIP-2
**Use BLIP-2 when:**
- You need high-quality image captioning with natural descriptions
- You are building visual question answering (VQA) systems
- You require zero-shot image-text understanding without task-specific training
- You want to leverage LLM reasoning for visual tasks
- You are building multimodal conversational AI
- You need image-text retrieval or matching
**Key features:**
- **Q-Former architecture**: Lightweight query transformer bridges vision and language
- **Frozen backbone efficiency**: No need to fine-tune large vision/language models
- **Multiple LLM backends**: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- **Zero-shot capabilities**: Strong performance without task-specific training
- **Efficient training**: Only trains Q-Former (~188M parameters)
- **State-of-the-art results (at release)**: The paper reports outperforming Flamingo-80B by 8.7% on zero-shot VQAv2 with 54× fewer trainable parameters
**Use alternatives instead:**
- **LLaVA**: For instruction-following multimodal chat
- **InstructBLIP**: For improved instruction-following (BLIP-2 successor)
- **GPT-4V/Claude 3**: For production multimodal chat (proprietary)
- **CLIP**: For simple image-text similarity without generation
- **Flamingo**: For few-shot visual learning
## Quick start
### Installation
```bash
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow
# Or LAVIS library (Salesforce official)
pip install salesforce-lavis
```
### Basic image captioning
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)
# Load image
image = Image.open("photo.jpg").convert("RGB")
# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```
### Visual question answering
```python
# Ask a question about the image.
# The Hugging Face BLIP-2 examples wrap questions in a "Question: ... Answer:" prompt.
question = "What color is the car in this image?"
prompt = f"Question: {question} Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```
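For follow-up questions, earlier question-answer pairs can be folded into the prompt as context. The helper below is a minimal sketch of that idea: `build_vqa_prompt` and its `history` argument are illustrative names, not part of Transformers, and the template simply repeats a `Question: ... Answer:` pattern.

```python
def build_vqa_prompt(question, history=None):
    """Build a 'Question: ... Answer:' prompt, optionally with earlier Q&A turns.

    `history` is a list of (question, answer) tuples from previous turns.
    Hypothetical helper for illustration only.
    """
    turns = [f"Question: {q} Answer: {a}." for q, a in (history or [])]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


prompt = build_vqa_prompt(
    "What color is it?",
    history=[("What vehicle is in the image?", "a car")],
)
print(prompt)
# Question: What vehicle is in the image? Answer: a car. Question: What color is it? Answer:
```

The resulting string would be passed as `text=prompt` to the processor in place of a single question.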
### Using LAVIS library
```python
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device
)
# Process image
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
# Caption
caption = model.generate({"image": image})
print(caption)
# VQA: the prompt is passed as a plain "Question: ... Answer:" string, as in the LAVIS BLIP-2 examples
question = "Question: What is in this image? Answer:"
answer = model.generate({"image": image, "prompt": question})
print(answer)
```
## Core concepts
### Architecture overview
```
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│                          Q-Former                           │
│  ┌─────────────────────────────────────────────────────┐    │
│  │     Learned Queries (32 queries × 768 dim)          │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │     Cross-Attention with Image Features             │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│  ┌────────────────────────▼────────────────────────────┐    │
│  │     Self-Attention Layers (Transformer)             │    │
│  └────────────────────────┬────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│  Frozen Vision Encoder    │  Frozen LLM                     │
│  (ViT-G/14 from EVA-CLIP) │  (OPT or FlanT5)                │
└─────────────────────────────────────────────────────────────┘
```
### Model variants
| Model | LLM Backend | Size | Use Case |
|-------|-------------|------|----------|
| `blip2-opt-2.7b` | OPT-2.7B | ~4GB | General captioning, VQA |
| `blip2-opt-6.7b` | OPT-6.7B | ~8GB | Better reasoning |
| `blip2-flan-t5-xl` | FlanT5-XL | ~5GB | Instruction following |
| `blip2-flan-t5-xxl` | FlanT5-XXL | ~13GB | Best quality |
### Q-Former components
| Component | Description | Parameters |
|-----------|-------------|------------|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
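
To make the role of the learned queries concrete, here is a minimal single-head cross-attention sketch in NumPy. It is not the Q-Former implementation: the weights are random, there is a single head with no query projection, self-attention, or feed-forward layers, and the image feature shape (257 tokens × 1408 dims) is an assumption. The point is only that 32 queries yield a fixed 32 × 768 output regardless of how many image tokens come in.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_QUERIES, QUERY_DIM = 32, 768          # learned queries: 32 × 768
NUM_IMAGE_TOKENS, IMAGE_DIM = 257, 1408   # assumed vision encoder output shape


def cross_attend(queries, image_feats, w_k, w_v):
    """Single-head cross-attention: queries attend over image tokens."""
    keys = image_feats @ w_k                                  # [tokens, QUERY_DIM]
    values = image_feats @ w_v                                # [tokens, QUERY_DIM]
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])    # [queries, tokens]
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over image tokens
    return weights @ values                                   # [queries, QUERY_DIM]


# Random stand-ins for the learned queries and the key/value projections
queries = rng.standard_normal((NUM_QUERIES, QUERY_DIM)) * 0.02
w_k = rng.standard_normal((IMAGE_DIM, QUERY_DIM)) * 0.02
w_v = rng.standard_normal((IMAGE_DIM, QUERY_DIM)) * 0.02

for num_tokens in (NUM_IMAGE_TOKENS, 577):
    image_feats = rng.standard_normal((num_tokens, IMAGE_DIM))
    out = cross_attend(queries, image_feats, w_k, w_v)
    print(num_tokens, "image tokens ->", out.shape)  # shape stays (32, 768)
```

Because the output size depends only on the number of queries, the LLM always receives the same number of visual tokens.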
## Advanced usage
### Batch processing
```python
from PIL import Image
import torch
# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What is shown in this image?",
    "Describe the scene.",
    "What colors are prominent?",
    "Is there a person in this image?"
]
prompts = [f"Question: {q} Answer:" for q in questions]
# Decoder-only backends such as OPT generally need left padding for batched generation
processor.tokenizer.padding_side = "left"
# Process batch (one prompt per image)
inputs = processor(
    images=images,
    text=prompts,
    return_tensors="pt",
    padding=True
).to("cuda", torch.float16)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a.strip()}\n")
```
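For more images than fit in one forward pass, the inputs can be split into fixed-size chunks and each chunk sent through the batch code above. A minimal sketch: `chunked` is an illustrative helper name, and the batch size of 4 is an arbitrary assumption to tune against available VRAM.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


image_paths = [f"image_{i}.jpg" for i in range(10)]
for batch in chunked(image_paths, batch_size=4):
    # Each `batch` would be opened with PIL and passed to processor(...) and model.generate(...)
    print(len(batch), batch[0], "...", batch[-1])
```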
### Controlling generation
```python
# Control generation parameters
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,              # Beam search
    no_repeat_ngram_size=2,   # Avoid repetition
    top_p=0.9,                # Nucleus sampling
    temperature=0.7,          # Creativity
    do_sample=True,           # Enable sampling
)
# For deterministic output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    do_sample=False,
)
```
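As a rough illustration of what `top_p` does, the sketch below filters a toy probability distribution the way nucleus sampling is usually described: keep the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, then renormalize. This is plain Python for intuition, not the Transformers implementation, and the token names and probabilities are made up.

```python
def top_p_filter(probs, top_p):
    """Return the renormalized nucleus of a {token: probability} mapping."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"car": 0.5, "truck": 0.3, "bus": 0.15, "boat": 0.05}
print(top_p_filter(toy_probs, top_p=0.9))  # "boat" is dropped; the rest are rescaled
```

Sampling then draws only from the surviving tokens, which is why a lower `top_p` tends to give more conservative text.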
### Memory optimization
```python
# 8-bit quantization (requires the bitsandbytes package: pip install bitsandbytes)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b",
    quantization_config=quantization_config,
    device_map="auto"
)
# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=quantization_config,
    device_map="auto"
)
```
### Image-text matching
```python
# Using LAVIS for ITM (Image-Text Matching)
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_image_text_matching",
    model_type="pretrain",
    is_eval=True,
    device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")
# Get matching score (class index 1 = "match")
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
```
### Feature extraction
```python
# Extract image features with Q-Former
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor",
    model_type="pretrain",
    is_eval=True,
    device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds          # Shape: [1, 32, 768]
image_features = features.image_embeds_proj   # Projected for matching
```
## Common workflows
### Workflow 1: Image captioning pipeline
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from pathlib import Path
class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def caption(self, image_path: str, prompt: str = None) -> str:
        image = Image.open(image_path).convert("RGB")
        if prompt:
            inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        else:
            inputs = self.processor(images=image, return_tensors="pt")
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:
        images = [Image.open(p).convert("RGB") for p in image_paths]
        if prompt:
            inputs = self.processor(
                images=images,
                text=[prompt] * len(images),
                return_tensors="pt",
                padding=True
            )
        else:
            inputs = self.processor(images=images, return_tensors="pt", padding=True)
        inputs = inputs.to("cuda", torch.float16)
        generated_ids = self.model.generate(**inputs, max_new_tokens=50)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)

# Usage
captioner = ImageCaptioner()
# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")
# With prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")
# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
    print(f"Image {i+1}: {cap}")
```
### Workflow 2: Visual Q&A system
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class VisualQA:
    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
        self.processor = Blip2Processor.from_pretrained(model_name)
        self.model = Blip2ForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.current_image = None
        self.current_inputs = None

    def set_image(self, image_path: str):
        """Load image for multiple questions."""
        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:
        """Ask a question about the current image."""
        if self.current_image is None:
            raise ValueError("No image set. Call set_image() first.")
        # Format question for FlanT5
        prompt = f"Question: {question} Answer:"
        inputs = self.processor(
            images=self.current_image,
            text=prompt,
            return_tensors="pt"
        ).to("cuda", torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5
        )
        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:
        """Ask multiple questions about current image."""
        return {q: self.ask(q) for q in questions}

# Usage
vqa = VisualQA()
vqa.set_image("scene.jpg")
# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))
# Batch questions
results = vqa.ask_multiple([
    "What is the main subject?",
    "What colors are dominant?",
    "Is this indoors or outdoors?"
])
```
### Workflow 3: Image search/retrieval
```python
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess
class ImageSearchEngine:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
            name="blip2_feature_extractor",
            model_type="pretrain",
            is_eval=True,
            device=self.device
        )
        self.image_features = None
        self.image_paths = []

    def index_images(self, image_paths: list):
        """Build index from images (replaces any existing index)."""
        self.image_paths = list(image_paths)
        features_list = []
        for path in self.image_paths:
            image = Image.open(path).convert("RGB")
            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
            with torch.no_grad():
                features = self.model.extract_features({"image": image}, mode="image")
            # Use projected features for matching: mean-pool the 32 query embeddings
            features_list.append(
                features.image_embeds_proj.mean(dim=1).cpu().numpy()
            )
        self.image_features = np.vstack(features_list)  # [num_images, 256]

    def search(self, query: str, top_k: int = 5) -> list:
        """Search images by text query."""
        # Get text features
        text = self.txt_processors["eval"](query)
        text_input = {"text_input": [text]}
        with torch.no_grad():
            text_features = self.model.extract_features(text_input, mode="text")
        # Index 0 along the sequence axis is the CLS token
        text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
        # Compute similarities; squeeze only the text axis so a single-image index still works
        similarities = np.dot(self.image_features, text_embeds.T).squeeze(axis=1)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.image_paths[i], float(similarities[i])) for i in top_indices]

# Usage
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])
# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
    print(f"{path}: {score:.3f}")
```
## Output format
### Generation output
```python
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]
# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
```
### Feature extraction output
```python
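# Q-Former outputs (image mode; text fields are not populated in this mode)
features = model.extract_features({"image": image}, mode="image")
features.image_embeds            # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj       # [B, 32, 256] - Projected for matching
# Text outputs (text mode); `text` is a processed caption string
text_features = model.extract_features({"text_input": [text]}, mode="text")
text_features.text_embeds        # [B, seq_len, 768] - Text features
text_features.text_embeds_proj   # [B, seq_len, 256] - Projected text; index [:, 0] is the CLS token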
```
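The projected features can be compared with a dot product. The BLIP-2 paper describes scoring an image-text pair by the highest similarity across the 32 query outputs. The NumPy sketch below illustrates that reduction on random stand-in arrays (image features of shape [B, 32, 256] and one 256-dim vector per caption). It does not call LAVIS, and `image_text_similarity` and `l2_normalize` are illustrative names.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def image_text_similarity(image_embeds_proj, text_embeds):
    """Max-over-queries similarity.

    image_embeds_proj: [num_images, num_queries, dim], L2-normalized
    text_embeds:       [num_texts, dim], L2-normalized
    Returns:           [num_images, num_texts]
    """
    # Similarity of every query output with every text vector
    sims = np.einsum("iqd,td->iqt", image_embeds_proj, text_embeds)
    # Keep the best-matching query for each image-text pair
    return sims.max(axis=1)


rng = np.random.default_rng(0)
image_embeds_proj = l2_normalize(rng.standard_normal((3, 32, 256)))  # stand-in for 3 images
text_embeds = l2_normalize(rng.standard_normal((2, 256)))            # stand-in for 2 captions

scores = image_text_similarity(image_embeds_proj, text_embeds)
print(scores.shape)  # (3, 2)
```

Taking the maximum instead of the mean lets a single query that matches the caption well dominate the score. Which reduction works better for a given retrieval task is an empirical question.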
## Performance optimization
### GPU memory requirements
| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|-------|-----------|-----------|-----------|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
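
The FP16 column is roughly what weight storage alone predicts: parameters × bytes per parameter. The following is a back-of-the-envelope sketch. The total parameter counts are approximate assumptions (not taken from this guide), and activations, generation caches, and framework overhead are ignored, so real usage is higher, especially in the quantized columns.

```python
# Approximate total parameter counts in billions (assumed values for illustration)
APPROX_PARAMS_B = {
    "blip2-opt-2.7b": 3.8,
    "blip2-opt-6.7b": 7.8,
    "blip2-flan-t5-xl": 4.1,
    "blip2-flan-t5-xxl": 12.1,
}


def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for name, params_b in APPROX_PARAMS_B.items():
    fp16, int8, int4 = (weight_memory_gb(params_b, bits) for bits in (16, 8, 4))
    print(f"{name:<18} FP16 {fp16:5.1f} GB | INT8 {int8:5.1f} GB | INT4 {int4:5.1f} GB")
```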
### Speed optimization
```python
# Use Flash Attention if available
# (requires the flash-attn package and a transformers version whose BLIP-2 implementation supports it)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
# Compile model (PyTorch 2.0+)
model = torch.compile(model)
# Image size: the processor already resizes inputs to 224x224 by default,
# so no extra downscaling is needed
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
```
## Common issues
| Issue | Solution |
|-------|----------|
| CUDA OOM | Use INT8/INT4 quantization, smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase question, provide context |
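
For the CUDA OOM row, one way to choose a configuration is to look it up against the estimates in the GPU memory requirements table. The sketch below hard-codes those approximate figures. `pick_precision` and the 1 GB headroom default are illustrative assumptions, not part of any library.

```python
# Approximate VRAM needs in GB, copied from the GPU memory requirements table
VRAM_GB = {
    "blip2-opt-2.7b": {"fp16": 8, "int8": 5, "int4": 3},
    "blip2-opt-6.7b": {"fp16": 16, "int8": 9, "int4": 5},
    "blip2-flan-t5-xl": {"fp16": 10, "int8": 6, "int4": 4},
    "blip2-flan-t5-xxl": {"fp16": 26, "int8": 14, "int4": 8},
}


def pick_precision(model_name, available_gb, headroom_gb=1.0):
    """Return the highest precision that fits in `available_gb`, or None if nothing fits."""
    for precision in ("fp16", "int8", "int4"):
        if VRAM_GB[model_name][precision] + headroom_gb <= available_gb:
            return precision
    return None


print(pick_precision("blip2-opt-6.7b", available_gb=12))     # int8
print(pick_precision("blip2-flan-t5-xxl", available_gb=8))   # None: try a smaller model
```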
## References
- **[Advanced Usage](references/advanced-usage.md)** - Fine-tuning, integration, deployment
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
## Resources
- **Paper**: https://arxiv.org/abs/2301.12597
- **GitHub (LAVIS)**: https://github.com/salesforce/LAVIS
- **HuggingFace**: https://huggingface.co/Salesforce/blip2-opt-2.7b
- **Demo**: https://huggingface.co/spaces/Salesforce/BLIP2
- **InstructBLIP**: https://arxiv.org/abs/2305.06500 (successor)