Chunkr integrates Vision-Language Models (VLMs) to provide enhanced content generation, intelligent table processing, formula extraction, and custom segment analysis using visual understanding.

Overview

VLM processing in Chunkr allows you to:
  • Generate enhanced content using fine-tuned models (strategy: LLM)
  • Add custom LLM-powered analysis to any segment type
  • Process complex tables with better structure understanding
  • Extract mathematical formulas as LaTeX
  • Generate custom descriptions for images and diagrams

Generation Strategies

Auto Strategy

Uses heuristic-based generation; fast and efficient:
{
  "segment_processing": {
    "Text": {
      "strategy": "Auto"
    }
  }
}
Best for: Standard text, lists, simple formatting

LLM Strategy

Uses Chunkr’s fine-tuned Vision-Language Models:
{
  "segment_processing": {
    "Table": {
      "strategy": "LLM"
    }
  }
}
Best for: Tables, formulas, pictures, complex layouts
The LLM strategy uses the page image as context to generate more accurate structured content.

Default VLM Configuration

Some segment types use LLM strategy by default:
Segment Type | Default Strategy | Reason
Table        | LLM              | Better structure understanding
Formula      | LLM              | LaTeX extraction from images
Picture      | LLM              | Visual content description
Page         | LLM              | Full page understanding
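
To override a default, set the strategy explicitly; a minimal sketch forcing tables back to heuristic generation:
{
  "segment_processing": {
    "Table": {
      "strategy": "Auto"
    }
  }
}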

Custom LLM Prompts

Add custom LLM-powered analysis to any segment using the llm field:
Python
import requests

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "segment_processing": {
            "Picture": {
                "format": "Markdown",
                "strategy": "Auto",
                "llm": "Describe this image in detail, focusing on key visual elements and any text present."
            },
            "Table": {
                "format": "Html",
                "strategy": "LLM",
                "llm": "Extract the data and create a summary of key insights from this table."
            },
            "Text": {
                "format": "Markdown",
                "strategy": "Auto",
                "llm": "Summarize this section in 2-3 sentences."
            }
        }
    }
)
The LLM output is stored in segment.llm and can be included in chunks via embed_sources.

Extended Context

Use the full page image as context for LLM generation:
{
  "segment_processing": {
    "Table": {
      "format": "Html",
      "strategy": "LLM",
      "extended_context": true
    }
  }
}
Default: false
Extended context can improve results but increases processing time and token usage.

Configuring LLM Models

Chunkr supports global LLM configuration to control which models are used for VLM processing.

LLM Processing Configuration

A full configuration:
{
  "llm_processing": {
    "model_id": "gpt-4o",
    "fallback_strategy": "Default",
    "max_completion_tokens": 2048,
    "temperature": 0.0
  }
}
Or specify only the model:
{
  "llm_processing": {
    "model_id": "gpt-4o"
  }
}
model_id selects which model is used; check the documentation for available models.
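
llm_processing sits at the top level of the parse request, alongside options like segment_processing (see the complete example below); a minimal request sketch:
Python
import requests

# Parse request with a global LLM configuration
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "llm_processing": {
            "model_id": "gpt-4o",
            "max_completion_tokens": 2048,
            "temperature": 0.0
        }
    }
)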

Image Cropping

Control when segment images are cropped and stored:
{
  "crop_image": "Auto"
}
Auto, the default for most segments, crops segment images only when needed for post-processing.
Cropped images are available in segment.image as presigned URLs.
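
Because the URL is presigned, the image can be fetched without an Authorization header. A minimal download sketch for a segment dict from the task output (saving as .png is an assumption; the actual format may differ):
Python
import requests

if segment.get("image"):
    resp = requests.get(segment["image"])  # presigned URL; no auth header needed
    resp.raise_for_status()
    with open("segment.png", "wb") as f:  # .png extension is an assumption
        f.write(resp.content)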

Complete VLM Example

Python
import requests
import time

# Advanced VLM configuration
config = {
    "file": "https://example.com/research-paper.pdf",
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_strategy": "All",
    "high_resolution": true,  # Better image quality for VLM
    "llm_processing": {
        "model_id": "gpt-4o",
        "fallback_strategy": "Default",
        "temperature": 0.0
    },
    "segment_processing": {
        "Table": {
            "format": "Html",
            "strategy": "LLM",
            "crop_image": "All",
            "extended_context": true,
            "llm": "Convert this table to structured HTML and provide a brief summary of the data.",
            "embed_sources": ["Content", "LLM"]
        },
        "Picture": {
            "format": "Markdown",
            "strategy": "Auto",
            "crop_image": "All",
            "llm": "Describe this figure, including any charts, graphs, or diagrams. Explain what it illustrates.",
            "embed_sources": ["Content", "LLM"]
        },
        "Formula": {
            "format": "Markdown",
            "strategy": "LLM",
            "crop_image": "All",
            "llm": "Extract this formula as LaTeX and explain what it represents.",
            "embed_sources": ["Content", "LLM"]
        },
        "Text": {
            "format": "Markdown",
            "strategy": "Auto",
            "llm": "Provide a concise summary of this section.",
            "embed_sources": ["Content"]
        }
    }
}

# Create task
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json=config
)
task = response.json()
task_id = task["task_id"]

# Poll for completion
while True:
    response = requests.get(
        f"https://api.chunkr.ai/api/v1/task/{task_id}",
        headers={"Authorization": "YOUR_API_KEY"}
    )
    task = response.json()
    
    if task["status"] == "Succeeded":
        break
    elif task["status"] == "Failed":
        raise Exception("Task failed")
    
    time.sleep(2)

# Access VLM-processed content
for chunk in task["output"]["chunks"]:
    for segment in chunk["segments"]:
        print(f"\nSegment Type: {segment['segment_type']}")
        print(f"Content: {segment['content'][:200]}...")
        
        # Access custom LLM output if configured
        if segment.get('llm'):
            print(f"LLM Analysis: {segment['llm'][:200]}...")
        
        # Access cropped image if available
        if segment.get('image'):
            print(f"Image URL: {segment['image']}")

Accessing VLM Output

VLM-generated content is available in multiple fields:
Python
for segment in chunk["segments"]:
    # Primary content (based on format and strategy)
    content = segment["content"]  # HTML or Markdown
    
    # Original OCR text
    text = segment["text"]
    
    # Custom LLM output (if configured)
    if "llm" in segment and segment["llm"]:
        llm_analysis = segment["llm"]
    
    # Specific format outputs (backward compatibility)
    html = segment["html"]
    markdown = segment["markdown"]
    
    # Cropped image URL (if available)
    if "image" in segment and segment["image"]:
        image_url = segment["image"]

Embed Sources with VLM

Control which VLM outputs are included in chunk embeddings:
{
    "segment_processing": {
        "Table": {
            "format": "Html",
            "strategy": "LLM",
            "llm": "Summarize key data points from this table.",
            "embed_sources": ["Content", "LLM"]
        }
    }
}
Result: The chunk’s embed field will contain:
  1. The HTML table structure (Content)
  2. The LLM summary (LLM)
Order matters! Sources appear in the embed field in the order specified.
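
Once the task succeeds, the combined text can be read from each chunk; a short sketch using the fields above:
Python
for chunk in task["output"]["chunks"]:
    embed_text = chunk["embed"]  # HTML table first, then the LLM summary
    print(embed_text[:200])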

Use Cases

Complex table extraction:
{
    "Table": {
        "format": "Html",
        "strategy": "LLM",
        "extended_context": true,
        "embed_sources": ["Content"]
    }
}
VLM provides better table structure understanding, especially for complex or merged cells.
Formula to LaTeX:
{
    "Formula": {
        "format": "Markdown",
        "strategy": "LLM",
        "crop_image": "All",
        "llm": "Extract this formula as LaTeX.",
        "embed_sources": ["Content", "LLM"]
    }
}
Convert formula images to LaTeX for better searchability and rendering.
Searchable image descriptions:
{
    "Picture": {
        "format": "Markdown",
        "strategy": "Auto",
        "crop_image": "All",
        "llm": "Describe this image in detail for a text-based search system.",
        "embed_sources": ["LLM"]
    }
}
Make images searchable by generating detailed text descriptions.
Section summaries for retrieval:
{
    "Text": {
        "format": "Markdown",
        "strategy": "Auto",
        "llm": "Summarize this section in 2-3 sentences, highlighting key points.",
        "embed_sources": ["LLM", "Content"]
    }
}
Provide summaries alongside full text for multi-level retrieval.

Performance Considerations

VLM processing increases latency and cost. Use strategically for best results.
Tips (combined into one configuration sketch after this list):
  • Use strategy: LLM only for complex content (tables, formulas, pictures)
  • Use strategy: Auto for simple text segments
  • Set extended_context: false unless you need full page context
  • Configure max_completion_tokens appropriately to control costs
  • Use temperature: 0.0 for consistent, deterministic output
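
A configuration sketch applying these tips (the token limit is illustrative):
{
  "segment_processing": {
    "Table": {
      "strategy": "LLM",
      "extended_context": false
    },
    "Formula": {
      "strategy": "LLM"
    },
    "Text": {
      "strategy": "Auto"
    }
  },
  "llm_processing": {
    "max_completion_tokens": 1024,
    "temperature": 0.0
  }
}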

Error Handling with VLM

Control how VLM errors are handled:
{
  "error_handling": "Continue"
}
Options:
  • Fail: Stop processing on any error (default)
  • Continue: Continue processing despite LLM refusals or failures
Use Continue for fault-tolerant processing when some VLM failures are acceptable.
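
Assuming error_handling sits at the top level of the parse request like the other global options, a fault-tolerant request body would look like:
{
  "file": "https://example.com/document.pdf",
  "error_handling": "Continue",
  "segment_processing": {
    "Table": {
      "strategy": "LLM"
    }
  }
}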

Best Practices

  1. Use VLM strategically - Only apply to segments that benefit from visual understanding
  2. Write clear prompts - Be specific about what you want in custom llm prompts
  3. Enable high resolution - Set high_resolution: true for better VLM input quality
  4. Test and iterate - Experiment with different prompts and configurations
  5. Monitor costs - VLM processing can be expensive at scale
  6. Choose appropriate models - Different models have different strengths and costs
