

This guide helps you migrate from Chunkr’s legacy HTML/Markdown dual-generation system to the new unified format approach.

What Changed

Chunkr has consolidated content generation to improve performance and simplify the API.

Before (Legacy)

Python
# Could generate both formats simultaneously
config = {
    "segment_processing": {
        "Table": {
            "html": "LLM",      # Generate HTML
            "markdown": "Auto"  # Also generate Markdown
        }
    }
}

# Content in separate fields
html_content = segment["html"]
markdown_content = segment["markdown"]
ocr_text = segment["content"]  # OCR text was here

After (New)

Python
# Choose one format per segment type
config = {
    "segment_processing": {
        "Table": {
            "format": "Html",     # Choose format
            "strategy": "LLM"     # Choose generation strategy
        }
    }
}

# Content in unified field
generated_content = segment["content"]  # HTML or Markdown (your choice)
ocr_text = segment["text"]              # OCR text moved here

Key Changes

The purpose of segment.content has changed completely: it now contains generated HTML/Markdown instead of OCR text.
Field            | Legacy Behavior          | New Behavior
segment.content  | OCR-extracted text       | Generated content (HTML or Markdown)
segment.text     | ❌ Didn't exist          | OCR-extracted text
segment.html     | HTML representation      | Still available (backward compatibility)
segment.markdown | Markdown representation  | Still available (backward compatibility)

Migration Priority

🚨 CRITICAL - Immediate Action Required

1. Pipeline.Chunkr Table Processing Default Change

Impact: Tables no longer generate AI-enhanced markdown by default
Python
# ❌ BROKEN - Default config no longer generates AI markdown for tables
# Old behavior: Tables got both HTML and AI-enhanced markdown automatically

# ✅ FIXED - Explicitly configure AI table processing
config = {
    "segment_processing": {
        "Table": {
            "format": "Markdown",
            "strategy": "LLM"
        }
    }
}
Who’s affected:
  • Applications using Pipeline.Chunkr with default table config
  • Expecting AI-generated segment.markdown for tables
Who’s not affected:
  • Applications with explicit table configuration

⚠️ MEDIUM - Update Required

2. OCR Text Access Breaking Change

Impact: Applications accessing OCR text from segment.content will break
Python
# ❌ BROKEN - Will now return HTML/Markdown instead of OCR text
ocr_text = segment["content"]

# ✅ FIXED - Update to use segment.text
ocr_text = segment["text"]
Who’s affected:
  • Applications reading segment.content expecting OCR text

💡 LOW - Performance Optimization

3. Update to New Content Access Pattern

Impact: Performance improvement and future-proofing
Python
# 📊 OLD - Accessing format-specific fields
html_content = segment["html"]
markdown_content = segment["markdown"]

# ⚡ NEW - Access unified content field
generated_content = segment["content"]  # Contains format you requested
Who’s affected: All applications (optional upgrade)

Quick Self-Assessment

Am I affected by Critical Issue #1?
  • Do I use Pipeline.Chunkr (layout analysis)?
  • Do I process tables without explicit configuration?
  • Do I expect segment.markdown to contain AI-enhanced table content?
  • If yes to all: Add explicit table configuration immediately
Am I affected by Medium Issue #2?
  • Do I access segment.content anywhere in my code?
  • Do I expect segment.content to contain OCR text?
  • If yes to both: Update to segment.text
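Before migrating, it can help to inventory the call sites that will break. This rough sketch (the function name and regex are ours, not part of any Chunkr tooling) scans a Python codebase for reads of segment["content"]:

```python
import pathlib
import re

# Flags reads of segment["content"] or segment.content, which now return
# generated HTML/Markdown rather than OCR text.
PATTERN = re.compile(r'segment(\["content"\]|\.content)')

def find_risky_reads(root: str = ".") -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every suspect read under root."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit then needs a decision: switch to segment["text"] if you wanted OCR text, or keep segment["content"] if you want the generated format.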

Migration Steps

Step 1: Update Configuration Format

Replace legacy html and markdown fields with format and strategy:
Python
# Legacy
old_config = {
    "segment_processing": {
        "Page": {
            "html": "LLM",
            "markdown": "LLM",
            "embed_sources": ["Markdown"]
        }
    }
}

# New
new_config = {
    "segment_processing": {
        "Page": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        }
    }
}
Step 2: Update Content Access

Change how you access OCR text:
Python
# Legacy
for segment in chunk["segments"]:
    ocr_text = segment["content"]
    html = segment["html"]
    markdown = segment["markdown"]

# New
for segment in chunk["segments"]:
    ocr_text = segment["text"]        # Changed!
    generated = segment["content"]     # Now contains HTML or Markdown
    html = segment["html"]             # Still available
    markdown = segment["markdown"]     # Still available
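If you need one code path that works against both response shapes during rollout, a small pair of accessors can bridge them. This is only a sketch: the helper names are ours, and it uses the presence of the new "text" field to detect which shape it was handed.

```python
def get_ocr_text(segment: dict) -> str:
    """OCR text: new responses use "text"; legacy responses used "content"."""
    if "text" in segment:
        return segment["text"]
    return segment.get("content", "")

def get_generated(segment: dict, legacy_field: str = "html") -> str:
    """Generated content: the unified "content" field on new responses,
    falling back to the legacy per-format field ("html" or "markdown")."""
    if "text" in segment:
        return segment.get("content", "")
    return segment.get(legacy_field, "")
```

Once every caller goes through these helpers, removing legacy support later is a two-line change.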
Step 3: Update Embed Sources

Replace format-specific embed sources:
Python
# Legacy
old_embed = {
    "embed_sources": ["HTML", "Markdown", "LLM"]
}

# New
new_embed = {
    "embed_sources": ["Content", "LLM"]
}
Content refers to the generated content based on your chosen format
Step 4: Test Thoroughly

Verify that:
  • Table processing generates expected content
  • OCR text is correctly accessed from segment.text
  • Embed fields contain expected content
  • Chunk lengths are calculated correctly
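The checks above can be automated as a quick smoke test over a completed task. A minimal sketch, assuming the task/segment dict shapes shown in this guide (the segment_type key is an assumption; confirm the field names against your own responses):

```python
def check_task_output(task: dict) -> None:
    """Raise AssertionError if a completed task violates the new response shape."""
    for chunk in task["output"]["chunks"]:
        for segment in chunk["segments"]:
            # OCR text now lives in "text"; generated content in "content".
            assert "text" in segment, "missing OCR text field"
            assert "content" in segment, "missing generated content field"
            if segment.get("segment_type") == "Table":
                # With format="Html", generated tables should contain table markup.
                assert "<table" in segment["content"].lower(), "table lacks HTML markup"
```

Run it against a few representative documents before deploying the new configuration.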

Complete Migration Example

Legacy Configuration

Python
import requests

# Legacy approach
legacy_config = {
    "file": "https://example.com/document.pdf",
    "segment_processing": {
        "Page": {
            "html": "LLM",
            "markdown": "LLM",
            "embed_sources": ["Markdown"]
        },
        "Table": {
            "html": "Auto",
            "markdown": "LLM",
            "embed_sources": ["HTML", "Markdown"]
        },
        "Picture": {
            "html": "Auto",
            "markdown": "Auto",
            "embed_sources": ["Markdown"]
        }
    }
}

# Legacy content access
for chunk in task["output"]["chunks"]:
    for segment in chunk["segments"]:
        ocr_text = segment["content"]      # OCR text
        html = segment["html"]
        markdown = segment["markdown"]

New Configuration

Python
import requests

# New approach
new_config = {
    "file": "https://example.com/document.pdf",
    "segment_processing": {
        "Page": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        },
        "Table": {
            "format": "Html",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        },
        "Picture": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        }
    }
}

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json=new_config
)
task = response.json()  # in practice, poll the task until processing completes

# New content access
for chunk in task["output"]["chunks"]:
    for segment in chunk["segments"]:
        ocr_text = segment["text"]         # OCR text moved here!
        generated = segment["content"]      # HTML or Markdown based on format
        
        # Backward compatible fields still available
        html = segment["html"]
        markdown = segment["markdown"]

Backward Compatibility

The API maintains backward compatibility:

Configuration Deserialization

Legacy configurations still work:
Python
# This legacy config still works
legacy = {
    "segment_processing": {
        "Table": {
            "html": "LLM",
            "markdown": "Auto"
        }
    }
}
Resolution logic:
  1. If both html and markdown use LLM, prefer the default format for that segment type
  2. If one uses LLM and one uses Auto, use the LLM one
  3. If only one is set, use that format and strategy
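For reference, the resolution rules can be expressed as a small function. This is our sketch of the logic described above, not the server's actual implementation; default_format stands in for the per-segment-type default format:

```python
def resolve_legacy(segment_cfg: dict, default_format: str = "Html") -> dict:
    """Map a legacy {"html": ..., "markdown": ...} config to {"format", "strategy"}."""
    html = segment_cfg.get("html")
    md = segment_cfg.get("markdown")
    if html == "LLM" and md == "LLM":
        # Rule 1: both use LLM -> prefer the segment type's default format.
        return {"format": default_format, "strategy": "LLM"}
    if html == "LLM":
        # Rule 2: the side using LLM wins.
        return {"format": "Html", "strategy": "LLM"}
    if md == "LLM":
        return {"format": "Markdown", "strategy": "LLM"}
    # Rule 3: only one is set (or both are Auto) -> use whichever is present.
    if md is not None:
        return {"format": "Markdown", "strategy": md}
    return {"format": "Html", "strategy": html or "Auto"}
```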

Response Fields

All fields remain populated:
Python
# All fields are still available in responses
segment = {
    "content": "<table>...</table>",  # Generated content (NEW purpose)
    "text": "Raw OCR text",           # OCR text (NEW field)
    "html": "<table>...</table>",     # HTML (backward compatibility)
    "markdown": "| A | B |\n...",   # Markdown (backward compatibility)
}

Benefits of Migration

  • Reduced processing time: Generate only the format you need
  • Lower resource usage: Single format generation vs dual format
  • Faster API responses: Less content to transfer
  • Simplified configuration: Choose format once instead of multiple strategies
  • Better resource allocation: Focus processing on chosen format
  • Clearer content contracts: Know exactly what format you’ll receive
  • Unified content access: One field for generated content
  • Clearer field purposes: content for generated, text for OCR
  • Easier embed configuration: Use Content instead of format-specific sources

Common Migration Patterns

Pattern 1: Tables with HTML

Python
# Before
{"Table": {"html": "LLM", "markdown": "Auto"}}

# After
{"Table": {"format": "Html", "strategy": "LLM"}}

Pattern 2: Text with Markdown

Python
# Before
{"Text": {"html": "Auto", "markdown": "Auto"}}

# After
{"Text": {"format": "Markdown", "strategy": "Auto"}}

Pattern 3: Mixed Formats

Python
# Before
{
    "segment_processing": {
        "Table": {"html": "LLM", "markdown": "Auto"},
        "Text": {"html": "Auto", "markdown": "Auto"}
    }
}

# After
{
    "segment_processing": {
        "Table": {"format": "Html", "strategy": "LLM"},
        "Text": {"format": "Markdown", "strategy": "Auto"}
    }
}

Troubleshooting

Problem: Using the default config, tables aren't processed with an LLM
Solution: Explicitly configure table processing:
{
    "segment_processing": {
        "Table": {
            "format": "Html",
            "strategy": "LLM"
        }
    }
}
Problem: Expected OCR text but getting HTML/Markdown
Solution: Use segment.text for OCR text:
ocr_text = segment["text"]  # Not segment["content"]
Problem: Using deprecated HTML or Markdown embed sources
Solution: Update to Content:
{"embed_sources": ["Content", "LLM"]}  # Not ["HTML", "Markdown"]

Need Help?

If you encounter issues during migration, review the troubleshooting section above and work through the checklist below.

Migration Checklist

  • Updated all segment_processing configs to use format and strategy
  • Changed OCR text access from segment.content to segment.text
  • Updated embed_sources from HTML/Markdown to Content
  • Added explicit table configuration if using default settings
  • Tested with sample documents
  • Verified chunk embed content is correct
  • Updated documentation and examples
  • Deployed and monitored in production