

This guide helps you migrate from Chunkr’s legacy HTML/Markdown dual-generation system to the new unified format approach.

What Changed

Chunkr has consolidated content generation to improve performance and simplify the API.

Before (Legacy)

Python
# Could generate both formats simultaneously
config = {
    "segment_processing": {
        "Table": {
            "html": "LLM",      # Generate HTML
            "markdown": "Auto"  # Also generate Markdown
        }
    }
}

# Content in separate fields
html_content = segment["html"]
markdown_content = segment["markdown"]
ocr_text = segment["content"]  # OCR text was here

After (New)

Python
# Choose one format per segment type
config = {
    "segment_processing": {
        "Table": {
            "format": "Html",     # Choose format
            "strategy": "LLM"     # Choose generation strategy
        }
    }
}

# Content in unified field
generated_content = segment["content"]  # HTML or Markdown (your choice)
ocr_text = segment["text"]              # OCR text moved here

Key Changes

The purpose of segment.content has changed completely: it now contains generated HTML/Markdown instead of OCR text.
Field            | Legacy Behavior          | New Behavior
segment.content  | OCR-extracted text       | Generated content (HTML or Markdown)
segment.text     | ❌ Didn't exist          | OCR-extracted text
segment.html     | HTML representation      | Still available (backward compatibility)
segment.markdown | Markdown representation  | Still available (backward compatibility)

Migration Priority

🚨 CRITICAL - Immediate Action Required

1. Pipeline.Chunkr Table Processing Default Change

Impact: Tables no longer generate AI-enhanced markdown by default
Python
# ❌ BROKEN - Default config no longer generates AI markdown for tables
# Old behavior: Tables got both HTML and AI-enhanced markdown automatically

# ✅ FIXED - Explicitly configure AI table processing
config = {
    "segment_processing": {
        "Table": {
            "format": "Markdown",
            "strategy": "LLM"
        }
    }
}
Who’s affected:
  • Applications using Pipeline.Chunkr with default table config
  • Expecting AI-generated segment.markdown for tables
Who’s not affected:
  • Applications with explicit table configuration

⚠️ MEDIUM - Update Required

2. OCR Text Access Breaking Change

Impact: Applications accessing OCR text from segment.content will break
Python
# ❌ BROKEN - Will now return HTML/Markdown instead of OCR text
ocr_text = segment["content"]

# ✅ FIXED - Update to use segment.text
ocr_text = segment["text"]
Who’s affected:
  • Applications reading segment.content expecting OCR text

💡 LOW - Performance Optimization

3. Update to New Content Access Pattern

Impact: Performance improvement and future-proofing
Python
# 📊 OLD - Accessing format-specific fields
html_content = segment["html"]
markdown_content = segment["markdown"]

# ⚡ NEW - Access unified content field
generated_content = segment["content"]  # Contains format you requested
Who’s affected: All applications (optional upgrade)

Quick Self-Assessment

Am I affected by Critical Issue #1?
  • Do I use Pipeline.Chunkr (layout analysis)?
  • Do I process tables without explicit configuration?
  • Do I expect segment.markdown to contain AI-enhanced table content?
  • If yes to all: Add explicit table configuration immediately
Am I affected by Medium Issue #2?
  • Do I access segment.content anywhere in my code?
  • Do I expect segment.content to contain OCR text?
  • If yes to both: Update to segment.text
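Before migrating, it can help to inventory the call sites that will break. This rough sketch (the function name and regex are ours, not part of any Chunkr tooling) scans a Python codebase for reads of segment["content"]:

```python
import pathlib
import re

# Flags reads of segment["content"] or segment.content, which now return
# generated HTML/Markdown rather than OCR text.
PATTERN = re.compile(r'segment(\["content"\]|\.content)')

def find_risky_reads(root: str = ".") -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every suspect read under root."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit then needs a decision: switch to segment["text"] if you wanted OCR text, or keep segment["content"] if you want the generated format.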

Migration Steps

Step 1: Update Configuration Format

Replace legacy html and markdown fields with format and strategy:
Python
# Legacy
old_config = {
    "segment_processing": {
        "Page": {
            "html": "LLM",
            "markdown": "LLM",
            "embed_sources": ["Markdown"]
        }
    }
}

# New
new_config = {
    "segment_processing": {
        "Page": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        }
    }
}
Step 2: Update Content Access

Change how you access OCR text:
Python
# Legacy
for segment in chunk["segments"]:
    ocr_text = segment["content"]
    html = segment["html"]
    markdown = segment["markdown"]

# New
for segment in chunk["segments"]:
    ocr_text = segment["text"]        # Changed!
    generated = segment["content"]     # Now contains HTML or Markdown
    html = segment["html"]             # Still available
    markdown = segment["markdown"]     # Still available
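If you need one code path that works against both response shapes during rollout, a small pair of accessors can bridge them. This is only a sketch: the helper names are ours, and it uses the presence of the new "text" field to detect which shape it was handed.

```python
def get_ocr_text(segment: dict) -> str:
    """OCR text: new responses use "text"; legacy responses used "content"."""
    if "text" in segment:
        return segment["text"]
    return segment.get("content", "")

def get_generated(segment: dict, legacy_field: str = "html") -> str:
    """Generated content: the unified "content" field on new responses,
    falling back to the legacy per-format field ("html" or "markdown")."""
    if "text" in segment:
        return segment.get("content", "")
    return segment.get(legacy_field, "")
```

Once every caller goes through these helpers, removing legacy support later is a two-line change.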
Step 3: Update Embed Sources

Replace format-specific embed sources:
Python
# Legacy
old_embed = {
    "embed_sources": ["HTML", "Markdown", "LLM"]
}

# New
new_embed = {
    "embed_sources": ["Content", "LLM"]
}
Content refers to the generated content based on your chosen format
Step 4: Test Thoroughly

Verify that:
  • Table processing generates expected content
  • OCR text is correctly accessed from segment.text
  • Embed fields contain expected content
  • Chunk lengths are calculated correctly
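The checks above can be automated as a quick smoke test over a completed task. A minimal sketch, assuming the task/segment dict shapes shown in this guide (the segment_type key is an assumption; confirm the field names against your own responses):

```python
def check_task_output(task: dict) -> None:
    """Raise AssertionError if a completed task violates the new response shape."""
    for chunk in task["output"]["chunks"]:
        for segment in chunk["segments"]:
            # OCR text now lives in "text"; generated content in "content".
            assert "text" in segment, "missing OCR text field"
            assert "content" in segment, "missing generated content field"
            if segment.get("segment_type") == "Table":
                # With format="Html", generated tables should contain table markup.
                assert "<table" in segment["content"].lower(), "table lacks HTML markup"
```

Run it against a few representative documents before deploying the new configuration.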

Complete Migration Example

Legacy Configuration

Python
import requests

# Legacy approach
legacy_config = {
    "file": "https://example.com/document.pdf",
    "segment_processing": {
        "Page": {
            "html": "LLM",
            "markdown": "LLM",
            "embed_sources": ["Markdown"]
        },
        "Table": {
            "html": "Auto",
            "markdown": "LLM",
            "embed_sources": ["HTML", "Markdown"]
        },
        "Picture": {
            "html": "Auto",
            "markdown": "Auto",
            "embed_sources": ["Markdown"]
        }
    }
}

# Legacy content access
for chunk in task["output"]["chunks"]:
    for segment in chunk["segments"]:
        ocr_text = segment["content"]      # OCR text
        html = segment["html"]
        markdown = segment["markdown"]

New Configuration

Python
import requests

# New approach
new_config = {
    "file": "https://example.com/document.pdf",
    "segment_processing": {
        "Page": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        },
        "Table": {
            "format": "Html",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        },
        "Picture": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        }
    }
}

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json=new_config
)
task = response.json()  # in practice, poll the task until processing completes

# New content access
for chunk in task["output"]["chunks"]:
    for segment in chunk["segments"]:
        ocr_text = segment["text"]         # OCR text moved here!
        generated = segment["content"]      # HTML or Markdown based on format
        
        # Backward compatible fields still available
        html = segment["html"]
        markdown = segment["markdown"]

Backward Compatibility

The API maintains backward compatibility:

Configuration Deserialization

Legacy configurations still work:
Python
# This legacy config still works
legacy = {
    "segment_processing": {
        "Table": {
            "html": "LLM",
            "markdown": "Auto"
        }
    }
}
Resolution logic:
  1. If both html and markdown use LLM, prefer the default format for that segment type
  2. If one uses LLM and one uses Auto, use the LLM one
  3. If only one is set, use that format and strategy
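For reference, the resolution rules can be expressed as a small function. This is our sketch of the logic described above, not the server's actual implementation; default_format stands in for the per-segment-type default format:

```python
def resolve_legacy(segment_cfg: dict, default_format: str = "Html") -> dict:
    """Map a legacy {"html": ..., "markdown": ...} config to {"format", "strategy"}."""
    html = segment_cfg.get("html")
    md = segment_cfg.get("markdown")
    if html == "LLM" and md == "LLM":
        # Rule 1: both use LLM -> prefer the segment type's default format.
        return {"format": default_format, "strategy": "LLM"}
    if html == "LLM":
        # Rule 2: the side using LLM wins.
        return {"format": "Html", "strategy": "LLM"}
    if md == "LLM":
        return {"format": "Markdown", "strategy": "LLM"}
    # Rule 3: only one is set (or both are Auto) -> use whichever is present.
    if md is not None:
        return {"format": "Markdown", "strategy": md}
    return {"format": "Html", "strategy": html or "Auto"}
```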

Response Fields

All fields remain populated:
Python
# All fields are still available in responses
segment = {
    "content": "<table>...</table>",  # Generated content (NEW purpose)
    "text": "Raw OCR text",           # OCR text (NEW field)
    "html": "<table>...</table>",     # HTML (backward compatibility)
    "markdown": "| A | B |\n...",   # Markdown (backward compatibility)
}

Benefits of Migration

  • Reduced processing time: Generate only the format you need
  • Lower resource usage: Single format generation vs dual format
  • Faster API responses: Less content to transfer
  • Simplified configuration: Choose format once instead of multiple strategies
  • Better resource allocation: Focus processing on chosen format
  • Clearer content contracts: Know exactly what format you’ll receive
  • Unified content access: One field for generated content
  • Clearer field purposes: content for generated, text for OCR
  • Easier embed configuration: Use Content instead of format-specific sources

Common Migration Patterns

Pattern 1: Tables with HTML

Python
# Before
{"Table": {"html": "LLM", "markdown": "Auto"}}

# After
{"Table": {"format": "Html", "strategy": "LLM"}}

Pattern 2: Text with Markdown

Python
# Before
{"Text": {"html": "Auto", "markdown": "Auto"}}

# After
{"Text": {"format": "Markdown", "strategy": "Auto"}}

Pattern 3: Mixed Formats

Python
# Before
{
    "segment_processing": {
        "Table": {"html": "LLM", "markdown": "Auto"},
        "Text": {"html": "Auto", "markdown": "Auto"}
    }
}

# After
{
    "segment_processing": {
        "Table": {"format": "Html", "strategy": "LLM"},
        "Text": {"format": "Markdown", "strategy": "Auto"}
    }
}

Troubleshooting

Problem: Using the default config, tables aren't processed with an LLM
Solution: Explicitly configure table processing:
{
    "segment_processing": {
        "Table": {
            "format": "Html",
            "strategy": "LLM"
        }
    }
}
Problem: Expected OCR text but getting HTML/Markdown
Solution: Use segment.text for OCR text:
ocr_text = segment["text"]  # Not segment["content"]
Problem: Using deprecated HTML or Markdown embed sources
Solution: Update to Content:
{"embed_sources": ["Content", "LLM"]}  # Not ["HTML", "Markdown"]

Need Help?

If you encounter issues during migration, review the troubleshooting section above and work through the checklist below.

Migration Checklist

  • Updated all segment_processing configs to use format and strategy
  • Changed OCR text access from segment.content to segment.text
  • Updated embed_sources from HTML/Markdown to Content
  • Added explicit table configuration if using default settings
  • Tested with sample documents
  • Verified chunk embed content is correct
  • Updated documentation and examples
  • Deployed and monitored in production