This guide helps you migrate from Chunkr’s legacy HTML/Markdown dual-generation system to the new unified format approach.
What Changed
Chunkr has consolidated content generation to improve performance and simplify the API.
Before (Legacy)
Python
After (New)
Python
Key Changes
| Field | Legacy Behavior | New Behavior |
|---|---|---|
| `segment.content` | OCR-extracted text | Generated content (HTML or Markdown) |
| `segment.text` | ❌ Didn’t exist | OCR-extracted text |
| `segment.html` | HTML representation | Still available (backward compatibility) |
| `segment.markdown` | Markdown representation | Still available (backward compatibility) |
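To make the field change concrete, here is a minimal sketch of a helper that reads OCR text from a segment in either response shape. It assumes segments are plain dictionaries keyed by the field names in the table; the `ocr_text` helper and the sample segments are hypothetical and are not part of the Chunkr SDK.

```python
from typing import Any, Mapping


def ocr_text(segment: Mapping[str, Any]) -> str:
    """Return OCR text from a segment in either response shape.

    Assumption: `segment` is a plain mapping using the field names from the
    table. New responses carry OCR text in `text`; legacy responses had no
    `text` field and stored OCR text in `content`.
    """
    if segment.get("text") is not None:
        return segment["text"]
    # Legacy shape: fall back to `content`, which held the OCR text.
    return segment.get("content", "")


# Hypothetical sample segments, for illustration only.
legacy_segment = {"content": "Total: 42", "html": "<p>Total: 42</p>"}
new_segment = {"content": "<p>Total: 42</p>", "text": "Total: 42"}

print(ocr_text(legacy_segment))
print(ocr_text(new_segment))
```

A shim like this can help during a staged rollout, but once every response uses the new shape, reading `segment.text` directly is simpler.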
Migration Priority
🚨 CRITICAL - Immediate Action Required
1. Pipeline.Chunkr Table Processing Default Change
Impact: Tables no longer generate AI-enhanced markdown by default.
Python
- Affected: applications using `Pipeline.Chunkr` with the default table config
- Affected: applications expecting AI-generated `segment.markdown` for tables
- Not affected: applications with explicit table configuration
⚠️ MEDIUM - Action Recommended
2. OCR Text Access Breaking Change
Impact: Applications accessing OCR text from `segment.content` will break.
Python
- Applications reading `segment.content` expecting OCR text
💡 LOW - Performance Optimization
3. Update to New Content Access Pattern
Impact: Performance improvement and future-proofing.
Python
Quick Self-Assessment
Am I affected by Critical Issue #1?
- Do I use `Pipeline.Chunkr` (layout analysis)?
- Do I process tables without explicit configuration?
- Do I expect `segment.markdown` to contain AI-enhanced table content?
- If yes to all: Add explicit table configuration immediately
Am I affected by Issue #2?
- Do I access `segment.content` anywhere in my code?
- Do I expect `segment.content` to contain OCR text?
- If yes to both: Update to `segment.text`
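The questions above can be encoded as a small checklist function. This is only a sketch: the `Usage` structure and `required_actions` function are hypothetical names introduced for illustration, and the action strings simply restate the guidance in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Usage:
    """Answers to the self-assessment questions (hypothetical structure)."""
    uses_pipeline_chunkr: bool
    tables_without_explicit_config: bool
    expects_ai_table_markdown: bool
    reads_segment_content: bool
    expects_ocr_in_content: bool


def required_actions(usage: Usage) -> List[str]:
    """Map self-assessment answers to the migration actions they imply."""
    actions: List[str] = []
    # Issue #1 applies only when all three table questions are answered yes.
    if (usage.uses_pipeline_chunkr
            and usage.tables_without_explicit_config
            and usage.expects_ai_table_markdown):
        actions.append("Add explicit table configuration")
    # Issue #2 applies only when both content questions are answered yes.
    if usage.reads_segment_content and usage.expects_ocr_in_content:
        actions.append("Read OCR text from segment.text")
    return actions


# Example: an application that relies on defaults and reads OCR text
# from segment.content is affected by both issues.
both = Usage(True, True, True, True, True)
print(required_actions(both))
```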
Migration Steps
Update Embed Sources
Replace format-specific embed sources:
Python
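As a minimal sketch of that replacement, the function below rewrites a list of embed sources so that the format-specific values become `Content`. It assumes embed sources are represented as plain strings named `HTML`, `Markdown`, and `Content`; the `migrate_embed_sources` helper is hypothetical, and collapsing duplicates is a design choice of this sketch rather than documented behavior.

```python
from typing import Iterable, List

# Format-specific embed sources that this guide replaces with "Content".
LEGACY_EMBED_SOURCES = {"HTML", "Markdown"}


def migrate_embed_sources(sources: Iterable[str]) -> List[str]:
    """Replace legacy embed sources with "Content", preserving order.

    Assumption: sources are plain strings. Values other than "HTML" and
    "Markdown" pass through unchanged; duplicates are collapsed so that
    ["HTML", "Markdown"] becomes a single "Content" entry.
    """
    migrated: List[str] = []
    for source in sources:
        replacement = "Content" if source in LEGACY_EMBED_SOURCES else source
        if replacement not in migrated:
            migrated.append(replacement)
    return migrated


print(migrate_embed_sources(["Markdown", "HTML"]))
```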
`Content` refers to the generated content based on your chosen format.
Complete Migration Example
Legacy Configuration
Python
New Configuration
Python
Backward Compatibility
The API maintains backward compatibility.
Configuration Deserialization
Legacy configurations still work:
Python
- If both `html` and `markdown` use `LLM`, prefer the default format for that segment type
- If one uses `LLM` and one uses `Auto`, use the `LLM` one
- If only one is set, use that format and strategy
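The three rules above can be sketched as a small resolver. This is an illustration, not Chunkr's actual deserializer: `resolve_legacy` is a hypothetical function, strategies and formats are modeled as plain strings, and the default format is passed in by the caller rather than looked up. Cases the rules do not cover (both strategies `Auto`, or neither set) fall back to the default format with `Auto`, which is an assumption of this sketch.

```python
from typing import Optional, Tuple


def resolve_legacy(
    html: Optional[str],
    markdown: Optional[str],
    default_format: str,
) -> Tuple[str, str]:
    """Resolve legacy per-format strategies into one (format, strategy) pair.

    `html` and `markdown` are the legacy strategies ("LLM", "Auto", or None
    when unset). `default_format` is the segment type's default format,
    supplied by the caller.
    """
    if html is not None and markdown is not None:
        if html == "LLM" and markdown == "LLM":
            # Rule 1: both use LLM, so prefer the segment type's default.
            return default_format, "LLM"
        if html == "LLM":
            # Rule 2: exactly one uses LLM, so that format wins.
            return "HTML", "LLM"
        if markdown == "LLM":
            return "Markdown", "LLM"
        # Not covered by the rules: assume the default format with Auto.
        return default_format, "Auto"
    if html is not None:
        # Rule 3: only one is set, so keep its format and strategy.
        return "HTML", html
    if markdown is not None:
        return "Markdown", markdown
    # Neither set: assume the default format with Auto.
    return default_format, "Auto"


print(resolve_legacy("LLM", "Auto", default_format="Markdown"))
```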
Response Fields
All fields remain populated:
Python
Benefits of Migration
Performance Improvements
- Reduced processing time: Generate only the format you need
- Lower resource usage: Single format generation vs dual format
- Faster API responses: Less content to transfer
Cleaner Architecture
- Simplified configuration: Choose format once instead of multiple strategies
- Better resource allocation: Focus processing on chosen format
- Clearer content contracts: Know exactly what format you’ll receive
Improved Developer Experience
- Unified content access: One field for generated content
- Clearer field purposes: `content` for generated, `text` for OCR
- Easier embed configuration: Use `Content` instead of format-specific sources
Common Migration Patterns
Pattern 1: Tables with HTML
Python
Pattern 2: Text with Markdown
Python
Pattern 3: Mixed Formats
Python
Troubleshooting
Tables not generating AI-enhanced content
Problem: Using default config, tables aren’t processed with LLM
Solution: Explicitly configure table processing.
Getting wrong content from segment.content
Problem: Expected OCR text but getting HTML/Markdown
Solution: Use `segment.text` for OCR text.
Embed sources not working
Problem: Using deprecated `HTML` or `Markdown` embed sources
Solution: Update to `Content`.
Need Help?
If you encounter issues during migration:
- Check the processing documents guide for current API usage
- Review custom chunking for segment configuration
- See VLM processing for LLM strategy details
- Contact support at mehul@chunkr.ai
Migration Checklist
- Updated all `segment_processing` configs to use `format` and `strategy`
- Changed OCR text access from `segment.content` to `segment.text`
- Updated `embed_sources` from `HTML`/`Markdown` to `Content`
- Added explicit table configuration if using default settings
- Tested with sample documents
- Verified chunk embed content is correct
- Updated documentation and examples
- Deployed and monitored in production
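Part of this checklist can be verified mechanically. The sketch below scans a configuration for leftover legacy settings. It assumes the configuration is a plain dictionary with a `segment_processing` mapping whose entries may carry `html`, `markdown`, `format`, `strategy`, and `embed_sources` keys; that shape and the `find_legacy_usage` helper are hypothetical stand-ins, so adapt the key names to your actual configuration objects.

```python
from typing import Any, Dict, List


def find_legacy_usage(config: Dict[str, Any]) -> List[str]:
    """Report legacy settings that the checklist asks you to update.

    Assumption: `config["segment_processing"]` maps segment type names to
    plain dicts. Legacy entries carry `html` or `markdown` strategy keys,
    or list "HTML" / "Markdown" among their `embed_sources`.
    """
    problems: List[str] = []
    for segment_type, settings in config.get("segment_processing", {}).items():
        for legacy_key in ("html", "markdown"):
            if legacy_key in settings:
                problems.append(
                    f"{segment_type}: legacy '{legacy_key}' key; use format and strategy"
                )
        for source in settings.get("embed_sources", []):
            if source in ("HTML", "Markdown"):
                problems.append(
                    f"{segment_type}: embed source '{source}'; use Content"
                )
    return problems


# Hypothetical configuration: Table is still legacy, Text is migrated.
sample_config = {
    "segment_processing": {
        "Table": {"html": "LLM", "embed_sources": ["HTML"]},
        "Text": {"format": "Markdown", "strategy": "Auto",
                 "embed_sources": ["Content"]},
    }
}

for problem in find_legacy_usage(sample_config):
    print(problem)
```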