OCR (Optical Character Recognition) is the process of extracting text from images and documents. Chunkr supports multiple OCR strategies to balance speed and accuracy.
OCR Strategies
Chunkr provides two OCR strategies:
- Auto
- All
Smart OCR with text layer fallback
The Auto strategy applies OCR only when the document needs it.
How it works:
- Check if the PDF has an embedded text layer
- If a text layer exists and has content → Use text layer bounding boxes
- If the text layer is missing/empty → Apply OCR to all pages
Benefits:
- ✅ Fastest for PDFs with text layers
- ✅ No OCR latency for digital documents
- ✅ Preserves original text accuracy
- ✅ Automatic fallback for scanned documents
Best for:
- Digital PDFs (generated from Word, LaTeX, etc.)
- Mixed documents (some digital, some scanned pages)
- When speed is important
- Default recommended strategy
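The decision described under "How it works" can be sketched as a small function. This is only an illustration of the logic, not Chunkr's implementation: `PageTextLayer` and `choose_text_source` are hypothetical names, and the check for "has content" (any non-blank word) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageTextLayer:
    """Hypothetical container for the words found in one page's text layer."""
    words: List[str]


def choose_text_source(pages: List[PageTextLayer]) -> str:
    """Return "text_layer" if the document has embedded text, else "ocr".

    Assumption: a text layer "has content" when at least one word on
    any page is non-blank.
    """
    has_content = any(word.strip() for page in pages for word in page.words)
    return "text_layer" if has_content else "ocr"
```

A digital PDF with extractable words would take the `"text_layer"` branch and skip OCR entirely, while a scanned PDF with no words falls through to `"ocr"`.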
OCR Output Structure
OCR results contain detailed text information:
Example OCR Results
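As an illustrative sketch, a single OCR result can be modeled as a small record holding the recognized text, its bounding box, and an optional confidence. The field names below (`text`, `bbox`, `confidence`, and the `left`/`top`/`width`/`height` box layout) and the sample values are assumptions for illustration, not a confirmed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundingBox:
    """Axis-aligned box; the left/top/width/height layout is assumed."""
    left: float
    top: float
    width: float
    height: float


@dataclass
class OcrResult:
    """One recognized piece of text and where it sits on the page."""
    text: str
    bbox: BoundingBox
    # Assumed optional: text taken from a PDF text layer may carry no score.
    confidence: Optional[float] = None


# Made-up sample values, purely for illustration.
example = OcrResult(
    text="Invoice",
    bbox=BoundingBox(left=72.0, top=40.5, width=110.0, height=18.0),
    confidence=0.98,
)
```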
How OCR Works
1. PDF Text Layer Extraction
For the Auto strategy, Chunkr first attempts to extract the PDF text layer.
The text layer includes:
- Text content from PDF
- Bounding box coordinates
- Font and styling information (discarded)
Advantages:
- Instant extraction (no OCR processing)
- Perfect accuracy (original text)
- Exact positioning
2. OCR Processing
When OCR is needed, Chunkr uses advanced OCR models.
OCR pipeline steps:
- Text Detection: Locate text regions in the image
- Text Recognition: Recognize characters in each region
- Bounding Box Generation: Create precise coordinates
- Confidence Scoring: Assess recognition quality
3. Batched OCR
OCR processes pages in batches for efficiency:
- Faster processing for multi-page documents
- Efficient GPU utilization
- Reduced API overhead
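Batching itself is simple to picture: pages are grouped into fixed-size chunks and each chunk is sent through OCR in one call. The helper below is a generic sketch; the name `batched` and the batch size of 4 are assumptions and do not reflect Chunkr's actual batch size or internals.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(pages: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `batch_size` pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


# A 10-page document split into batches of 4 (assumed size).
page_numbers = list(range(1, 11))
batches = list(batched(page_numbers, batch_size=4))
```

With these inputs the document is processed in three OCR calls instead of ten, which is where the reduced per-call overhead comes from.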
OCR Quality and Confidence
Confidence Scores
When using ocr_strategy: "All", each OCR result includes a confidence score:
- > 0.95: Excellent quality, very reliable
- 0.85 - 0.95: Good quality, usually accurate
- 0.70 - 0.85: Medium quality, mostly correct
- < 0.70: Low quality, may have errors
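The ranges above translate directly into a small classifier, which is also a convenient way to track OCR quality across a document. The function names are hypothetical, and because the listed ranges share their endpoints, the boundary handling here (0.95 counts as "good", 0.85 as "good", 0.70 as "medium") is an assumption.

```python
from typing import Dict, Iterable


def quality_label(confidence: float) -> str:
    """Map a confidence score to one of the four quality bands."""
    if confidence > 0.95:
        return "excellent"
    if confidence >= 0.85:
        return "good"
    if confidence >= 0.70:
        return "medium"
    return "low"


def summarize(confidences: Iterable[float]) -> Dict[str, int]:
    """Count how many OCR results fall into each quality band."""
    counts = {"excellent": 0, "good": 0, "medium": 0, "low": 0}
    for value in confidences:
        counts[quality_label(value)] += 1
    return counts
```

A document whose summary is dominated by "low" results is a reasonable candidate for reprocessing at higher resolution or for manual review.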
Factors Affecting OCR Quality
Improves OCR accuracy:
- ✅ High-resolution images (high_resolution: true)
- ✅ Clear, readable fonts
- ✅ Good contrast (dark text on light background)
- ✅ Horizontal text orientation
- ✅ Standard languages (English, etc.)
Reduces OCR accuracy:
- ❌ Low-resolution or blurry images
- ❌ Decorative or handwritten fonts
- ❌ Poor contrast or faded text
- ❌ Rotated or skewed text
- ❌ Complex backgrounds
OCR Result Assignment
OCR results are assigned to segments during layout analysis. OCR coordinates are converted from absolute page coordinates to segment-relative coordinates for easier processing.
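The conversion to segment-relative coordinates amounts to subtracting the segment's origin from the OCR box's origin. The sketch below assumes a `left`/`top`/`width`/`height` box layout with the origin at the top-left of the page; the `Box` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box; the left/top/width/height layout is assumed."""
    left: float
    top: float
    width: float
    height: float


def to_segment_relative(ocr_box: Box, segment_box: Box) -> Box:
    """Shift an OCR box from page coordinates into its segment's frame.

    Only the origin changes; width and height are unaffected.
    """
    return Box(
        left=ocr_box.left - segment_box.left,
        top=ocr_box.top - segment_box.top,
        width=ocr_box.width,
        height=ocr_box.height,
    )
```

For example, a word at page position (120, 250) inside a segment whose top-left corner is (100, 200) ends up at (20, 50) relative to that segment.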
Text Layer vs OCR
When to Use Each
Use Auto (Text Layer)
Best for:
- Digital PDFs
- Generated documents
- Speed-critical applications
- High-accuracy requirements
Examples:
- Word/Google Docs exports
- LaTeX-generated PDFs
- Web page prints
- Software-generated reports
Use All (OCR)
Best for:
- Scanned documents
- Images of documents
- Inconsistent quality
- Need confidence scores
Examples:
- Scanned paperwork
- Photos of documents
- Faxes
- Historical documents
Performance Comparison
| Strategy | Digital PDF | Scanned PDF | Image |
|---|---|---|---|
| Auto | Instant | ~0.5s/page | ~0.5s/page |
| All | ~0.5s/page | ~0.5s/page | ~0.5s/page |
For a 10-page digital PDF:
- Auto: ~0 seconds (uses text layer)
- All: ~5 seconds (OCR all pages)
For a 10-page scanned PDF:
- Auto: ~5 seconds (OCR all pages)
- All: ~5 seconds (OCR all pages)
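The arithmetic behind these estimates can be written out as a back-of-the-envelope helper. It uses the approximate ~0.5 s/page figure from the table and treats text layer extraction as taking no time; the function name is hypothetical and real timings will vary with hardware and page content.

```python
SECONDS_PER_OCR_PAGE = 0.5  # approximate per-page figure from the table


def estimate_ocr_seconds(strategy: str, pages: int, has_text_layer: bool) -> float:
    """Rough OCR time estimate for a document under a given strategy."""
    if strategy not in ("Auto", "All"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Auto skips OCR entirely when a usable text layer is present.
    if strategy == "Auto" and has_text_layer:
        return 0.0
    return pages * SECONDS_PER_OCR_PAGE
```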
Error Handling
Chunkr includes fallback mechanisms for OCR failures:
With error_handling: "Fail"
With error_handling: "Continue"
- Try primary OCR
- Fall back to PDF text layer
- Continue with empty OCR (segmentation only)
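The fallback order above can be sketched as a chain of attempts. This is a simplified illustration rather than Chunkr's code: `ocr_with_fallback` and its callable parameters are hypothetical, and the behavior under "Fail" (re-raising the first OCR error instead of falling back) is an assumption.

```python
from typing import Callable, List

OcrFn = Callable[[], List[str]]


def ocr_with_fallback(
    primary_ocr: OcrFn,
    text_layer: OcrFn,
    error_handling: str = "Continue",
) -> List[str]:
    """Try primary OCR, then the PDF text layer, then give up with no text."""
    try:
        return primary_ocr()
    except Exception:
        if error_handling == "Fail":
            # Assumed behavior: surface the error instead of falling back.
            raise
    try:
        return text_layer()
    except Exception:
        # Continue with empty OCR results (segmentation only).
        return []
```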
Configuration Examples
Digital Documents (Fastest)
Scanned Documents (Balanced)
Mixed Documents (Recommended)
Speed-Optimized
Advanced Topics
Coordinate Scaling
OCR coordinates are scaled based on resolution settings.
Multi-Language Support
Chunkr’s OCR models support multiple languages:
- ✅ English (primary)
- ✅ Spanish, French, German
- ✅ Chinese, Japanese, Korean
- ✅ Many other languages
Monitoring OCR Quality
Track OCR quality in your application.
Next Steps
Segmentation
Learn how OCR results are assigned to segments
Pipelines
Understand the complete processing pipeline
Chunking
See how text is combined into chunks
API Reference
Complete API documentation