OCR (Optical Character Recognition) is the process of extracting text from images and documents. Chunkr supports multiple OCR strategies to balance speed and accuracy.

OCR Strategies

Chunkr provides two OCR strategies, Auto and All. Auto is selected with:
{
  "ocr_strategy": "Auto"
}
Smart OCR with text layer fallback

The Auto strategy decides when to apply OCR.

How it works:
  1. Check if PDF has embedded text layer
  2. If text layer exists and has content → Use text layer bounding boxes
  3. If text layer is missing/empty → Apply OCR to all pages
Benefits:
  • ✅ Fastest for PDFs with text layers
  • ✅ No OCR latency for digital documents
  • ✅ Preserves original text accuracy
  • ✅ Automatic fallback for scanned documents
Use cases:
  • Digital PDFs (generated from Word, LaTeX, etc.)
  • Mixed documents (some digital, some scanned pages)
  • When speed is important
  • Default recommended strategy
Performance: no OCR latency for digital PDFs; ~0.5s per page for scanned documents
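
The decision steps above can be sketched roughly as follows. This is an illustration only, not Chunkr's implementation: `TextLayerWord`, `OcrSource`, and `chooseOcrSource` are hypothetical names, and treating a whitespace-only text layer as empty is an assumption.

```typescript
// Hypothetical shapes for illustration; not Chunkr's internal types.
interface TextLayerWord {
  text: string;
}

type OcrSource = "text_layer" | "ocr";

// Pick the text source under the Auto strategy.
// `textLayer` is null when the PDF has no embedded text layer.
function chooseOcrSource(textLayer: TextLayerWord[] | null): OcrSource {
  // Assumption: a layer containing only whitespace counts as empty.
  const hasContent =
    textLayer !== null && textLayer.some((w) => w.text.trim().length > 0);
  return hasContent ? "text_layer" : "ocr";
}

console.log(chooseOcrSource([{ text: "Introduction" }])); // text_layer
console.log(chooseOcrSource([{ text: "   " }]));          // ocr
console.log(chooseOcrSource(null));                        // ocr
```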

OCR Output Structure

OCR results contain detailed text information:
interface OCRResult {
  bbox: {
    left: number;      // X coordinate
    top: number;       // Y coordinate  
    width: number;     // Bounding box width
    height: number;    // Bounding box height
  };
  text: string;        // Extracted text
  confidence?: number; // OCR confidence (0.0-1.0)
}

Example OCR Results

[
  {
    "bbox": { "left": 72, "top": 120, "width": 450, "height": 24 },
    "text": "Introduction",
    "confidence": 0.98
  },
  {
    "bbox": { "left": 72, "top": 160, "width": 480, "height": 16 },
    "text": "This document describes the architecture of our system.",
    "confidence": 0.95
  }
]
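
As a small usage sketch, results in this shape can be joined into plain text by sorting on position. The helper name `toReadingOrderText` is hypothetical, and the sort assumes that `top` grows downward from the top of the page, which this page does not state explicitly. Sorting by `top` then `left` is a naive reading order that will not handle multi-column layouts.

```typescript
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface OCRResult {
  bbox: BoundingBox;
  text: string;
  confidence?: number;
}

// Sort a copy of the results top-to-bottom, then left-to-right, and join the text.
function toReadingOrderText(results: OCRResult[]): string {
  return [...results]
    .sort((a, b) => a.bbox.top - b.bbox.top || a.bbox.left - b.bbox.left)
    .map((r) => r.text)
    .join(" ");
}

const example: OCRResult[] = [
  {
    bbox: { left: 72, top: 160, width: 480, height: 16 },
    text: "This document describes the architecture of our system.",
    confidence: 0.95,
  },
  {
    bbox: { left: 72, top: 120, width: 450, height: 24 },
    text: "Introduction",
    confidence: 0.98,
  },
];

console.log(toReadingOrderText(example));
// Introduction This document describes the architecture of our system.
```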

How OCR Works

1. PDF Text Layer Extraction

For the Auto strategy, Chunkr first attempts to extract the PDF text layer.

The text layer includes:
  • Text content from PDF
  • Bounding box coordinates
  • Font and styling information (discarded)
Benefits of text layer:
  • Instant extraction (no processing)
  • Perfect accuracy (original text)
  • Exact positioning

2. OCR Processing

When OCR is needed, Chunkr runs its OCR models on the page images.

OCR pipeline steps:
  1. Text Detection: Locate text regions in the image
  2. Text Recognition: Recognize characters in each region
  3. Bounding Box Generation: Create precise coordinates
  4. Confidence Scoring: Assess recognition quality

3. Batched OCR

OCR processes pages in batches for efficiency:
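const batchSize = config.general_ocr_batch_size;  // e.g., 10 pages

// pageBatches holds the document's pages split into groups of batchSize
const ocrResults = [];
for (const batch of pageBatches) {
  // Process multiple pages together and keep each batch's results
  ocrResults.push(...await performGeneralOCR(batch));
}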
Benefits:
  • Faster processing for multi-page documents
  • Efficient GPU utilization
  • Reduced API overhead
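
One way the pages might be split into batches is sketched below. `toBatches` is a hypothetical helper, not part of Chunkr; it only shows the grouping step, with the batch size standing in for `general_ocr_batch_size`.

```typescript
// Split a list into consecutive groups of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// A 23-page document with a batch size of 10 yields batches of 10, 10, and 3 pages.
const pageNumbers = Array.from({ length: 23 }, (_, i) => i + 1);
console.log(toBatches(pageNumbers, 10).map((b) => b.length)); // [ 10, 10, 3 ]
```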

OCR Quality and Confidence

Confidence Scores

When using ocr_strategy: "All", each OCR result includes a confidence score:
  • > 0.95: Excellent quality, very reliable
  • 0.85 - 0.95: Good quality, usually accurate
  • 0.70 - 0.85: Medium quality, mostly correct
  • < 0.70: Low quality, may have errors
Filter out low-confidence results for quality control:
const highQualityText = ocrResults
  .filter(r => (r.confidence ?? 0) > 0.9)  // results without a score are excluded
  .map(r => r.text)
  .join(' ');
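
The bands listed above can be turned into a small classifier. This is a sketch with a hypothetical `qualityBand` helper. The listed ranges share their endpoints, so the code assumes that exactly 0.85 counts as "good" and exactly 0.70 as "medium", and it reports results without a score as "unknown".

```typescript
type QualityBand = "excellent" | "good" | "medium" | "low" | "unknown";

// Map a confidence score to the quality bands described above.
function qualityBand(confidence: number | undefined): QualityBand {
  if (confidence === undefined) return "unknown"; // e.g. text taken from the PDF text layer
  if (confidence > 0.95) return "excellent";
  if (confidence >= 0.85) return "good";   // assumption: 0.85 falls in the upper band
  if (confidence >= 0.7) return "medium";  // assumption: 0.70 falls in the upper band
  return "low";
}

console.log(qualityBand(0.98));      // excellent
console.log(qualityBand(0.9));       // good
console.log(qualityBand(0.75));      // medium
console.log(qualityBand(0.4));       // low
console.log(qualityBand(undefined)); // unknown
```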

Factors Affecting OCR Quality

Improves OCR accuracy:
  • ✅ High-resolution images (high_resolution: true)
  • ✅ Clear, readable fonts
  • ✅ Good contrast (dark text on light background)
  • ✅ Horizontal text orientation
  • ✅ Standard languages (English, etc.)
Reduces OCR accuracy:
  • ❌ Low-resolution or blurry images
  • ❌ Decorative or handwritten fonts
  • ❌ Poor contrast or faded text
  • ❌ Rotated or skewed text
  • ❌ Complex backgrounds

OCR Result Assignment

OCR results are assigned to segments during layout analysis:
// 1. Add padding to each segment's bounding box
const paddedBboxes = new Map();
for (const segment of segments) {
  paddedBboxes.set(segment, {
    left: segment.bbox.left - padding,
    top: segment.bbox.top - padding,
    width: segment.bbox.width + padding * 2,
    height: segment.bbox.height + padding * 2
  });
}

// 2. Find best matching segment for each OCR result
// calculateIntersection(a, b) returns the overlap area of two boxes (0 if disjoint)
for (const ocr of ocrResults) {
  let bestSegment = null;
  let bestArea = 0;

  for (const segment of segments) {
    const intersectionArea = calculateIntersection(
      ocr.bbox,
      paddedBboxes.get(segment)
    );

    if (intersectionArea > bestArea) {
      bestArea = intersectionArea;
      bestSegment = segment;
    }
  }

  // 3. Assign to best matching segment (results with no overlap are skipped)
  if (bestSegment) {
    bestSegment.ocr.push({
      ...ocr,
      // Adjust coordinates relative to segment
      bbox: {
        left: ocr.bbox.left - bestSegment.bbox.left,
        top: ocr.bbox.top - bestSegment.bbox.top,
        width: ocr.bbox.width,
        height: ocr.bbox.height
      }
    });
  }
}

// 4. Generate each segment's text from its OCR results
for (const segment of segments) {
  segment.text = segment.ocr
    .map(r => r.text)
    .join(' ');
}
OCR coordinates are converted from absolute page coordinates to segment-relative coordinates for easier processing.
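
The assignment step depends on an overlap-area helper. A minimal sketch of such a helper for `left`/`top`/`width`/`height` boxes is shown below. It is an assumption about how `calculateIntersection` might behave, not Chunkr's actual code.

```typescript
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Area of the overlapping rectangle of two axis-aligned boxes, or 0 if they do not overlap.
function calculateIntersection(a: BoundingBox, b: BoundingBox): number {
  const overlapWidth =
    Math.min(a.left + a.width, b.left + b.width) - Math.max(a.left, b.left);
  const overlapHeight =
    Math.min(a.top + a.height, b.top + b.height) - Math.max(a.top, b.top);
  return overlapWidth > 0 && overlapHeight > 0 ? overlapWidth * overlapHeight : 0;
}

const word = { left: 0, top: 0, width: 10, height: 10 };
const segmentBox = { left: 5, top: 5, width: 10, height: 10 };
console.log(calculateIntersection(word, segmentBox)); // 25
```

Boxes that only touch along an edge have zero overlap here, so an OCR result sitting exactly on a segment border is not matched to that segment; the padding step exists to absorb such near misses.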

Text Layer vs OCR

When to Use Each

Use Auto (Text Layer)

Best for:
  • Digital PDFs
  • Generated documents
  • Speed-critical applications
  • High-accuracy requirements
Documents:
  • Word/Google Docs exports
  • LaTeX-generated PDFs
  • Web page prints
  • Software-generated reports

Use All (OCR)

Best for:
  • Scanned documents
  • Images of documents
  • Inconsistent quality
  • Need confidence scores
Documents:
  • Scanned paperwork
  • Photos of documents
  • Faxes
  • Historical documents

Performance Comparison

Strategy | Digital PDF | Scanned PDF | Image
Auto     | Instant     | ~0.5s/page  | ~0.5s/page
All      | ~0.5s/page  | ~0.5s/page  | ~0.5s/page
For a 10-page digital PDF:
  • Auto: ~0 seconds (uses text layer)
  • All: ~5 seconds (OCR all pages)
For a 10-page scanned PDF:
  • Auto: ~5 seconds (OCR all pages)
  • All: ~5 seconds (OCR all pages)

Error Handling

Chunkr includes fallback mechanisms for OCR failures:

With error_handling: "Fail"

try {
  ocrResults = await performOCR(pages);
} catch (error) {
  // Task fails immediately
  throw new Error('OCR processing failed');
}

With error_handling: "Continue"

try {
  ocrResults = await performOCR(pages);
} catch (error) {
  console.log('OCR failed, using PDF text layer fallback');
  // Falls back to PDF text layer
  ocrResults = extractPDFTextLayer(pdf);
  
  // If text layer also empty, continue with empty OCR
  if (ocrResults.every(r => r.length === 0)) {
    ocrResults = pages.map(() => []);
  }
}
Fallback order:
  1. Try primary OCR
  2. Fall back to PDF text layer
  3. Continue with empty OCR (segmentation only)

Configuration Examples

Digital Documents (Fastest)

{
  "ocr_strategy": "Auto",
  "high_resolution": false
}
Uses text layer when available, minimal processing time.

Scanned Documents (Balanced)

{
  "ocr_strategy": "All",
  "high_resolution": true
}
Always applies OCR with high-resolution images for best accuracy.

{
  "ocr_strategy": "Auto",
  "high_resolution": true,
  "error_handling": "Continue"
}
Smart strategy with high quality and graceful fallbacks.

Speed-Optimized

{
  "ocr_strategy": "Auto",
  "high_resolution": false,
  "segmentation_strategy": "Page"
}
Minimizes processing time for simple documents.

Advanced Topics

Coordinate Scaling

OCR coordinates are scaled based on resolution settings:
const scalingFactor = config.high_resolution 
  ? config.high_res_scaling_factor  // e.g., 2.0
  : 1.0;

// Scale OCR bounding boxes
ocrResults.forEach(result => {
  result.bbox.left *= scalingFactor;
  result.bbox.top *= scalingFactor;
  result.bbox.width *= scalingFactor;
  result.bbox.height *= scalingFactor;
});
This ensures OCR coordinates match the processed image dimensions.
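
As a worked example, the same scaling can be written as a pure function that returns a new box rather than mutating the result in place. `scaleBbox` is a hypothetical helper, and the factor of 2.0 is the example value mentioned above, not a confirmed default.

```typescript
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Return a scaled copy of the box; the input is left unchanged.
function scaleBbox(bbox: BoundingBox, factor: number): BoundingBox {
  return {
    left: bbox.left * factor,
    top: bbox.top * factor,
    width: bbox.width * factor,
    height: bbox.height * factor,
  };
}

const original = { left: 72, top: 120, width: 450, height: 24 };
console.log(scaleBbox(original, 2.0));
// { left: 144, top: 240, width: 900, height: 48 }
```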

Multi-Language Support

Chunkr’s OCR models support multiple languages:
  • ✅ English (primary)
  • ✅ Spanish, French, German
  • ✅ Chinese, Japanese, Korean
  • ✅ Many other languages
Language is automatically detected - no configuration needed.

Monitoring OCR Quality

Track OCR quality in your application:
// Calculate average confidence per segment, ignoring results without a score
const scores = segment.ocr
  .map(r => r.confidence)
  .filter(c => c !== undefined);
const avgConfidence = scores.length > 0
  ? scores.reduce((sum, c) => sum + c, 0) / scores.length
  : undefined;

if (avgConfidence !== undefined && avgConfidence < 0.8) {
  console.warn(`Low OCR quality for segment ${segment.segment_id}`);
}

// Count low-confidence words
const lowConfidenceWords = scores
  .filter(c => c < 0.7)
  .length;

Next Steps

Segmentation

Learn how OCR results are assigned to segments

Pipelines

Understand the complete processing pipeline

Chunking

See how text is combined into chunks

API Reference

Complete API documentation