OCR

OCR (Optical Character Recognition) is the process of extracting text from images and documents. Chunkr supports multiple OCR strategies to balance speed and accuracy.

OCR Strategies

Chunkr provides two OCR strategies:

{
  "ocr_strategy": "Auto"
}

Auto

Smart OCR with text layer fallback. The Auto strategy decides when to apply OCR based on whether the PDF already has a usable text layer.

How it works:

Check if PDF has embedded text layer
If text layer exists and has content → Use text layer bounding boxes
If text layer is missing/empty → Apply OCR to all pages
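The decision above can be sketched as a small pure function. This is an illustration only, not Chunkr's implementation: the `TextLayerWord` shape and the `chooseTextSource` name are hypothetical, and the sketch assumes the text layer counts as "having content" when at least one page holds a non-blank word.

```typescript
// Hypothetical shape: one word extracted from the embedded PDF text layer.
interface TextLayerWord {
  text: string;
  bbox: { left: number; top: number; width: number; height: number };
}

type TextSource = "text_layer" | "ocr";

// Returns "text_layer" when any page has non-blank embedded text,
// otherwise "ocr" (the text layer is missing or empty).
function chooseTextSource(textLayerPages: TextLayerWord[][]): TextSource {
  const hasContent = textLayerPages.some(page =>
    page.some(word => word.text.trim().length > 0)
  );
  return hasContent ? "text_layer" : "ocr";
}
```

Whitespace-only words are treated as empty here so that a text layer made up solely of blanks still triggers OCR.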

Benefits:

✅ Fastest for PDFs with text layers
✅ No OCR latency for digital documents
✅ Preserves original text accuracy
✅ Automatic fallback for scanned documents

Use cases:

Digital PDFs (generated from Word, LaTeX, etc.)
Mixed documents (some digital, some scanned pages)
When speed is important
Default recommended strategy

Performance: No added OCR latency for digital PDFs; roughly 0.5 seconds per page for scanned documents

All

Always apply OCR. The All strategy applies OCR to every page, regardless of whether a text layer exists.

How it works:

Convert each page to image
Run OCR model on image
Extract text with bounding boxes and confidence scores

Benefits:

✅ Consistent processing for all documents
✅ Confidence scores for text quality
✅ Better handling of complex layouts
✅ Can improve accuracy for poorly rendered text

Use cases:

Scanned documents
Images of documents
When you need OCR confidence scores
Documents with rendering issues

Performance: Adds ~0.5 seconds per page

OCR Output Structure

OCR results contain detailed text information:

interface OCRResult {
  bbox: {
    left: number;      // X coordinate
    top: number;       // Y coordinate  
    width: number;     // Bounding box width
    height: number;    // Bounding box height
  };
  text: string;        // Extracted text
  confidence?: number; // OCR confidence (0.0-1.0)
}

Example OCR Results

[
  {
    "bbox": { "left": 72, "top": 120, "width": 450, "height": 24 },
    "text": "Introduction",
    "confidence": 0.98
  },
  {
    "bbox": { "left": 72, "top": 160, "width": 480, "height": 16 },
    "text": "This document describes the architecture of our system.",
    "confidence": 0.95
  }
]
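If you parse results like these from JSON, a runtime check can confirm that each entry matches the `OCRResult` shape before you use it. The guard below is a sketch: `isOCRResult` and `isFiniteNumber` are hypothetical helper names, not part of Chunkr, and the 0.0-1.0 range check on `confidence` simply mirrors the comment in the interface.

```typescript
interface BBox { left: number; top: number; width: number; height: number; }
interface OCRResult { bbox: BBox; text: string; confidence?: number; }

function isFiniteNumber(v: unknown): v is number {
  return typeof v === "number" && Number.isFinite(v);
}

// Runtime check that an unknown value has the OCRResult shape.
function isOCRResult(value: unknown): value is OCRResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;

  const b = v["bbox"] as Record<string, unknown> | null | undefined;
  if (typeof b !== "object" || b === null) return false;
  const bboxOk = ["left", "top", "width", "height"].every(k => isFiniteNumber(b[k]));

  // confidence is optional; when present it must lie in [0, 1]
  const c = v["confidence"];
  const confidenceOk = c === undefined || (isFiniteNumber(c) && c >= 0 && c <= 1);

  return bboxOk && typeof v["text"] === "string" && confidenceOk;
}
```

A typical use is `parsed.filter(isOCRResult)` on the array returned by `JSON.parse`, which also narrows the element type for the rest of the code.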

How OCR Works

1. PDF Text Layer Extraction

For the Auto strategy, Chunkr first attempts to extract the PDF text layer.

The text layer includes:

Text content from PDF
Bounding box coordinates
Font and styling information (discarded)

Benefits of text layer:

Instant extraction (no processing)
Perfect accuracy (original text)
Exact positioning

2. OCR Processing

When OCR is needed, Chunkr runs its OCR models on the page images.

OCR pipeline steps:

Text Detection: Locate text regions in the image
Text Recognition: Recognize characters in each region
Bounding Box Generation: Create precise coordinates
Confidence Scoring: Assess recognition quality

3. Batched OCR

OCR processes pages in batches for efficiency:

const batchSize = config.general_ocr_batch_size;  // e.g., 10 pages

// Process multiple pages together
for (const batch of pageBatches) {
  const ocrResults = await performGeneralOCR(batch);
}

Benefits:

Faster processing for multi-page documents
Efficient GPU utilization
Reduced API overhead
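The snippet above iterates over `pageBatches` without showing how they are built. One way to form them, as a minimal sketch: `toBatches` is a hypothetical helper, and the assumption is that pages are grouped consecutively in document order.

```typescript
// Split an array into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    // slice() clamps at the end of the array, so the last batch may be shorter
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With a batch size of 10, a 25-page document yields three batches of 10, 10, and 5 pages.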

OCR Quality and Confidence

Confidence Scores

When using ocr_strategy: "All", each OCR result includes a confidence score:

> 0.95: Excellent quality, very reliable
0.85 - 0.95: Good quality, usually accurate
0.70 - 0.85: Medium quality, mostly correct
< 0.70: Low quality, may have errors

Filter low-confidence results for quality control:

const highQualityText = ocrResults
  // Results without a confidence score are treated as 0 and dropped
  .filter(r => (r.confidence ?? 0) > 0.9)
  .map(r => r.text)
  .join(' ');
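The confidence bands listed above can also be expressed as a lookup, which is handy for logging or reporting. This is a sketch: `confidenceBand` is a hypothetical helper, and because the listed ranges share their endpoints, it assumes each lower bound is inclusive (so 0.95 counts as "good" and 0.85 as "good" rather than "medium").

```typescript
type ConfidenceBand = "excellent" | "good" | "medium" | "low" | "unknown";

// Map a confidence score to the quality bands described in this section.
function confidenceBand(confidence?: number): ConfidenceBand {
  if (confidence === undefined) return "unknown"; // no score available
  if (confidence > 0.95) return "excellent";
  if (confidence >= 0.85) return "good";
  if (confidence >= 0.7) return "medium";
  return "low";
}
```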

Factors Affecting OCR Quality

Improves OCR accuracy:

✅ High-resolution images (high_resolution: true)
✅ Clear, readable fonts
✅ Good contrast (dark text on light background)
✅ Horizontal text orientation
✅ Standard languages (English, etc.)

Reduces OCR accuracy:

❌ Low-resolution or blurry images
❌ Decorative or handwritten fonts
❌ Poor contrast or faded text
❌ Rotated or skewed text
❌ Complex backgrounds

OCR Result Assignment

OCR results are assigned to segments during layout analysis:

// 1. Add padding to a segment's bounding box
const padBbox = (bbox: OCRResult['bbox']) => ({
  left: bbox.left - padding,
  top: bbox.top - padding,
  width: bbox.width + padding * 2,
  height: bbox.height + padding * 2
});

// 2. Find best matching segment for each OCR result
for (const ocr of ocrResults) {
  let bestSegment = null;
  let bestArea = 0;

  for (const segment of segments) {
    // Overlap between the OCR box and the padded segment box
    const intersectionArea = calculateIntersection(
      ocr.bbox,
      padBbox(segment.bbox)
    );

    if (intersectionArea > bestArea) {
      bestArea = intersectionArea;
      bestSegment = segment;
    }
  }

  // 3. Assign to best matching segment (results with no overlap are skipped)
  if (bestSegment) {
    bestSegment.ocr.push({
      ...ocr,
      // Adjust coordinates relative to segment
      bbox: {
        left: ocr.bbox.left - bestSegment.bbox.left,
        top: ocr.bbox.top - bestSegment.bbox.top,
        width: ocr.bbox.width,
        height: ocr.bbox.height
      }
    });
  }
}

// 4. Generate each segment's text from its assigned OCR results
for (const segment of segments) {
  segment.text = segment.ocr
    .map(r => r.text)
    .join(' ');
}
OCR coordinates are converted from absolute page coordinates to segment-relative coordinates for easier processing.
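The assignment code relies on a `calculateIntersection` helper that is not shown. For axis-aligned boxes in the left/top/width/height form used here, the overlap area could be computed as in the sketch below. The name `intersectionArea` is hypothetical, and this is one plausible implementation rather than Chunkr's own.

```typescript
interface BBox { left: number; top: number; width: number; height: number; }

// Area of overlap between two axis-aligned boxes; 0 when they do not overlap.
function intersectionArea(a: BBox, b: BBox): number {
  const overlapWidth =
    Math.min(a.left + a.width, b.left + b.width) - Math.max(a.left, b.left);
  const overlapHeight =
    Math.min(a.top + a.height, b.top + b.height) - Math.max(a.top, b.top);
  // Boxes that only touch along an edge have zero overlap
  return overlapWidth > 0 && overlapHeight > 0 ? overlapWidth * overlapHeight : 0;
}
```

Returning 0 for non-overlapping boxes fits the `intersectionArea > bestArea` comparison in the loop: an OCR result that overlaps no segment never becomes a best match.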

Text Layer vs OCR

When to Use Each

Use Auto (Text Layer)

Best for:

Digital PDFs
Generated documents
Speed-critical applications
High-accuracy requirements

Documents:

Word/Google Docs exports
LaTeX-generated PDFs
Web page prints
Software-generated reports

Use All (OCR)

Best for:

Scanned documents
Images of documents
Inconsistent quality
Need confidence scores

Documents:

Scanned paperwork
Photos of documents
Faxes
Historical documents

Performance Comparison

Strategy	Digital PDF	Scanned PDF	Image
Auto	Instant	~0.5s/page	~0.5s/page
All	~0.5s/page	~0.5s/page	~0.5s/page

For a 10-page digital PDF:

Auto: ~0 seconds (uses text layer)
All: ~5 seconds (OCR all pages)

For a 10-page scanned PDF:

Auto: ~5 seconds (OCR all pages)
All: ~5 seconds (OCR all pages)
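The arithmetic behind these figures can be sketched as a small estimator. It assumes the approximate 0.5 seconds per page quoted on this page and ignores every other pipeline stage; `estimateOcrSeconds` and `SECONDS_PER_OCR_PAGE` are hypothetical names, and real timings will vary with hardware and document content.

```typescript
type OcrStrategy = "Auto" | "All";

// Approximate per-page OCR cost quoted in this section (an estimate, not a guarantee).
const SECONDS_PER_OCR_PAGE = 0.5;

// Rough OCR time: Auto skips OCR when a usable text layer exists; All never skips.
function estimateOcrSeconds(
  strategy: OcrStrategy,
  pageCount: number,
  hasTextLayer: boolean
): number {
  const skipsOcr = strategy === "Auto" && hasTextLayer;
  return skipsOcr ? 0 : pageCount * SECONDS_PER_OCR_PAGE;
}
```

For a 10-page digital PDF this gives 0 seconds under Auto and 5 seconds under All, matching the comparison above.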

Error Handling

Chunkr includes fallback mechanisms for OCR failures:

With `error_handling: "Fail"`

try {
  ocrResults = await performOCR(pages);
} catch (error) {
  // Task fails immediately
  throw new Error('OCR processing failed');
}

With `error_handling: "Continue"`

try {
  ocrResults = await performOCR(pages);
} catch (error) {
  console.log('OCR failed, using PDF text layer fallback');
  // Falls back to PDF text layer
  ocrResults = extractPDFTextLayer(pdf);
  
  // If text layer also empty, continue with empty OCR
  if (ocrResults.every(r => r.length === 0)) {
    ocrResults = pages.map(() => []);
  }
}

Fallback order:

Try primary OCR
Fall back to PDF text layer
Continue with empty OCR (segmentation only)

Configuration Examples

Digital Documents (Fastest)

{
  "ocr_strategy": "Auto",
  "high_resolution": false
}

Uses the text layer when available, which keeps processing time to a minimum.

Scanned Documents (Balanced)

{
  "ocr_strategy": "All",
  "high_resolution": true
}

Always applies OCR with high-resolution images for best accuracy.

Mixed Documents (Recommended)

{
  "ocr_strategy": "Auto",
  "high_resolution": true,
  "error_handling": "Continue"
}

Smart strategy with high quality and graceful fallbacks.

Speed-Optimized

{
  "ocr_strategy": "Auto",
  "high_resolution": false,
  "segmentation_strategy": "Page"
}

Minimizes processing time for simple documents.
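If you build these configurations in code, a typed helper can keep the option names and values consistent. The sketch below is limited to the fields and values that appear on this page. The `OcrConfigSketch` type, the `DocumentKind` labels, and `configFor` are hypothetical and do not describe the full Chunkr configuration schema.

```typescript
// Only the options shown on this page; the real schema may contain more.
interface OcrConfigSketch {
  ocr_strategy: "Auto" | "All";
  high_resolution?: boolean;
  error_handling?: "Fail" | "Continue";
  segmentation_strategy?: string; // only "Page" appears on this page
}

type DocumentKind = "digital" | "scanned" | "mixed";

// Return the example configuration from this section for a kind of document.
function configFor(kind: DocumentKind): OcrConfigSketch {
  switch (kind) {
    case "digital":
      return { ocr_strategy: "Auto", high_resolution: false };
    case "scanned":
      return { ocr_strategy: "All", high_resolution: true };
    case "mixed":
      return { ocr_strategy: "Auto", high_resolution: true, error_handling: "Continue" };
  }
}
```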

Advanced Topics

Coordinate Scaling

OCR coordinates are scaled based on resolution settings:

const scalingFactor = config.high_resolution 
  ? config.high_res_scaling_factor  // e.g., 2.0
  : 1.0;

// Scale OCR bounding boxes
ocrResults.forEach(result => {
  result.bbox.left *= scalingFactor;
  result.bbox.top *= scalingFactor;
  result.bbox.width *= scalingFactor;
  result.bbox.height *= scalingFactor;
});

This ensures OCR coordinates match the processed image dimensions.
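A non-mutating variant of the same scaling, together with its inverse, is useful when you need to map a box between the processed image and the original page. This is a sketch under the assumption that scaling is uniform in both axes; `scaleBBox` and `unscaleBBox` are hypothetical helper names.

```typescript
interface BBox { left: number; top: number; width: number; height: number; }

// Return a new box with every coordinate multiplied by `factor`.
function scaleBBox(bbox: BBox, factor: number): BBox {
  return {
    left: bbox.left * factor,
    top: bbox.top * factor,
    width: bbox.width * factor,
    height: bbox.height * factor,
  };
}

// Inverse of scaleBBox: divide every coordinate by `factor`.
function unscaleBBox(bbox: BBox, factor: number): BBox {
  if (factor === 0) throw new RangeError("factor must be non-zero");
  return scaleBBox(bbox, 1 / factor);
}
```

Returning a new object leaves the original OCR result untouched, which avoids accidentally scaling the same box twice.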

Multi-Language Support

Chunkr’s OCR models support multiple languages:

✅ English (primary)
✅ Spanish, French, German
✅ Chinese, Japanese, Korean
✅ Many other languages

Language is detected automatically; no configuration is needed.

Monitoring OCR Quality

Track OCR quality in your application:

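// Keep only results that carry a confidence score
const scored = segment.ocr.filter(
  (r): r is OCRResult & { confidence: number } => r.confidence !== undefined
);

// Calculate average confidence per segment, dividing by the number of scored results
const avgConfidence = scored.length > 0
  ? scored.reduce((sum, r) => sum + r.confidence, 0) / scored.length
  : undefined;

if (avgConfidence !== undefined && avgConfidence < 0.8) {
  console.warn(`Low OCR quality for segment ${segment.segment_id}`);
}

// Count low-confidence words
const lowConfidenceWords = scored
  .filter(r => r.confidence < 0.7)
  .length;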

Next Steps

Segmentation

Learn how OCR results are assigned to segments

Pipelines

Understand the complete processing pipeline

Chunking

See how text is combined into chunks

API Reference

Complete API documentation
