Segmentation

Segmentation is the process of detecting and classifying layout elements in documents. Chunkr’s segmentation engine identifies regions like tables, images, text blocks, and other document components.

Segmentation Strategies

Chunkr supports two segmentation strategies:

{
  "segmentation_strategy": "LayoutAnalysis"
}

LayoutAnalysis

Detect and classify layout elements

The LayoutAnalysis strategy uses computer vision models to detect document layout elements and classify them into specific types.

Features:

Detects 11 different segment types
Provides bounding boxes for each element
Assigns confidence scores
Enables fine-grained chunking
Supports complex document layouts

Use cases:

Academic papers with tables and formulas
Reports with mixed content
Documents requiring precise element extraction
When you need to process tables and images separately

Performance: Adds minimal latency (batched processing)

Page

Treat each page as a single segment

The Page strategy processes each page as one segment without detecting individual layout elements.

Features:

Fastest processing
Simple output structure
Full-page OCR results
Basic chunking only

Use cases:

Simple text documents
When speed is critical
Documents without complex layouts
When element-level processing isn’t needed

Performance: Fastest option, no layout analysis overhead
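As a rough rule of thumb drawn from the use cases above, the choice between the two strategies can be sketched in TypeScript. The `DocumentTraits` shape and the `chooseStrategy` helper are hypothetical and not part of Chunkr; only the two strategy names come from this page.

```typescript
// The two strategy names documented above.
type SegmentationStrategy = "LayoutAnalysis" | "Page";

// Hypothetical description of a document; not a Chunkr type.
interface DocumentTraits {
  hasTablesOrFigures: boolean;      // tables, images, or formulas to handle separately
  needsElementLevelOutput: boolean; // precise per-element extraction is required
}

// Illustrative heuristic only: use LayoutAnalysis when individual elements
// matter, otherwise take the faster Page strategy.
function chooseStrategy(doc: DocumentTraits): SegmentationStrategy {
  if (doc.hasTablesOrFigures || doc.needsElementLevelOutput) {
    return "LayoutAnalysis";
  }
  return "Page";
}

// Example: an academic paper with tables.
const exampleConfig = {
  segmentation_strategy: chooseStrategy({
    hasTablesOrFigures: true,
    needsElementLevelOutput: false,
  }),
};
```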

Segment Types

When using LayoutAnalysis, Chunkr detects these segment types:

Title

Document or section titles

Characteristics:

Large, prominent text
Usually at document/section start
High hierarchy level (3)
Often triggers new chunks

Example: Main document title, chapter headings

SectionHeader

Section and subsection headers

Characteristics:

Medium hierarchy level (2)
Introduces new sections
Triggers chunk boundaries
Smaller than Title, larger than body text

Example: Section headings, subsection titles

Text

Regular body text paragraphs

Characteristics:

Most common segment type
Hierarchy level 1
Combines with adjacent text in chunks

Example: Paragraphs, body content

ListItem

Bullet points and numbered list items

Characteristics:

Individual list entries
Can be chunked together
Preserves list structure

Example: Bullet points, enumerated items

Table

Tabular data and structured content

Characteristics:

Complex structure
Usually processed with LLM strategy
Can be paired with Caption
Cropped image available

Example: Data tables, comparison charts

Default processing: LLM-based HTML generation

Picture

Images, figures, and diagrams

Characteristics:

Visual content
Always cropped
Can be paired with Caption
May contain OCR results if text present

Example: Photos, diagrams, charts

Default processing: Image URL with optional description

Caption

Image and table captions

Characteristics:

Describes associated visual element
Kept with paired Picture/Table in chunks
Usually smaller text below/above element

Example: “Figure 1: Architecture diagram”

Formula

Mathematical formulas and equations

Characteristics:

Mathematical notation
Processed with LLM for LaTeX generation
May be inline or block-level

Example: Equations, mathematical expressions

Default processing: LLM-based LaTeX extraction

Footnote

Footnotes and references

Characteristics:

Small text at page bottom
References to main content
Usually numbered or marked

Example: Citations, additional notes

PageHeader

Page headers

Characteristics:

Appears at top of pages
Often repetitive across pages
Can be excluded from chunks

Example: Document title, chapter name

Note: Excluded by default when ignore_headers_and_footers: true

PageFooter

Page footers

Characteristics:

Appears at bottom of pages
Often contains page numbers
Can be excluded from chunks

Example: Page numbers, copyright text

Note: Excluded by default when ignore_headers_and_footers: true

Page

Full page segment (only with Page strategy)

Characteristics:

Entire page as one segment
Used when no layout analysis performed
Contains all page OCR results

When used: Only with segmentation_strategy: "Page"
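The segment types above can be written down as a TypeScript union, which makes it easy to look up the documented hierarchy levels or drop headers and footers on the client side. This is a minimal sketch: the type names come from this page, but `HIERARCHY_LEVEL` and `dropHeadersAndFooters` are hypothetical helpers, and only the three hierarchy levels stated above are included.

```typescript
// Segment type names as listed on this page.
type SegmentType =
  | "Title" | "SectionHeader" | "Text" | "ListItem" | "Table" | "Picture"
  | "Caption" | "Formula" | "Footnote" | "PageHeader" | "PageFooter" | "Page";

// Hierarchy levels stated above; other types have no documented level here.
const HIERARCHY_LEVEL: Partial<Record<SegmentType, number>> = {
  Title: 3,
  SectionHeader: 2,
  Text: 1,
};

// Hypothetical client-side helper: remove page headers and footers, similar
// in spirit to setting ignore_headers_and_footers: true.
function dropHeadersAndFooters<T extends { segment_type: SegmentType }>(
  segments: T[],
): T[] {
  return segments.filter(
    (s) => s.segment_type !== "PageHeader" && s.segment_type !== "PageFooter",
  );
}
```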

Segment Structure

Each segment contains rich metadata:

interface Segment {
  // Identification
  segment_id: string;
  segment_type: SegmentType;
  
  // Position
  bbox: {
    left: number;
    top: number;
    width: number;
    height: number;
  };
  
  // Quality
  confidence?: number;  // 0.0 to 1.0
  
  // Content (generated in segment processing step)
  text: string;         // OCR-extracted text
  content: string;      // Formatted (HTML or Markdown)
  html: string;         // HTML representation
  markdown: string;     // Markdown representation
  llm?: string;         // LLM-generated content
  
  // Page context
  page_number: number;
  page_width: number;
  page_height: number;
  
  // Additional data
  image?: string;       // URL to cropped image
  ocr?: OCRResult[];    // Detailed OCR results
}
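Because each segment carries both its bounding box and the page dimensions, positions can be made resolution-independent. The sketch below uses only the `bbox`, `page_width`, and `page_height` fields from the interface above; the helper names are hypothetical, and it assumes the bounding box and page dimensions share the same units with a top-left origin, which this page does not state explicitly.

```typescript
// Subset of the Segment fields shown above.
interface BBox { left: number; top: number; width: number; height: number; }
interface SegmentGeometry { bbox: BBox; page_width: number; page_height: number; }

// Convert a bbox to fractions of the page (0..1).
// Assumes bbox and page dimensions use the same units.
function normalizeBBox(seg: SegmentGeometry): BBox {
  return {
    left: seg.bbox.left / seg.page_width,
    top: seg.bbox.top / seg.page_height,
    width: seg.bbox.width / seg.page_width,
    height: seg.bbox.height / seg.page_height,
  };
}

// Fraction of the page area covered by the segment.
function pageCoverage(seg: SegmentGeometry): number {
  return (seg.bbox.width * seg.bbox.height) / (seg.page_width * seg.page_height);
}
```

A segment with a coverage close to 1 spans nearly the whole page, which is what a `Page` segment would report.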

How Segmentation Works

1. Object Detection

Chunkr uses object detection models to identify layout elements. For each detected element, the model outputs:

Bounding boxes: [left, top, width, height] coordinates
Class predictions: Integer class IDs (0-10)
Confidence scores: Detection confidence (0.0-1.0)

2. Class Mapping

Class IDs are mapped to segment types:

0 => Caption
1 => Footnote
2 => Formula
3 => ListItem
4 => PageFooter
5 => PageHeader
6 => Picture
7 => SectionHeader
8 => Table
9 => Text
10 => Title
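A lookup table is the simplest way to express this mapping in code. The sketch below assumes that class IDs 0 through 10 follow the order of the list above, starting at 0; `CLASS_ID_TO_SEGMENT_TYPE` and `segmentTypeForClass` are hypothetical names, not Chunkr identifiers.

```typescript
// Assumption: IDs 0-10 follow the order in which the types are listed above.
const CLASS_ID_TO_SEGMENT_TYPE = [
  "Caption", "Footnote", "Formula", "ListItem", "PageFooter", "PageHeader",
  "Picture", "SectionHeader", "Table", "Text", "Title",
] as const;

type DetectedSegmentType = (typeof CLASS_ID_TO_SEGMENT_TYPE)[number];

// Map a model class ID to its segment type, rejecting IDs outside 0-10.
function segmentTypeForClass(classId: number): DetectedSegmentType {
  const segmentType = CLASS_ID_TO_SEGMENT_TYPE[classId];
  if (segmentType === undefined) {
    throw new RangeError(`Unknown class ID: ${classId}`);
  }
  return segmentType;
}
```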

3. OCR Assignment

OCR results are assigned to segments based on spatial overlap:

Add padding to segment bounding boxes
Calculate intersection area with each OCR result
Assign OCR result to segment with maximum overlap
Adjust OCR coordinates relative to segment

Segmentation padding is configurable via segmentation_padding in worker config. Default padding ensures OCR results near segment edges are captured.
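The four steps above can be sketched as follows. This is an illustration under stated assumptions, not Chunkr's implementation: the `OcrBox` and `SegmentBox` shapes and all function names are hypothetical, padding is applied equally on every side, ties go to the first segment, and adjusted coordinates are taken relative to the unpadded segment's top-left corner.

```typescript
interface BBox { left: number; top: number; width: number; height: number; }
interface OcrBox { text: string; bbox: BBox; }
interface SegmentBox { segment_id: string; bbox: BBox; }

// Step 1: grow a box by `padding` on every side.
function pad(b: BBox, padding: number): BBox {
  return {
    left: b.left - padding,
    top: b.top - padding,
    width: b.width + 2 * padding,
    height: b.height + 2 * padding,
  };
}

// Step 2: area of the intersection of two boxes (0 if they do not overlap).
function intersectionArea(a: BBox, b: BBox): number {
  const w = Math.min(a.left + a.width, b.left + b.width) - Math.max(a.left, b.left);
  const h = Math.min(a.top + a.height, b.top + b.height) - Math.max(a.top, b.top);
  return w > 0 && h > 0 ? w * h : 0;
}

// Steps 3 and 4: assign each OCR result to the segment with the largest
// overlap, then express its position relative to that segment.
function assignOcr(
  segments: SegmentBox[],
  ocr: OcrBox[],
  padding: number,
): Map<string, OcrBox[]> {
  const assigned = new Map<string, OcrBox[]>();
  for (const seg of segments) assigned.set(seg.segment_id, []);

  for (const result of ocr) {
    let best: SegmentBox | undefined;
    let bestArea = 0;
    for (const seg of segments) {
      const area = intersectionArea(pad(seg.bbox, padding), result.bbox);
      if (area > bestArea) {
        best = seg;
        bestArea = area;
      }
    }
    if (best === undefined) continue; // no overlap with any padded segment

    assigned.get(best.segment_id)!.push({
      text: result.text,
      bbox: {
        ...result.bbox,
        left: result.bbox.left - best.bbox.left,
        top: result.bbox.top - best.bbox.top,
      },
    });
  }
  return assigned;
}
```

With padding, an OCR box that sits just outside a segment's edge can still overlap the padded box and be captured, which is the behavior the default padding is described as ensuring.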

4. Fallback Handling

If no segments are detected:

// Creates a full-page segment
{
  segment_type: "Page",
  bbox: { left: 0, top: 0, width: page_width, height: page_height },
  confidence: 1.0,
  // ... all page OCR results
}
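As a typed sketch of this fallback, the helper below builds the full-page segment only when detection returned nothing. The `PageInfo` and `DetectedSegment` shapes and both function names are hypothetical simplifications; attaching the page's OCR results is left out.

```typescript
interface BBox { left: number; top: number; width: number; height: number; }
interface PageInfo { page_number: number; page_width: number; page_height: number; }
interface DetectedSegment {
  segment_type: string;
  bbox: BBox;
  confidence: number;
  page_number: number;
}

// Build a single segment covering the whole page, mirroring the object above.
function fullPageSegment(page: PageInfo): DetectedSegment {
  return {
    segment_type: "Page",
    bbox: { left: 0, top: 0, width: page.page_width, height: page.page_height },
    confidence: 1.0,
    page_number: page.page_number,
  };
}

// Use detected segments when there are any; otherwise fall back to one
// full-page segment.
function withFallback(detected: DetectedSegment[], page: PageInfo): DetectedSegment[] {
  return detected.length > 0 ? detected : [fullPageSegment(page)];
}
```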

Segmentation Quality

Confidence Scores

Each segment includes a confidence score from the detection model:

> 0.9: High confidence, very reliable
0.7 - 0.9: Good confidence, usually accurate
0.5 - 0.7: Medium confidence, may need review
< 0.5: Low confidence, likely false positive

Filter segments by confidence threshold if you need high-precision extraction:

// confidence is optional on Segment, so treat a missing score as 0
const highConfidenceSegments = segments.filter(s => (s.confidence ?? 0) > 0.8);
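The bands above can also be turned into labels for review workflows. In this sketch `confidenceBand` and `filterByConfidence` are hypothetical helpers; the list above does not say which band the exact boundary values (0.9, 0.7, 0.5) belong to, so their placement here is an assumption.

```typescript
type ConfidenceBand = "high" | "good" | "medium" | "low";

// Bands follow the ranges listed above. Assumption: 0.9 counts as "good",
// 0.7 as "good", and 0.5 as "medium".
function confidenceBand(confidence: number): ConfidenceBand {
  if (confidence > 0.9) return "high";
  if (confidence >= 0.7) return "good";
  if (confidence >= 0.5) return "medium";
  return "low";
}

// Keep segments at or above a threshold; segments without a score are dropped.
function filterByConfidence<T extends { confidence?: number }>(
  segments: T[],
  threshold: number,
): T[] {
  return segments.filter(
    (s) => s.confidence !== undefined && s.confidence >= threshold,
  );
}
```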

Accuracy Factors

Improves accuracy:

✅ High-resolution images (high_resolution: true)
✅ Clear, well-formatted documents
✅ Standard layouts (papers, reports)
✅ Good contrast and quality scans

May reduce accuracy:

❌ Low-resolution or blurry images
❌ Unusual layouts or designs
❌ Heavily stylized documents
❌ Poor scan quality

Batched Processing

Segmentation uses batched processing for efficiency:

// Pages are processed in batches
const batchSize = config.segmentation_batch_size;  // e.g., 10 pages

// Reduces API calls and improves throughput
// (Promise.all waits for every batch; forEach would not await async callbacks)
const results = await Promise.all(
  batches.map((batch) => performSegmentationBatch(batch))
);
const segments = results.flat();

Benefits:

Faster processing for multi-page documents
Efficient resource utilization
Reduced network overhead
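How pages end up in batches is not shown on this page, so the following is only a sketch of one way to do it. `toBatches` and `processInBatches` are hypothetical helpers, and `processBatch` stands in for the real segmentation call. This version runs batches one after another; running them concurrently with `Promise.all` is an alternative when the backend can absorb the load.

```typescript
// Split items into consecutive batches of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Process batches sequentially and flatten the per-batch results.
// `processBatch` is a placeholder for the actual segmentation request.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  processBatch: (batch: T[]) => Promise<R[]>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    results.push(...(await processBatch(batch)));
  }
  return results;
}
```

For example, 25 pages with a batch size of 10 produce three batches of 10, 10, and 5 pages.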

Configuration Examples

Academic Papers

{
  "segmentation_strategy": "LayoutAnalysis",
  "high_resolution": true,
  "segment_processing": {
    "Table": { "strategy": "LLM", "format": "Html" },
    "Formula": { "strategy": "LLM", "format": "Markdown" },
    "Picture": { "crop_image": "All" }
  }
}

Simple Documents

{
  "segmentation_strategy": "Page",
  "high_resolution": false,
  "segment_processing": {
    "Page": { "strategy": "Auto", "format": "Markdown" }
  }
}

Reports with Tables

{
  "segmentation_strategy": "LayoutAnalysis",
  "segment_processing": {
    "Table": { 
      "strategy": "LLM", 
      "format": "Html",
      "crop_image": "Auto"
    },
    "Text": { "strategy": "Auto", "format": "Markdown" }
  },
  "chunk_processing": {
    "ignore_headers_and_footers": true
  }
}

Error Handling

With error_handling: "Continue", segmentation failures fall back gracefully:

// If layout analysis fails
try {
  segments = await layoutAnalysis(page);
} catch (error) {
  console.log('Layout analysis failed, using page segmentation');
  // Falls back to full-page segment
  segments = [createPageSegment(page, ocrResults)];
}

Next Steps

OCR Strategies

Learn about text extraction methods

Segment Processing

Configure content generation per segment type

Chunking

Understand how segments are combined

API Reference

See complete API documentation
