Segment Processing

Overview

Segment processing controls the post-processing, formatting, and generation of content for each segment type detected in your documents. You can configure output formats (HTML or Markdown), generation strategies (Auto or LLM), image cropping, and custom LLM prompts.

Segment Types

Chunkr detects and processes these segment types:

Title - Document titles
SectionHeader - Section and subsection headers
Text - Regular paragraph text
ListItem - List items (bulleted or numbered)
Table - Tables and tabular data
Picture - Images, charts, diagrams
Caption - Image and table captions
Formula - Mathematical formulas and equations
Footnote - Footnotes and references
PageHeader - Page headers
PageFooter - Page footers
Page - Full page content (when using Page segmentation strategy)
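The configuration examples on this page key each segment type by what looks like the snake_case form of its name (for example, SectionHeader becomes section_header). A minimal sketch of that mapping, assuming the pattern holds for every type listed above; the helper name config_key is hypothetical and not part of Chunkr:

```python
import re

# Segment type names as listed in this section.
SEGMENT_TYPES = [
    "Title", "SectionHeader", "Text", "ListItem", "Table", "Picture",
    "Caption", "Formula", "Footnote", "PageHeader", "PageFooter", "Page",
]


def config_key(segment_type: str) -> str:
    """Convert a PascalCase segment type to a snake_case config key.

    Hypothetical helper: it assumes config keys are simply the
    snake_case form of the segment type name.
    """
    # Insert an underscore before every capital letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", segment_type).lower()


# Map every segment type to its assumed config key.
CONFIG_KEYS = {name: config_key(name) for name in SEGMENT_TYPES}
```

For example, config_key("ListItem") returns "list_item", which matches the key used in the complete configuration example.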

Configuration Parameters

Each segment type can be configured with these parameters:

format

Type: enum. Default: "Markdown".

Output format for the segment:

Html - HTML formatted content
Markdown - Markdown formatted content

For Table segments, the default is Html.

strategy

Type: enum. Default: "Auto".

Content generation strategy:

Auto - Use heuristics and rules (fast, no LLM cost)
LLM - Use Chunkr fine-tuned models (higher quality, requires LLM)

For Table, Formula, and Page segments, the default is LLM.

crop_image

Type: enum. Default: "Auto".

Image cropping behavior:

Auto - Crop only when needed for post-processing
All - Always crop to segment bounding box
None - Never crop

Cropped images are stored in the segment’s image field.

llm

Type: string.

Custom prompt for LLM-based generation. Use this to provide specific instructions for how the segment should be processed.

embed_sources

Type: array. Default: ["Content"].

Which content sources to include in the chunk’s embed field:

Content - The primary content (uses format setting)
LLM - LLM-generated content
HTML - (deprecated) HTML content
Markdown - (deprecated) Markdown content

The order determines the sequence in the embed field.

extended_context

Type: boolean. Default: false.

Whether to provide the full page image as context for LLM generation. Useful for segments that need broader context.
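The defaults described above can be collected in one place. The following is only a sketch built from the defaults stated in this section; the function name default_segment_config and the use of snake_case keys as input are assumptions, and the service applies its own defaults regardless of this helper:

```python
from typing import Any, Dict

# Per this section, these segment types default to the LLM strategy.
LLM_BY_DEFAULT = {"table", "formula", "page"}


def default_segment_config(key: str) -> Dict[str, Any]:
    """Return the documented default parameters for one segment type.

    Hypothetical helper; `key` is the snake_case config key, e.g. "table".
    """
    return {
        # Tables default to Html, everything else to Markdown.
        "format": "Html" if key == "table" else "Markdown",
        "strategy": "LLM" if key in LLM_BY_DEFAULT else "Auto",
        "crop_image": "Auto",
        "embed_sources": ["Content"],
        "extended_context": False,
    }
```

Starting from these values and overriding only what differs keeps a custom configuration short and makes the deviations from the defaults easy to see.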

Basic Examples

Default Configuration

{
  "segment_processing": {
    "text": {
      "format": "Markdown",
      "strategy": "Auto"
    }
  }
}

HTML Output

{
  "segment_processing": {
    "text": {
      "format": "Html",
      "strategy": "Auto"
    }
  }
}

LLM-Based Generation

{
  "segment_processing": {
    "text": {
      "format": "Markdown",
      "strategy": "LLM"
    }
  }
}
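Because the enum values are case-sensitive strings ("Html", not "HTML", for format), a client-side check can catch typos before a configuration is submitted. This is a sketch under the assumption that the allowed values are exactly those listed under Configuration Parameters; validate_segment_config is a hypothetical helper, not a Chunkr API:

```python
# Allowed values taken from the Configuration Parameters section.
ALLOWED_VALUES = {
    "format": {"Html", "Markdown"},
    "strategy": {"Auto", "LLM"},
    "crop_image": {"Auto", "All", "None"},  # "None" is the string, not Python's None
}
ALLOWED_EMBED_SOURCES = {"Content", "LLM", "HTML", "Markdown"}


def validate_segment_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for field, allowed in ALLOWED_VALUES.items():
        if field in config and config[field] not in allowed:
            errors.append(
                f"{field}: {config[field]!r} is not one of {sorted(allowed)}"
            )
    for source in config.get("embed_sources", []):
        if source not in ALLOWED_EMBED_SOURCES:
            errors.append(f"embed_sources: unknown source {source!r}")
    if "llm" in config and not isinstance(config["llm"], str):
        errors.append("llm: prompt must be a string")
    if "extended_context" in config and not isinstance(
        config["extended_context"], bool
    ):
        errors.append("extended_context: must be a boolean")
    return errors
```

Calling validate_segment_config({"format": "Markdown", "strategy": "LLM"}) returns an empty list, while a misspelled value such as {"format": "HTML"} yields one error message.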

Segment-Specific Configuration

Tables

Tables default to HTML format with LLM generation for best quality:

{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "Auto"
    }
  }
}

Options:

Html format preserves table structure better
Markdown format for simpler tables
Auto strategy for basic tables (faster, no LLM cost)
LLM strategy for complex tables with merged cells, etc.

Pictures

Pictures have special cropping options:

{
  "segment_processing": {
    "picture": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "All"
    }
  }
}

Cropping strategies:

All - Always crop images (recommended for pictures)
Auto - Crop only when needed
None - Use full page image

Output:

Auto strategy generates image markdown: ![Image](url)
LLM strategy can generate descriptions or analyze image content

Formulas

Formulas default to LLM generation for accurate LaTeX:

{
  "segment_processing": {
    "formula": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto"
    }
  }
}

LLM strategy converts formulas to LaTeX notation for rendering.

Headers & Footers

Control whether headers and footers are included:

{
  "chunk_processing": {
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "page_header": {
      "format": "Markdown",
      "strategy": "Auto"
    },
    "page_footer": {
      "format": "Markdown",
      "strategy": "Auto"
    }
  }
}

Even with these configurations, headers/footers are excluded from chunks when ignore_headers_and_footers: true.
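The effect of ignore_headers_and_footers can be pictured as a filter applied before chunks are assembled. The sketch below only illustrates that described behavior and is not Chunkr's implementation; it assumes segments are dictionaries with a segment_type field, as in the output example on this page, and the function name is hypothetical:

```python
HEADER_FOOTER_TYPES = {"PageHeader", "PageFooter"}


def segments_for_chunking(segments: list, ignore_headers_and_footers: bool) -> list:
    """Drop page headers and footers when the flag is set.

    The segments are still processed according to segment_processing;
    they are only left out of the chunks.
    """
    if not ignore_headers_and_footers:
        return list(segments)
    return [s for s in segments if s["segment_type"] not in HEADER_FOOTER_TYPES]


segments = [
    {"segment_type": "PageHeader", "content": "ACME Annual Report"},
    {"segment_type": "Text", "content": "Revenue grew in Q3."},
    {"segment_type": "PageFooter", "content": "Page 4"},
]
kept = segments_for_chunking(segments, ignore_headers_and_footers=True)
```

With the flag set to true, only the Text segment remains in kept; with it set to false, all three segments are retained.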

Advanced Features

Custom LLM Prompts

Provide specific instructions for how segments should be processed:

{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "llm": "Convert this table to HTML. Merge cells where appropriate and preserve all numerical precision."
    },
    "picture": {
      "format": "Markdown",
      "strategy": "LLM",
      "llm": "Describe this image in detail, focusing on any charts, graphs, or data visualizations. Include all visible data points and trends."
    }
  }
}

Embedding Configuration

Control what content is included in chunk embeddings:

Content Only

{
  "segment_processing": {
    "text": {
      "embed_sources": ["Content"]
    }
  }
}

Use the primary content based on the format setting.

LLM Only

{
  "segment_processing": {
    "text": {
      "strategy": "LLM",
      "embed_sources": ["LLM"]
    }
  }
}

Use only LLM-generated content for embeddings.

Combined Sources

{
  "segment_processing": {
    "text": {
      "strategy": "LLM",
      "embed_sources": ["Content", "LLM"]
    }
  }
}

Combine primary content and LLM output. Order matters - content appears before LLM in the embed field.
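To make the ordering rule concrete, here is a rough sketch of how an embed string could be assembled from a segment. It is not Chunkr's implementation: the mapping from source names to segment fields, the newline separator, and the skipping of empty fields are all assumptions made for illustration, and build_embed is a hypothetical name:

```python
# Assumed mapping from embed_sources entries to segment output fields.
SOURCE_FIELDS = {
    "Content": "content",
    "LLM": "llm",
    "HTML": "html",          # deprecated source
    "Markdown": "markdown",  # deprecated source
}


def build_embed(segment: dict, embed_sources: list) -> str:
    """Concatenate the selected sources in the order they are listed."""
    parts = []
    for source in embed_sources:
        value = segment.get(SOURCE_FIELDS[source])
        if value:  # skip missing or empty fields (assumption)
            parts.append(value)
    return "\n".join(parts)


segment = {"content": "| Q3 | 4.2 |", "llm": "Revenue table for Q3."}
content_first = build_embed(segment, ["Content", "LLM"])
llm_first = build_embed(segment, ["LLM", "Content"])
```

Swapping the two entries in embed_sources swaps the two parts of the resulting string, which is the sense in which order matters.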

Extended Context

Provide full page context for better LLM understanding:

{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "extended_context": true
    }
  }
}

Use cases:

Tables that reference surrounding content
Formulas with context-dependent notation
Images that need page layout understanding

Trade-off: Increases LLM token usage and processing time.

Complete Configuration Example

{
  "segment_processing": {
    "title": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "section_header": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "text": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "list_item": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"],
      "llm": "Convert to clean HTML table with proper structure"
    },
    "picture": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "All",
      "embed_sources": ["Content"]
    },
    "caption": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "formula": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "footnote": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page_header": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page_footer": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    }
  }
}

Output Fields

Each processed segment includes these fields:

{
  "segment_id": "uuid",
  "segment_type": "Table",
  "bbox": { "left": 0, "top": 0, "width": 100, "height": 50 },
  "page_number": 1,
  "content": "<table>...</table>",  // Based on 'format' setting
  "text": "Raw OCR text",
  "html": "<table>...</table>",      // Deprecated
  "markdown": "| Header |\n| --- |", // Deprecated
  "llm": "LLM-generated content",    // If LLM strategy or custom prompt used
  "image": "https://presigned-url",  // If crop_image enabled
  "ocr": [...],
  "confidence": 0.95
}

The content field contains the formatted output based on your format setting. The deprecated html and markdown fields are still available for backwards compatibility.
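A consumer that has to handle both current and older responses might read content first and fall back to the deprecated fields. The following is a sketch under that assumption; the fallback order and the function name segment_output are illustrative choices, not documented behavior:

```python
def segment_output(segment: dict, preferred_format: str = "Markdown") -> str:
    """Return the formatted output of a segment.

    Prefers the `content` field, then the deprecated field matching
    `preferred_format`, then the raw OCR `text`.
    """
    content = segment.get("content")
    if content:
        return content
    legacy_field = "html" if preferred_format == "Html" else "markdown"
    return segment.get(legacy_field) or segment.get("text", "")


current = {"content": "<table>...</table>", "text": "Raw OCR text"}
legacy = {"markdown": "| Header |\n| --- |", "text": "Raw OCR text"}
```

Here segment_output(current) returns the content value, and segment_output(legacy) falls back to the deprecated markdown field.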

Best Practices

Use Auto strategy for simple segments
- Faster processing
- No LLM costs
- Good for text, headers, lists

Use LLM strategy for complex segments
- Tables with complex structure
- Mathematical formulas
- Images requiring description

Match format to your use case
- Html for tables and structured content
- Markdown for general text and readability

Configure embed_sources carefully
- Include only necessary sources
- Reduces token usage for embeddings
- Improves retrieval relevance

Use extended_context sparingly
- Higher LLM costs
- Longer processing time
- Only when context is critical

Test custom prompts
- Start with default prompts
- Iterate based on output quality
- Be specific in instructions

Deprecated Fields

The html and markdown fields in configuration are deprecated. Use format and strategy instead.

Old (Deprecated):

{
  "table": {
    "html": "LLM",
    "markdown": "Auto"
  }
}

New (Recommended):

{
  "table": {
    "format": "Html",
    "strategy": "LLM"
  }
}
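Existing configurations can be rewritten mechanically once a target format is chosen. The sketch below assumes that the strategy for the new configuration should be taken from the deprecated field matching the chosen format, which is consistent with the example above (html: "LLM" becomes format "Html", strategy "LLM"); migrate_segment_config is a hypothetical helper, not part of Chunkr:

```python
def migrate_segment_config(old: dict, target_format: str) -> dict:
    """Rewrite a deprecated html/markdown config as format/strategy.

    `target_format` must be "Html" or "Markdown". The strategy is read
    from the deprecated field that corresponds to that format.
    """
    legacy_key = {"Html": "html", "Markdown": "markdown"}[target_format]
    # Keep every setting except the two deprecated fields.
    new = {k: v for k, v in old.items() if k not in ("html", "markdown")}
    new["format"] = target_format
    if legacy_key in old:
        new["strategy"] = old[legacy_key]
    return new


old_table = {"html": "LLM", "markdown": "Auto"}
new_table = migrate_segment_config(old_table, "Html")
```

With the old table configuration shown above, new_table is {"format": "Html", "strategy": "LLM"}; choosing "Markdown" as the target would carry over the Auto strategy instead.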
