Overview

Segment processing controls the post-processing, formatting, and generation of content for each segment type detected in your documents. You can configure output formats (HTML or Markdown), generation strategies (Auto or LLM), image cropping, and custom LLM prompts.

Segment Types

Chunkr detects and processes these segment types:
  • Title - Document titles
  • SectionHeader - Section and subsection headers
  • Text - Regular paragraph text
  • ListItem - List items (bulleted or numbered)
  • Table - Tables and tabular data
  • Picture - Images, charts, diagrams
  • Caption - Image and table captions
  • Formula - Mathematical formulas and equations
  • Footnote - Footnotes and references
  • PageHeader - Page headers
  • PageFooter - Page footers
  • Page - Full page content (when using Page segmentation strategy)
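The configuration examples later on this page key each segment type by a snake_case name (for example `section_header` for SectionHeader). A minimal sketch of that naming relationship follows; `to_config_key` is a hypothetical helper for illustration, not part of any Chunkr SDK.

```python
import re

# Segment type names as listed above.
SEGMENT_TYPES = [
    "Title", "SectionHeader", "Text", "ListItem", "Table", "Picture",
    "Caption", "Formula", "Footnote", "PageHeader", "PageFooter", "Page",
]


def to_config_key(segment_type: str) -> str:
    """Convert a PascalCase segment type to a snake_case config key.

    Hypothetical helper: it inserts "_" before every capital letter
    except the first, then lowercases the result.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", segment_type).lower()


# Lookup table from segment type name to segment_processing key.
CONFIG_KEYS = {name: to_config_key(name) for name in SEGMENT_TYPES}
```

With this table, `CONFIG_KEYS["ListItem"]` gives the key used under `segment_processing` for list items.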

Configuration Parameters

Each segment type can be configured with these parameters:
format (enum, default: "Markdown")
Output format for the segment:
  • Html - HTML-formatted content
  • Markdown - Markdown-formatted content
For Table segments, the default is Html.

strategy (enum, default: "Auto")
Content generation strategy:
  • Auto - Use heuristics and rules (fast, no LLM cost)
  • LLM - Use Chunkr fine-tuned models (higher quality, incurs LLM cost)
For Table, Formula, and Page segments, the default is LLM.

crop_image (enum, default: "Auto")
Image cropping behavior:
  • Auto - Crop only when needed for post-processing
  • All - Always crop to the segment bounding box
  • None - Never crop
Cropped images are stored in the segment’s image field.

llm (string)
Custom prompt for LLM-based generation. Use this to give specific instructions for how the segment should be processed.

embed_sources (array, default: ["Content"])
Which content sources to include in the chunk’s embed field:
  • Content - The primary content (uses the format setting)
  • LLM - LLM-generated content
  • HTML - (deprecated) HTML content
  • Markdown - (deprecated) Markdown content
The order of the array determines the order of the sources in the embed field.

extended_context (boolean, default: false)
Whether to provide the full page image as context for LLM generation. Useful for segments that need broader context.
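The defaults described above can be collected in one place. The sketch below builds the default options for a given config key using only the defaults stated in this section; `default_segment_config` is a hypothetical helper name, and the snake_case keys are taken from the configuration examples on this page.

```python
# Config keys whose strategy defaults to LLM (Table, Formula, Page).
LLM_BY_DEFAULT = {"table", "formula", "page"}


def default_segment_config(config_key: str) -> dict:
    """Return the documented default options for one segment type.

    Hypothetical helper: it only restates the defaults listed above
    and does not call any Chunkr API.
    """
    return {
        # Tables default to Html; every other type defaults to Markdown.
        "format": "Html" if config_key == "table" else "Markdown",
        "strategy": "LLM" if config_key in LLM_BY_DEFAULT else "Auto",
        "crop_image": "Auto",
        "embed_sources": ["Content"],
        "extended_context": False,
    }


table_defaults = default_segment_config("table")
text_defaults = default_segment_config("text")
```

Starting from these defaults and overriding single fields keeps a configuration explicit about what differs from the documented behavior.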

Basic Examples

Default Configuration

{
  "segment_processing": {
    "text": {
      "format": "Markdown",
      "strategy": "Auto"
    }
  }
}

HTML Output

{
  "segment_processing": {
    "text": {
      "format": "Html",
      "strategy": "Auto"
    }
  }
}

LLM-Based Generation

{
  "segment_processing": {
    "text": {
      "format": "Markdown",
      "strategy": "LLM"
    }
  }
}

Segment-Specific Configuration

Tables default to HTML format with LLM generation for best quality:
{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "Auto"
    }
  }
}
Options:
  • Html format preserves table structure better
  • Markdown format suits simpler tables
  • Auto strategy suits basic tables (faster, no LLM cost)
  • LLM strategy handles complex tables, such as those with merged cells
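One way to apply these options is to choose the table settings from what you already know about your documents. The sketch below is illustrative only: `table_config` and its `has_merged_cells` flag are hypothetical, and deciding whether a table is complex is left to the caller.

```python
def table_config(has_merged_cells: bool) -> dict:
    """Build a segment_processing block for tables.

    Hypothetical helper: complex tables get Html + LLM, simple
    tables get Markdown + Auto, following the options above.
    """
    if has_merged_cells:
        options = {"format": "Html", "strategy": "LLM"}
    else:
        options = {"format": "Markdown", "strategy": "Auto"}
    return {"segment_processing": {"table": options}}


simple = table_config(has_merged_cells=False)
complex_tables = table_config(has_merged_cells=True)
```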

Advanced Features

Custom LLM Prompts

Provide specific instructions for how segments should be processed:
{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "llm": "Convert this table to HTML. Merge cells where appropriate and preserve all numerical precision."
    },
    "picture": {
      "format": "Markdown",
      "strategy": "LLM",
      "llm": "Describe this image in detail, focusing on any charts, graphs, or data visualizations. Include all visible data points and trends."
    }
  }
}

Embedding Configuration

Control what content is included in chunk embeddings:
{
  "segment_processing": {
    "text": {
      "embed_sources": ["Content"]
    }
  }
}
This configuration embeds only the primary content, rendered according to the format setting.
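Because the order of `embed_sources` determines the order of the sources in the embed field, it can help to preview the effect of a given ordering on your side. The sketch below is an assumption-heavy illustration: the mapping from source names to segment fields and the newline separator are guesses, and how Chunkr actually joins the sources is not specified on this page.

```python
# Assumed mapping from embed source name to output field name.
SOURCE_TO_FIELD = {
    "Content": "content",
    "LLM": "llm",
    "HTML": "html",          # deprecated source
    "Markdown": "markdown",  # deprecated source
}


def preview_embed(segment: dict, embed_sources: list[str], sep: str = "\n") -> str:
    """Join the selected sources in the order given (hypothetical preview).

    Sources whose field is missing or empty are skipped.
    """
    parts = []
    for source in embed_sources:
        value = segment.get(SOURCE_TO_FIELD[source])
        if value:
            parts.append(value)
    return sep.join(parts)


segment = {"content": "| Q1 | 10 |", "llm": "Revenue by quarter."}
content_first = preview_embed(segment, ["Content", "LLM"])
llm_first = preview_embed(segment, ["LLM", "Content"])
```

Swapping the two entries only changes which text comes first in the preview; the same sources are included either way.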

Extended Context

Provide full page context for better LLM understanding:
{
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "extended_context": true
    }
  }
}
Use cases:
  • Tables that reference surrounding content
  • Formulas with context-dependent notation
  • Images that need page layout understanding
Trade-off: Increases LLM token usage and processing time.
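Given that trade-off, it may be preferable to switch on `extended_context` for a few segment types rather than everywhere. A minimal sketch, assuming the configuration is held as a plain dictionary; `enable_extended_context` is a hypothetical helper.

```python
import copy


def enable_extended_context(segment_processing: dict, keys: list[str]) -> dict:
    """Return a copy of the config with extended_context set for the given keys.

    Hypothetical helper: other segment types are left untouched, and
    the input dictionary is not modified.
    """
    updated = copy.deepcopy(segment_processing)
    for key in keys:
        updated.setdefault(key, {})["extended_context"] = True
    return updated


base = {
    "table": {"format": "Html", "strategy": "LLM"},
    "text": {"format": "Markdown", "strategy": "Auto"},
}
config = enable_extended_context(base, ["table", "formula"])
```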

Complete Configuration Example

{
  "segment_processing": {
    "title": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "section_header": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "text": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "list_item": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"],
      "llm": "Convert to clean HTML table with proper structure"
    },
    "picture": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "All",
      "embed_sources": ["Content"]
    },
    "caption": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "formula": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "footnote": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page_header": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page_footer": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "page": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    }
  }
}
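A configuration of this size is easy to mistype, for example "HTML" where the format value is "Html". The sketch below checks a `segment_processing` dictionary against the enum values listed under Configuration Parameters; `find_config_errors` is a hypothetical client-side helper, not a Chunkr API.

```python
# Allowed values, as listed under Configuration Parameters.
ALLOWED = {
    "format": {"Html", "Markdown"},
    "strategy": {"Auto", "LLM"},
    "crop_image": {"Auto", "All", "None"},
}
ALLOWED_EMBED_SOURCES = {"Content", "LLM", "HTML", "Markdown"}


def find_config_errors(segment_processing: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none are found)."""
    errors = []
    for key, options in segment_processing.items():
        for field, allowed in ALLOWED.items():
            if field in options and options[field] not in allowed:
                errors.append(f"{key}.{field}: unexpected value {options[field]!r}")
        for source in options.get("embed_sources", []):
            if source not in ALLOWED_EMBED_SOURCES:
                errors.append(f"{key}.embed_sources: unexpected value {source!r}")
    return errors


ok = find_config_errors({"table": {"format": "Html", "strategy": "LLM"}})
bad = find_config_errors({"text": {"format": "HTML", "embed_sources": ["Text"]}})
```

Here `ok` is empty, while `bad` reports both the mis-cased format and the unknown embed source.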

Output Fields

Each processed segment includes these fields:
{
  "segment_id": "uuid",
  "segment_type": "Table",
  "bbox": { "left": 0, "top": 0, "width": 100, "height": 50 },
  "page_number": 1,
  "content": "<table>...</table>",  // Based on 'format' setting
  "text": "Raw OCR text",
  "html": "<table>...</table>",      // Deprecated
  "markdown": "| Header |\n| --- |", // Deprecated
  "llm": "LLM-generated content",    // If LLM strategy or custom prompt used
  "image": "https://presigned-url",  // If crop_image enabled
  "ocr": [...],
  "confidence": 0.95
}
The content field contains the formatted output based on your format setting. The deprecated html and markdown fields are still available for backwards compatibility.
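When reading results, a client can prefer the content field and fall back to the deprecated fields only if content is missing. The sketch below assumes segments are available as dictionaries shaped like the example above; `segment_output` is a hypothetical helper, and falling back to the raw text field as a last resort is a design choice of this sketch.

```python
def segment_output(segment: dict, fmt: str = "Markdown") -> str:
    """Pick the formatted output for a segment.

    Order of preference: content, then the deprecated field that
    matches fmt ("Html" or "Markdown"), then the raw text.
    """
    content = segment.get("content")
    if content:
        return content
    legacy_field = "html" if fmt == "Html" else "markdown"
    return segment.get(legacy_field) or segment.get("text", "")


current = {"content": "<table>...</table>", "html": "<table>old</table>"}
legacy = {"markdown": "| Header |\n| --- |", "text": "Header"}
```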

Best Practices

  1. Use Auto strategy for simple segments
    • Faster processing
    • No LLM costs
    • Good for text, headers, lists
  2. Use LLM strategy for complex segments
    • Tables with complex structure
    • Mathematical formulas
    • Images requiring description
  3. Match format to your use case
    • Html for tables and structured content
    • Markdown for general text and readability
  4. Configure embed_sources carefully
    • Include only necessary sources
    • Reduces token usage for embeddings
    • Improves retrieval relevance
  5. Use extended_context sparingly
    • Higher LLM costs
    • Longer processing time
    • Only when context is critical
  6. Test custom prompts
    • Start with default prompts
    • Iterate based on output quality
    • Be specific in instructions

Deprecated Fields

The html and markdown fields in configuration are deprecated. Use format and strategy instead.
Old (Deprecated):
{
  "table": {
    "html": "LLM",
    "markdown": "Auto"
  }
}
New (Recommended):
{
  "table": {
    "format": "Html",
    "strategy": "LLM"
  }
}
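Migrating an existing configuration can be sketched as follows. The page does not spell out a conversion rule, so this example assumes the caller chooses the target format and that the strategy is taken from the deprecated field of the same name; `migrate_segment_config` is a hypothetical helper.

```python
def migrate_segment_config(old: dict, fmt: str) -> dict:
    """Convert deprecated html/markdown options to format/strategy.

    Assumption: the new strategy is the value of the old field that
    matches the chosen format ("Html" -> html, "Markdown" -> markdown),
    defaulting to "Auto" if that field is absent.
    """
    legacy_field = {"Html": "html", "Markdown": "markdown"}[fmt]
    # Keep every option that is not one of the deprecated fields.
    new = {k: v for k, v in old.items() if k not in ("html", "markdown")}
    new["format"] = fmt
    new["strategy"] = old.get(legacy_field, "Auto")
    return new


old_table = {"html": "LLM", "markdown": "Auto"}
new_table = migrate_segment_config(old_table, "Html")
```

Applied to the deprecated table example above with `"Html"` as the target, this yields the recommended `format` and `strategy` pair.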