
Chunkr’s pipeline orchestrates the entire document processing workflow, from upload to final output. Understanding the pipeline helps you optimize processing for your use case.

Pipeline Architecture

The Chunkr pipeline consists of a fixed sequence of steps, executed in order.

Pipeline Steps

Each step in the pipeline is represented by the PipelineStep enum:

Available Steps

pub enum PipelineStep {
    ConvertToImages,
    ChunkrAnalysis,
    Crop,
    SegmentProcessing,
    Chunking,
}
ConvertToImages

Converts PDF pages into high-quality JPEG images for processing.
When it runs: After task initialization
What it does:
  • Converts each PDF page to a JPEG image
  • Applies a scaling factor based on the high_resolution setting
  • Stores images in memory for subsequent steps
Configuration:
  • high_resolution: true → Higher quality, slower (~7s per page)
  • high_resolution: false → Standard quality, faster
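The resolution trade-off can be sketched as choosing a render scale and applying it to the page dimensions. The scale factors below are illustrative assumptions, not Chunkr's internal constants.

```rust
/// Illustrative sketch: choose a render scale from the
/// `high_resolution` flag and apply it to a page's point size
/// to get pixel dimensions. The factor values are assumed for
/// illustration only.
fn render_scale(high_resolution: bool) -> f64 {
    if high_resolution { 2.0 } else { 1.0 }
}

fn scaled_dims(width_pts: f64, height_pts: f64, high_resolution: bool) -> (u32, u32) {
    let s = render_scale(high_resolution);
    ((width_pts * s).round() as u32, (height_pts * s).round() as u32)
}

fn main() {
    // A US Letter page (612 x 792 pt) rendered at high resolution.
    let (w, h) = scaled_dims(612.0, 792.0, true);
    println!("{}x{}", w, h); // 1224x1584
}
```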
ChunkrAnalysis

Performs OCR and layout analysis on document pages.
When it runs: After image conversion
What it does:
  • Extracts text using OCR based on ocr_strategy
  • Detects layout elements based on segmentation_strategy
  • Creates initial segments with bounding boxes and OCR results
  • Scales coordinates according to resolution settings
Output: Array of segments with type classifications and OCR data
Crop

Crops segment images from page images for visual elements.
When it runs: After analysis, before segment processing
What it does:
  • Extracts image regions for each segment based on bounding boxes
  • Stores cropped images for pictures, tables, and other visual elements
  • Applies padding configured in segmentation_padding
Configuration: Controlled per segment type via the crop_image setting
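A minimal sketch of the padded-crop computation: expand the segment's bounding box by the configured padding and clamp it to the page image bounds. The type and field names here are assumptions for illustration, not Chunkr's actual types.

```rust
/// Illustrative sketch: expand a segment bounding box by
/// `padding` pixels on each side, clamped to the page image.
/// Field names are assumptions, not Chunkr's actual types.
#[derive(Debug, PartialEq)]
struct BBox { left: u32, top: u32, width: u32, height: u32 }

fn padded_crop(b: &BBox, padding: u32, page_w: u32, page_h: u32) -> BBox {
    // saturating_sub keeps the box from running off the top-left edge.
    let left = b.left.saturating_sub(padding);
    let top = b.top.saturating_sub(padding);
    let right = (b.left + b.width + padding).min(page_w);
    let bottom = (b.top + b.height + padding).min(page_h);
    BBox { left, top, width: right - left, height: bottom - top }
}

fn main() {
    let seg = BBox { left: 5, top: 10, width: 100, height: 50 };
    // Padding is clipped at the left edge (5 - 8 saturates to 0).
    let crop = padded_crop(&seg, 8, 1224, 1584);
    println!("{:?}", crop);
}
```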
SegmentProcessing

Post-processes segments to generate structured content.
When it runs: After cropping
What it does:
  • Generates HTML/Markdown from segments using the configured strategy
  • Applies LLM models to complex elements (tables, formulas)
  • Creates the content, html, markdown, and optionally llm fields
  • Processes each segment type according to its configuration
See: Segment Processing Configuration
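The per-type defaults described in this doc (LLM generation for tables and formulas, heuristics elsewhere; HTML output for tables, Markdown otherwise) can be sketched as a simple match. This is an illustration of the documented defaults, not Chunkr's source.

```rust
#[derive(Debug, PartialEq)]
enum GenerationStrategy { Auto, Llm }

#[derive(Debug, PartialEq)]
enum SegmentFormat { Markdown, Html }

/// Illustrative sketch of the documented defaults: Table and
/// Formula default to LLM generation, Table defaults to HTML
/// output, everything else to heuristic Markdown.
fn default_processing(segment_type: &str) -> (GenerationStrategy, SegmentFormat) {
    match segment_type {
        "Table" => (GenerationStrategy::Llm, SegmentFormat::Html),
        "Formula" => (GenerationStrategy::Llm, SegmentFormat::Markdown),
        _ => (GenerationStrategy::Auto, SegmentFormat::Markdown),
    }
}

fn main() {
    println!("{:?}", default_processing("Table"));
    println!("{:?}", default_processing("Text"));
}
```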
Chunking

Combines segments into semantically meaningful chunks.
When it runs: Final step
What it does:
  • Groups segments based on hierarchy and target length
  • Generates embed text for each chunk
  • Calculates chunk lengths using the specified tokenizer
  • Optionally filters out headers and footers
See: Chunking
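The length-based part of this grouping can be sketched as a greedy pass that accumulates segments until adding the next one would exceed target_length, assuming the Word tokenizer. Hierarchy-aware grouping is omitted here; this is a sketch, not Chunkr's chunker.

```rust
/// Illustrative sketch: greedily group segment texts into chunks
/// whose word counts stay at or under `target_length`. Chunkr's
/// real chunker also respects heading hierarchy; that part is
/// omitted here.
fn chunk_segments(segments: &[&str], target_length: usize) -> Vec<Vec<String>> {
    let mut chunks: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_len = 0;
    for seg in segments {
        let len = seg.split_whitespace().count();
        if !current.is_empty() && current_len + len > target_length {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        current.push(seg.to_string());
        current_len += len;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let segs = ["alpha beta", "gamma delta epsilon", "zeta"];
    // With target_length = 4, the second segment starts a new chunk.
    let chunks = chunk_segments(&segs, 4);
    println!("{} chunks", chunks.len()); // 2 chunks
}
```

Note that target_length = 0 naturally degenerates to one segment per chunk in this sketch, matching the documented behavior of that setting.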

Pipeline Configuration

The pipeline behavior is controlled through the Configuration object:

Core Settings

{
  "segmentation_strategy": "LayoutAnalysis",
  "ocr_strategy": "Auto",
  "high_resolution": true,
  "error_handling": "Fail",
  "expires_in": 3600
}
segmentation_strategy (SegmentationStrategy, default: "LayoutAnalysis")
Controls the layout analysis approach:
  • LayoutAnalysis: Detect layout elements (tables, images, etc.)
  • Page: Treat each page as a single segment
See Segmentation for details.
ocr_strategy (OcrStrategy, default: "All")
Controls OCR behavior:
  • All: Process all pages with OCR (~0.5s per page)
  • Auto: Use the existing text layer when available
See OCR for details.
high_resolution (boolean, default: true)
Use high-resolution images for processing.
  • true: Better accuracy, ~7s latency per page
  • false: Faster processing, standard quality
error_handling (ErrorHandlingStrategy, default: "Fail")
How to handle processing errors:
  • Fail: Stop processing on any error
  • Continue: Attempt to continue despite non-critical errors
expires_in (number, default: null)
Seconds until the task is automatically deleted. Expired tasks cannot be accessed or updated.
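The defaults above can be mirrored in a config type, as in this sketch. Field and type names are modeled on the docs, not taken from Chunkr's source; note that the JSON example earlier overrides ocr_strategy to "Auto", while the documented default is "All".

```rust
/// Illustrative sketch mirroring the documented defaults of the
/// Configuration object. Names follow the docs; this is not
/// Chunkr's actual source.
#[derive(Debug)]
struct Configuration {
    segmentation_strategy: String, // "LayoutAnalysis" | "Page"
    ocr_strategy: String,          // "All" | "Auto"
    high_resolution: bool,
    error_handling: String,        // "Fail" | "Continue"
    expires_in: Option<u64>,       // seconds; None = never expires
}

impl Default for Configuration {
    fn default() -> Self {
        Configuration {
            segmentation_strategy: "LayoutAnalysis".into(),
            ocr_strategy: "All".into(),
            high_resolution: true,
            error_handling: "Fail".into(),
            expires_in: None,
        }
    }
}

fn main() {
    let cfg = Configuration::default();
    println!("{:?}", cfg);
}
```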

Segment Processing Configuration

Each segment type can be configured independently:
{
  "segment_processing": {
    "Table": {
      "strategy": "LLM",
      "format": "Html",
      "crop_image": "Auto",
      "llm": "Optional custom prompt",
      "embed_sources": ["Content"],
      "extended_context": false
    },
    "Text": {
      "strategy": "Auto",
      "format": "Markdown",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "Picture": {
      "strategy": "Auto",
      "format": "Markdown",
      "crop_image": "All",
      "embed_sources": ["Content"]
    }
  }
}
Available segment types:
  • Title
  • SectionHeader
  • Text
  • ListItem
  • Table
  • Picture
  • Caption
  • Formula
  • Footnote
  • PageHeader
  • PageFooter
  • Page
Configuration options per segment type:
strategy (GenerationStrategy)
How content is generated:
  • Auto: Heuristic-based conversion (default for most types)
  • LLM: Use fine-tuned Chunkr models (default for Table and Formula)
format (SegmentFormat)
Output format:
  • Markdown: Markdown representation (default for most types)
  • Html: HTML representation (default for Table)
crop_image (CroppingStrategy)
When to crop segment images:
  • Auto: Crop only when needed for processing
  • All: Always crop (default for Picture)
llm (string)
Optional custom prompt for LLM processing. Use this to generate a custom llm field in the segment output using off-the-shelf models.
embed_sources (EmbedSource[])
Which content sources to include in the chunk embed field:
  • Content: The main content field (HTML or Markdown, depending on format)
  • LLM: LLM-generated custom content
  • HTML: (Deprecated) HTML representation
  • Markdown: (Deprecated) Markdown representation
Order matters: sources are concatenated in array order.
extended_context (boolean, default: false)
Use the full page image as context for LLM generation. Provides more context but increases processing time.
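Because embed_sources order matters, the embed text assembly can be sketched as an in-order concatenation. Function and field names here are illustrative assumptions.

```rust
/// Illustrative sketch: build a segment's embed text by
/// concatenating the requested sources in array order.
/// Names are assumptions for illustration.
fn build_embed_text(content: &str, llm: Option<&str>, embed_sources: &[&str]) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for source in embed_sources {
        match *source {
            "Content" => parts.push(content),
            "LLM" => {
                if let Some(text) = llm {
                    parts.push(text);
                }
            }
            _ => {} // Deprecated sources omitted in this sketch.
        }
    }
    parts.join("\n")
}

fn main() {
    // ["LLM", "Content"] puts the LLM summary before the table content.
    let embed = build_embed_text("| a | b |", Some("Table of a and b"), &["LLM", "Content"]);
    println!("{}", embed);
}
```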

Chunk Processing Configuration

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
target_length (number, default: 512)
Target number of tokens/words per chunk. Set to 0 to keep one segment per chunk.
tokenizer (TokenizerType, default: "Word")
Tokenizer used to count chunk length:
  • Word: Simple whitespace tokenization
  • Cl100kBase: OpenAI tokenizer (GPT-3.5, GPT-4)
  • xlm-roberta-base: Multilingual RoBERTa
  • bert-base-uncased: BERT base
  • Any Hugging Face tokenizer (e.g., "Qwen/Qwen-tokenizer")
ignore_headers_and_footers (boolean, default: true)
Whether to exclude page headers and footers from chunks. Recommended, as headers and footers break reading order across pages.
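The default Word tokenizer reduces to whitespace splitting, as in this sketch. The model-based tokenizers (Cl100kBase, Hugging Face models) require their vocabularies and are not reproduced here.

```rust
/// Illustrative sketch of the `Word` tokenizer: chunk length is
/// simply the whitespace-separated token count.
fn word_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    // split_whitespace collapses runs of spaces and newlines.
    println!("{}", word_token_count("Chunkr splits  on\nany whitespace")); // 5
}
```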

Pipeline State

The pipeline maintains state throughout processing:
pub struct Pipeline {
    pub input_file: Option<Arc<NamedTempFile>>,
    pub pdf_file: Option<Arc<NamedTempFile>>,
    pub page_images: Option<Vec<Arc<NamedTempFile>>>,
    pub segment_images: DashMap<String, Arc<NamedTempFile>>,
    pub chunks: Vec<Chunk>,
    pub task: Option<Task>,
    pub task_payload: Option<TaskPayload>,
}
The pipeline:
  1. Downloads the input file from storage
  2. Converts to PDF if needed
  3. Generates page images
  4. Performs OCR and segmentation
  5. Crops segment images
  6. Processes segments
  7. Creates chunks
  8. Uploads artifacts to storage

Error Handling

The pipeline includes retry logic and error handling:
{
  "error_handling": "Continue"
}
Strategies:
Fail: Stop processing on any error
  • The task fails immediately when an error occurs
  • Provides the fastest feedback on issues
  • Recommended for production when quality is critical
Use when you need guaranteed quality and can handle failures.
Continue: Attempt to continue despite non-critical errors
  • Best-effort processing: non-critical errors are tolerated so the task can finish
Use when partial results are better than a failed task.
Each pipeline step has configurable retry logic with exponential backoff (default: 3 retries).
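The retry behavior can be sketched as below. The 3-retry default comes from the text; the backoff base delay and the function shape are illustrative assumptions, not Chunkr's implementation.

```rust
use std::time::Duration;

/// Illustrative sketch: retry a fallible step up to `max_retries`
/// extra times, doubling the backoff delay on each failure. The
/// base delay is an assumed value for illustration.
fn run_with_retries<F>(mut step: F, max_retries: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = Duration::from_millis(1);
    let mut last_err = String::new();
    for attempt in 0..=max_retries {
        match step() {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                if attempt < max_retries {
                    std::thread::sleep(delay); // exponential backoff
                    delay *= 2;
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    let mut calls = 0;
    let result = run_with_retries(
        || {
            calls += 1;
            if calls < 3 { Err("transient".into()) } else { Ok(()) }
        },
        3,
    );
    println!("succeeded after {} calls: {}", calls, result.is_ok());
}
```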

Monitoring Pipeline Progress

Track pipeline progress through task status:
{
  "status": "Processing",
  "message": "Running Chunkr analysis",
  "started_at": "2024-03-02T10:30:00Z"
}
Pipeline messages:
  • "Task initialized" → Ready to start
  • "Converting pages to images" → Step 1
  • "Running Chunkr analysis" → Step 2 (OCR + Segmentation)
  • "Cropping segments" → Step 3
  • "Processing segments" → Step 4 (LLM processing)
  • "Chunking" → Step 5
  • "Finishing up" → Uploading artifacts

Performance Considerations

Processing time scales with:
  • Number of pages: Linear scaling
  • High resolution: ~7s per page additional latency
  • OCR strategy: All adds ~0.5s per page
  • LLM processing: Depends on number of tables/formulas
Optimization tips:
  1. Use ocr_strategy: "Auto" for PDFs with text layers
  2. Disable high resolution for simple documents
  3. Limit LLM processing to only segment types that need it
  4. Set appropriate target_length to control chunk count
  5. Use error_handling: "Continue" for best-effort processing

Next Steps

Segmentation Strategies

Learn about layout analysis options

OCR Configuration

Understand text extraction strategies

Chunking

Learn how segments are combined

API Reference

See complete API documentation