Chunkr’s pipeline orchestrates the entire document processing workflow, from upload to final output. Understanding the pipeline helps you optimize processing for your use case.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lumina-ai-inc/chunkr/llms.txt
Use this file to discover all available pages before exploring further.
Pipeline Architecture
The Chunkr pipeline consists of sequential steps executed in order:
Pipeline Steps
Each step in the pipeline is represented by the PipelineStep enum:
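The steps can be sketched as a Python enum mirroring the server-side PipelineStep type (a hypothetical mirror, in execution order; the variant names come from this page):

```python
from enum import Enum

class PipelineStep(Enum):
    """Hypothetical Python mirror of Chunkr's PipelineStep, in execution order."""
    CONVERT_TO_IMAGES = "ConvertToImages"     # PDF pages -> JPEG images
    CHUNKR_ANALYSIS = "ChunkrAnalysis"        # OCR + layout analysis
    CROP = "Crop"                             # cut segment images from pages
    SEGMENT_PROCESSING = "SegmentProcessing"  # HTML/Markdown/LLM generation
    CHUNKING = "Chunking"                     # group segments into chunks

# The pipeline executes the steps sequentially, in declaration order:
ORDERED_STEPS = list(PipelineStep)
```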
Available Steps
ConvertToImages
Converts PDF pages into high-quality JPEG images for processing.
When it runs: After task initialization
What it does:
- Converts each PDF page to a JPEG image
- Applies a scaling factor based on the high_resolution setting
- Stores images in memory for subsequent steps
high_resolution: true → Higher quality, slower (~7s per page)
high_resolution: false → Standard quality, faster
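The effect of high_resolution on rendering can be sketched as a DPI choice. The base DPI and 2x factor below are assumptions for illustration; Chunkr's actual scaling values are internal:

```python
def page_render_dpi(high_resolution: bool, base_dpi: int = 72) -> float:
    """Return the DPI at which a PDF page is rendered to a JPEG.

    The base_dpi and 2x factor are hypothetical; this only illustrates
    that high_resolution trades speed (~7s/page) for image quality.
    """
    factor = 2.0 if high_resolution else 1.0
    return base_dpi * factor
```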
ChunkrAnalysis
Performs OCR and layout analysis on document pages.
When it runs: After image conversion
What it does:
- Extracts text using OCR based on ocr_strategy
- Detects layout elements based on segmentation_strategy
- Creates initial segments with bounding boxes and OCR results
- Scales coordinates according to resolution settings
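The ocr_strategy decision reduces to a simple rule, sketched below (a simplification; Chunkr's actual text-layer detection heuristics are internal):

```python
def needs_ocr(ocr_strategy: str, page_has_text_layer: bool) -> bool:
    """Decide whether a page is sent through OCR.

    "All": OCR every page (~0.5s per page).
    "Auto": reuse the PDF's existing text layer when one is present.
    Simplified illustration of the rule described in the docs.
    """
    if ocr_strategy == "All":
        return True
    if ocr_strategy == "Auto":
        return not page_has_text_layer
    raise ValueError(f"unknown ocr_strategy: {ocr_strategy!r}")
```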
Crop
Crops segment images from page images for visual elements.
When it runs: After analysis, before segment processing
What it does:
- Extracts image regions for each segment based on bounding boxes
- Stores cropped images for pictures, tables, and other visual elements
- Applies padding configured in segmentation_padding
- Respects each segment type's crop_image setting
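The crop step amounts to expanding each segment's bounding box by the padding and clamping it to the page; a minimal sketch (the tuple layout and parameter names are illustrative, not Chunkr's internal representation):

```python
def padded_crop_box(bbox, page_w, page_h, padding=0):
    """Expand a segment bounding box by `padding` pixels, clamped to the page.

    bbox is (left, top, width, height); field layout is illustrative.
    Returns the expanded box in the same (left, top, width, height) form.
    """
    left, top, w, h = bbox
    x0 = max(0, left - padding)
    y0 = max(0, top - padding)
    x1 = min(page_w, left + w + padding)
    y1 = min(page_h, top + h + padding)
    return (x0, y0, x1 - x0, y1 - y0)
```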
SegmentProcessing
Post-processes segments to generate structured content.
When it runs: After cropping
What it does:
- Generates HTML/Markdown from segments using configured strategy
- Applies LLM models for complex elements (tables, formulas)
- Creates the content, html, markdown, and optionally llm fields
- Processes each segment type according to its configuration
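The per-type strategy defaults described above can be sketched as a dispatch table (an illustration of the documented defaults, not Chunkr's internal code):

```python
from typing import Optional

# Defaults per the docs: LLM for Table and Formula, heuristic "Auto" otherwise.
DEFAULT_STRATEGY = {"Table": "LLM", "Formula": "LLM"}

def generation_strategy(segment_type: str, override: Optional[str] = None) -> str:
    """Return the content-generation strategy for a segment type,
    honoring a per-type override when one is configured."""
    if override is not None:
        return override
    return DEFAULT_STRATEGY.get(segment_type, "Auto")
```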
Chunking
Combines segments into semantically meaningful chunks.
When it runs: Final step
What it does:
- Groups segments based on hierarchy and target length
- Generates embed text for each chunk
- Calculates chunk lengths using specified tokenizer
- Optionally filters out headers and footers
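A greedy grouping pass gives the flavor of this step (a simplification: Chunkr additionally honors heading hierarchy when grouping):

```python
def group_segments(segment_lengths, target_length):
    """Greedily pack segment lengths into chunks of roughly target_length.

    target_length == 0 keeps one segment per chunk, as documented.
    Simplified sketch; the real chunker also respects segment hierarchy.
    """
    if target_length == 0:
        return [[n] for n in segment_lengths]
    chunks, current, total = [], [], 0
    for n in segment_lengths:
        if current and total + n > target_length:
            chunks.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        chunks.append(current)
    return chunks
```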
Pipeline Configuration
The pipeline behavior is controlled through the Configuration object:
Core Settings
Controls layout analysis approach:
LayoutAnalysis: Detect layout elements (tables, images, etc.)
Page: Treat each page as a single segment
Controls OCR behavior:
All: Process all pages with OCR (~0.5s per page)
Auto: Use existing text layer when available
Use high-resolution images for processing.
true: Better accuracy, ~7s latency per page
false: Faster processing, standard quality
How to handle processing errors:
Fail: Stop processing on any error
Continue: Attempt to continue despite non-critical errors
Seconds until task is automatically deleted. Expired tasks cannot be accessed or updated.
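Put together, the core settings might look like this (a hedged sketch as a plain dict; field spellings follow this page, and the exact casing of the expiry field is an assumption; check the API reference):

```python
# Illustrative core configuration; values taken from the options above.
config = {
    "segmentation_strategy": "LayoutAnalysis",  # or "Page"
    "ocr_strategy": "Auto",                     # or "All"
    "high_resolution": False,                   # True: better accuracy, ~7s/page
    "error_handling": "Continue",               # or "Fail"
    "expires_in": 3600,                         # assumed field name: seconds until deletion
}
```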
Segment Processing Configuration
Each segment type can be configured independently: Title, SectionHeader, Text, ListItem, Table, Picture, Caption, Formula, Footnote, PageHeader, PageFooter, Page
How content is generated:
Auto: Heuristic-based conversion (default for most types)
LLM: Use fine-tuned Chunkr models (default for Table, Formula)
Output format:
Markdown: Markdown representation (default for most types)
Html: HTML representation (default for Table)
When to crop segment images:
Auto: Crop only when needed for processing
All: Always crop (for Picture type)
Optional custom prompt for LLM processing. Use this to generate a custom llm field in the segment output using off-the-shelf models.
Which content sources to include in the chunk embed field:
Content: The main content field (HTML or Markdown based on format)
LLM: LLM-generated custom content
HTML: (Deprecated) HTML representation
Markdown: (Deprecated) Markdown representation
Use full page image as context for LLM generation. Provides more context but increases processing time.
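A per-type override combining these options might look like this (the nested field names are illustrative assumptions, not confirmed API spellings; the option values come from this page):

```python
# Illustrative per-segment-type configuration; field names are assumptions.
segment_processing = {
    "Table": {
        "strategy": "LLM",     # documented default for Table
        "format": "Html",      # documented default output format for Table
        "crop_image": "Auto",  # crop only when needed
    },
    "Picture": {
        "crop_image": "All",   # always crop pictures
        "llm": "Describe this image in one sentence.",  # custom prompt -> llm field
        "embed_sources": ["Content", "LLM"],
        "extended_context": True,  # use the full page image as LLM context
    },
}
```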
Chunk Processing Configuration
Target number of tokens/words per chunk. Set to 0 to keep one segment per chunk.
Tokenizer for counting chunk length:
Word: Simple whitespace tokenization
Cl100kBase: OpenAI tokenizer (GPT-3.5, GPT-4)
xlm-roberta-base: Multilingual RoBERTa
bert-base-uncased: BERT base
- Any Hugging Face tokenizer (e.g., "Qwen/Qwen-tokenizer")
Whether to exclude page headers and footers from chunks. Recommended as they break reading order across pages.
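With the Word tokenizer, chunk length is simply the whitespace token count; a minimal sketch (the other tokenizers need their respective libraries, so only Word is shown):

```python
def chunk_length(text: str, tokenizer: str = "Word") -> int:
    """Count a chunk's length under the given tokenizer.

    Only the simple "Word" (whitespace) tokenizer is sketched here;
    Cl100kBase and Hugging Face tokenizers require external libraries.
    """
    if tokenizer == "Word":
        return len(text.split())
    raise NotImplementedError(f"tokenizer {tokenizer!r} not sketched")
```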
Pipeline State
The pipeline maintains state throughout processing:
- Downloads the input file from storage
- Converts to PDF if needed
- Generates page images
- Performs OCR and segmentation
- Crops segment images
- Processes segments
- Creates chunks
- Uploads artifacts to storage
Error Handling
The pipeline includes retry logic and error handling:
Fail
Stop processing on any error
- Task fails immediately when an error occurs
- Provides fastest feedback on issues
- Recommended for production when quality is critical
Continue
Attempt to continue despite non-critical errors
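The two error_handling modes can be sketched as a wrapper around the step sequence (illustrative only; Chunkr also retries internally before giving up):

```python
def run_pipeline(steps, error_handling="Fail"):
    """Run step callables in order.

    "Fail": re-raise the first error, so the task fails immediately.
    "Continue": record non-critical errors and keep going.
    Returns the list of errors collected in Continue mode.
    """
    errors = []
    for step in steps:
        try:
            step()
        except Exception as exc:
            if error_handling == "Fail":
                raise
            errors.append(exc)
    return errors
```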
Monitoring Pipeline Progress
Track pipeline progress through task status:
"Task initialized" → Ready to start
"Converting pages to images" → Step 1
"Running Chunkr analysis" → Step 2 (OCR + Segmentation)
"Cropping segments" → Step 3
"Processing segments" → Step 4 (LLM processing)
"Chunking" → Step 5
"Finishing up" → Uploading artifacts
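When polling a task, these status messages map directly onto the pipeline steps, which makes a rough progress fraction easy to derive (the mapping uses the messages listed above; the fraction itself is an illustrative convention):

```python
# Status messages from the docs, mapped to their step index.
STATUS_TO_STEP = {
    "Task initialized": 0,
    "Converting pages to images": 1,
    "Running Chunkr analysis": 2,
    "Cropping segments": 3,
    "Processing segments": 4,
    "Chunking": 5,
    "Finishing up": 6,  # uploading artifacts
}

def progress(message: str, total: int = 6) -> float:
    """Rough fraction complete from a status message (unknown -> 0.0)."""
    return STATUS_TO_STEP.get(message, 0) / total
```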
Performance Considerations
Optimization tips:
- Use ocr_strategy: "Auto" for PDFs with text layers
- Disable high resolution for simple documents
- Limit LLM processing to only segment types that need it
- Set an appropriate target_length to control chunk count
- Use error_handling: "Continue" for best-effort processing
Next Steps
Segmentation Strategies
Learn about layout analysis options
OCR Configuration
Understand text extraction strategies
Chunking
Learn how segments are combined
API Reference
See complete API documentation