
Chunkr’s pipeline orchestrates the entire document processing workflow, from upload to final output. Understanding the pipeline helps you optimize processing for your use case.

Pipeline Architecture

The Chunkr pipeline consists of a fixed sequence of steps, executed in order.

Pipeline Steps

Each step in the pipeline is represented by the PipelineStep enum:

Available Steps

pub enum PipelineStep {
    ConvertToImages,
    ChunkrAnalysis,
    Crop,
    SegmentProcessing,
    Chunking,
}
ConvertToImages

Converts PDF pages into high-quality JPEG images for processing.
When it runs: After task initialization
What it does:
  • Converts each PDF page to a JPEG image
  • Applies a scaling factor based on the high_resolution setting
  • Stores images in memory for subsequent steps
Configuration:
  • high_resolution: true → Higher quality, slower (~7s per page)
  • high_resolution: false → Standard quality, faster
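The resolution trade-off can be sketched as choosing a render scale and applying it to the page dimensions. The scale factors below are illustrative assumptions, not Chunkr's internal constants.

```rust
/// Illustrative sketch: choose a render scale from the
/// `high_resolution` flag and apply it to a page's point size
/// to get pixel dimensions. The factor values are assumed for
/// illustration only.
fn render_scale(high_resolution: bool) -> f64 {
    if high_resolution { 2.0 } else { 1.0 }
}

fn scaled_dims(width_pts: f64, height_pts: f64, high_resolution: bool) -> (u32, u32) {
    let s = render_scale(high_resolution);
    ((width_pts * s).round() as u32, (height_pts * s).round() as u32)
}

fn main() {
    // A US Letter page (612 x 792 pt) rendered at high resolution.
    let (w, h) = scaled_dims(612.0, 792.0, true);
    println!("{}x{}", w, h); // 1224x1584
}
```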
ChunkrAnalysis

Performs OCR and layout analysis on document pages.
When it runs: After image conversion
What it does:
  • Extracts text using OCR based on ocr_strategy
  • Detects layout elements based on segmentation_strategy
  • Creates initial segments with bounding boxes and OCR results
  • Scales coordinates according to resolution settings
Output: Array of segments with type classifications and OCR data
Crop

Crops segment images from page images for visual elements.
When it runs: After analysis, before segment processing
What it does:
  • Extracts image regions for each segment based on bounding boxes
  • Stores cropped images for pictures, tables, and other visual elements
  • Applies padding configured in segmentation_padding
Configuration: Controlled per segment type via the crop_image setting
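A minimal sketch of the padded-crop computation: expand the segment's bounding box by the configured padding and clamp it to the page image bounds. The type and field names here are assumptions for illustration, not Chunkr's actual types.

```rust
/// Illustrative sketch: expand a segment bounding box by
/// `padding` pixels on each side, clamped to the page image.
/// Field names are assumptions, not Chunkr's actual types.
#[derive(Debug, PartialEq)]
struct BBox { left: u32, top: u32, width: u32, height: u32 }

fn padded_crop(b: &BBox, padding: u32, page_w: u32, page_h: u32) -> BBox {
    // saturating_sub keeps the box from running off the top-left edge.
    let left = b.left.saturating_sub(padding);
    let top = b.top.saturating_sub(padding);
    let right = (b.left + b.width + padding).min(page_w);
    let bottom = (b.top + b.height + padding).min(page_h);
    BBox { left, top, width: right - left, height: bottom - top }
}

fn main() {
    let seg = BBox { left: 5, top: 10, width: 100, height: 50 };
    // Padding is clipped at the left edge (5 - 8 saturates to 0).
    let crop = padded_crop(&seg, 8, 1224, 1584);
    println!("{:?}", crop);
}
```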
SegmentProcessing

Post-processes segments to generate structured content.
When it runs: After cropping
What it does:
  • Generates HTML/Markdown from segments using the configured strategy
  • Applies LLM models to complex elements (tables, formulas)
  • Creates the content, html, markdown, and optionally llm fields
  • Processes each segment type according to its configuration
See: Segment Processing Configuration
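The per-type defaults described in this doc (LLM generation for tables and formulas, heuristics elsewhere; HTML output for tables, Markdown otherwise) can be sketched as a simple match. This is an illustration of the documented defaults, not Chunkr's source.

```rust
#[derive(Debug, PartialEq)]
enum GenerationStrategy { Auto, Llm }

#[derive(Debug, PartialEq)]
enum SegmentFormat { Markdown, Html }

/// Illustrative sketch of the documented defaults: Table and
/// Formula default to LLM generation, Table defaults to HTML
/// output, everything else to heuristic Markdown.
fn default_processing(segment_type: &str) -> (GenerationStrategy, SegmentFormat) {
    match segment_type {
        "Table" => (GenerationStrategy::Llm, SegmentFormat::Html),
        "Formula" => (GenerationStrategy::Llm, SegmentFormat::Markdown),
        _ => (GenerationStrategy::Auto, SegmentFormat::Markdown),
    }
}

fn main() {
    println!("{:?}", default_processing("Table"));
    println!("{:?}", default_processing("Text"));
}
```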
Chunking

Combines segments into semantically meaningful chunks.
When it runs: Final step
What it does:
  • Groups segments based on hierarchy and target length
  • Generates embed text for each chunk
  • Calculates chunk lengths using the specified tokenizer
  • Optionally filters out headers and footers
See: Chunking
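The length-based part of this grouping can be sketched as a greedy pass that accumulates segments until adding the next one would exceed target_length, assuming the Word tokenizer. Hierarchy-aware grouping is omitted here; this is a sketch, not Chunkr's chunker.

```rust
/// Illustrative sketch: greedily group segment texts into chunks
/// whose word counts stay at or under `target_length`. Chunkr's
/// real chunker also respects heading hierarchy; that part is
/// omitted here.
fn chunk_segments(segments: &[&str], target_length: usize) -> Vec<Vec<String>> {
    let mut chunks: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_len = 0;
    for seg in segments {
        let len = seg.split_whitespace().count();
        if !current.is_empty() && current_len + len > target_length {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        current.push(seg.to_string());
        current_len += len;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let segs = ["alpha beta", "gamma delta epsilon", "zeta"];
    // With target_length = 4, the second segment starts a new chunk.
    let chunks = chunk_segments(&segs, 4);
    println!("{} chunks", chunks.len()); // 2 chunks
}
```

Note that target_length = 0 naturally degenerates to one segment per chunk in this sketch, matching the documented behavior of that setting.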

Pipeline Configuration

The pipeline behavior is controlled through the Configuration object:

Core Settings

{
  "segmentation_strategy": "LayoutAnalysis",
  "ocr_strategy": "Auto",
  "high_resolution": true,
  "error_handling": "Fail",
  "expires_in": 3600
}
segmentation_strategy (SegmentationStrategy, default: "LayoutAnalysis")
Controls the layout analysis approach:
  • LayoutAnalysis: Detect layout elements (tables, images, etc.)
  • Page: Treat each page as a single segment
See Segmentation for details.
ocr_strategy (OcrStrategy, default: "All")
Controls OCR behavior:
  • All: Process all pages with OCR (~0.5s per page)
  • Auto: Use the existing text layer when available
See OCR for details.
high_resolution (boolean, default: true)
Use high-resolution images for processing.
  • true: Better accuracy, ~7s latency per page
  • false: Faster processing, standard quality
error_handling (ErrorHandlingStrategy, default: "Fail")
How to handle processing errors:
  • Fail: Stop processing on any error
  • Continue: Attempt to continue despite non-critical errors
expires_in (number, default: null)
Seconds until the task is automatically deleted. Expired tasks cannot be accessed or updated.
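The defaults above can be mirrored in a config type, as in this sketch. Field and type names are modeled on the docs, not taken from Chunkr's source; note that the JSON example earlier overrides ocr_strategy to "Auto", while the documented default is "All".

```rust
/// Illustrative sketch mirroring the documented defaults of the
/// Configuration object. Names follow the docs; this is not
/// Chunkr's actual source.
#[derive(Debug)]
struct Configuration {
    segmentation_strategy: String, // "LayoutAnalysis" | "Page"
    ocr_strategy: String,          // "All" | "Auto"
    high_resolution: bool,
    error_handling: String,        // "Fail" | "Continue"
    expires_in: Option<u64>,       // seconds; None = never expires
}

impl Default for Configuration {
    fn default() -> Self {
        Configuration {
            segmentation_strategy: "LayoutAnalysis".into(),
            ocr_strategy: "All".into(),
            high_resolution: true,
            error_handling: "Fail".into(),
            expires_in: None,
        }
    }
}

fn main() {
    let cfg = Configuration::default();
    println!("{:?}", cfg);
}
```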

Segment Processing Configuration

Each segment type can be configured independently:
{
  "segment_processing": {
    "Table": {
      "strategy": "LLM",
      "format": "Html",
      "crop_image": "Auto",
      "llm": "Optional custom prompt",
      "embed_sources": ["Content"],
      "extended_context": false
    },
    "Text": {
      "strategy": "Auto",
      "format": "Markdown",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "Picture": {
      "strategy": "Auto",
      "format": "Markdown",
      "crop_image": "All",
      "embed_sources": ["Content"]
    }
  }
}
Available segment types:
  • Title
  • SectionHeader
  • Text
  • ListItem
  • Table
  • Picture
  • Caption
  • Formula
  • Footnote
  • PageHeader
  • PageFooter
  • Page
Configuration options per segment type:
strategy (GenerationStrategy)
How content is generated:
  • Auto: Heuristic-based conversion (default for most types)
  • LLM: Use fine-tuned Chunkr models (default for Table and Formula)
format (SegmentFormat)
Output format:
  • Markdown: Markdown representation (default for most types)
  • Html: HTML representation (default for Table)
crop_image (CroppingStrategy)
When to crop segment images:
  • Auto: Crop only when needed for processing
  • All: Always crop (default for Picture)
llm (string)
Optional custom prompt for LLM processing. Use this to generate a custom llm field in the segment output using off-the-shelf models.
embed_sources (EmbedSource[])
Which content sources to include in the chunk embed field:
  • Content: The main content field (HTML or Markdown, depending on format)
  • LLM: LLM-generated custom content
  • HTML: (Deprecated) HTML representation
  • Markdown: (Deprecated) Markdown representation
Order matters: sources are concatenated in array order.
extended_context (boolean, default: false)
Use the full page image as context for LLM generation. Provides more context but increases processing time.
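Because embed_sources order matters, the embed text assembly can be sketched as an in-order concatenation. Function and field names here are illustrative assumptions.

```rust
/// Illustrative sketch: build a segment's embed text by
/// concatenating the requested sources in array order.
/// Names are assumptions for illustration.
fn build_embed_text(content: &str, llm: Option<&str>, embed_sources: &[&str]) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for source in embed_sources {
        match *source {
            "Content" => parts.push(content),
            "LLM" => {
                if let Some(text) = llm {
                    parts.push(text);
                }
            }
            _ => {} // Deprecated sources omitted in this sketch.
        }
    }
    parts.join("\n")
}

fn main() {
    // ["LLM", "Content"] puts the LLM summary before the table content.
    let embed = build_embed_text("| a | b |", Some("Table of a and b"), &["LLM", "Content"]);
    println!("{}", embed);
}
```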

Chunk Processing Configuration

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
target_length (number, default: 512)
Target number of tokens/words per chunk. Set to 0 to keep one segment per chunk.
tokenizer (TokenizerType, default: "Word")
Tokenizer used to count chunk length:
  • Word: Simple whitespace tokenization
  • Cl100kBase: OpenAI tokenizer (GPT-3.5, GPT-4)
  • xlm-roberta-base: Multilingual RoBERTa
  • bert-base-uncased: BERT base
  • Any Hugging Face tokenizer (e.g., "Qwen/Qwen-tokenizer")
ignore_headers_and_footers (boolean, default: true)
Whether to exclude page headers and footers from chunks. Recommended, as headers and footers break reading order across pages.
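The default Word tokenizer reduces to whitespace splitting, as in this sketch. The model-based tokenizers (Cl100kBase, Hugging Face models) require their vocabularies and are not reproduced here.

```rust
/// Illustrative sketch of the `Word` tokenizer: chunk length is
/// simply the whitespace-separated token count.
fn word_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    // split_whitespace collapses runs of spaces and newlines.
    println!("{}", word_token_count("Chunkr splits  on\nany whitespace")); // 5
}
```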

Pipeline State

The pipeline maintains state throughout processing:
pub struct Pipeline {
    pub input_file: Option<Arc<NamedTempFile>>,
    pub pdf_file: Option<Arc<NamedTempFile>>,
    pub page_images: Option<Vec<Arc<NamedTempFile>>>,
    pub segment_images: DashMap<String, Arc<NamedTempFile>>,
    pub chunks: Vec<Chunk>,
    pub task: Option<Task>,
    pub task_payload: Option<TaskPayload>,
}
The pipeline:
  1. Downloads the input file from storage
  2. Converts to PDF if needed
  3. Generates page images
  4. Performs OCR and segmentation
  5. Crops segment images
  6. Processes segments
  7. Creates chunks
  8. Uploads artifacts to storage

Error Handling

The pipeline includes retry logic and error handling:
{
  "error_handling": "Continue"
}
Strategies:
Fail: Stop processing on any error
  • The task fails immediately when an error occurs
  • Provides the fastest feedback on issues
  • Recommended for production when quality is critical
Use when you need guaranteed quality and can handle failures.
Continue: Attempt to continue despite non-critical errors
  • Best-effort processing: non-critical errors are tolerated so the task can finish
Use when partial results are better than a failed task.
Each pipeline step has configurable retry logic with exponential backoff (default: 3 retries).
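The retry behavior can be sketched as below. The 3-retry default comes from the text; the backoff base delay and the function shape are illustrative assumptions, not Chunkr's implementation.

```rust
use std::time::Duration;

/// Illustrative sketch: retry a fallible step up to `max_retries`
/// extra times, doubling the backoff delay on each failure. The
/// base delay is an assumed value for illustration.
fn run_with_retries<F>(mut step: F, max_retries: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = Duration::from_millis(1);
    let mut last_err = String::new();
    for attempt in 0..=max_retries {
        match step() {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = e;
                if attempt < max_retries {
                    std::thread::sleep(delay); // exponential backoff
                    delay *= 2;
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    let mut calls = 0;
    let result = run_with_retries(
        || {
            calls += 1;
            if calls < 3 { Err("transient".into()) } else { Ok(()) }
        },
        3,
    );
    println!("succeeded after {} calls: {}", calls, result.is_ok());
}
```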

Monitoring Pipeline Progress

Track pipeline progress through task status:
{
  "status": "Processing",
  "message": "Running Chunkr analysis",
  "started_at": "2024-03-02T10:30:00Z"
}
Pipeline messages:
  • "Task initialized" → Ready to start
  • "Converting pages to images" → Step 1
  • "Running Chunkr analysis" → Step 2 (OCR + Segmentation)
  • "Cropping segments" → Step 3
  • "Processing segments" → Step 4 (LLM processing)
  • "Chunking" → Step 5
  • "Finishing up" → Uploading artifacts

Performance Considerations

Processing time scales with:
  • Number of pages: Linear scaling
  • High resolution: ~7s per page additional latency
  • OCR strategy: All adds ~0.5s per page
  • LLM processing: Depends on number of tables/formulas
Optimization tips:
  1. Use ocr_strategy: "Auto" for PDFs with text layers
  2. Disable high resolution for simple documents
  3. Limit LLM processing to only segment types that need it
  4. Set appropriate target_length to control chunk count
  5. Use error_handling: "Continue" for best-effort processing

Next Steps

Segmentation Strategies

Learn about layout analysis options

OCR Configuration

Understand text extraction strategies

Chunking

Learn how segments are combined

API Reference

See complete API documentation