Overview
Task settings control how documents are processed, chunked, and managed throughout their lifecycle. These settings include chunk processing, expiration, resolution, and error handling strategies.
Configuration Structure
Task configuration is provided when creating a new processing task.
Chunk Processing
Controls how document segments are grouped into chunks for retrieval and embedding.
Parameters
Target number of tokens per chunk. If set to 0, each chunk contains exactly one segment.
How it works:
- Segments are combined until reaching the target length
- Individual segments are never split (they remain intact)
- Final chunks may exceed the target slightly to include complete segments
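The grouping rules above can be illustrated with a small greedy packer. This is a minimal sketch under stated assumptions, not Chunkr's actual implementation: `group_segments` and `word_count` are hypothetical names, segments are plain strings, and whitespace word counts stand in for the tokenizer.

```python
from typing import Callable, List


def word_count(text: str) -> int:
    """Whitespace word count, standing in for a real tokenizer (assumption)."""
    return len(text.split())


def group_segments(
    segments: List[str],
    target_length: int,
    count_tokens: Callable[[str], int] = word_count,
) -> List[List[str]]:
    """Greedily pack whole segments into chunks (hypothetical helper).

    A chunk is closed as soon as its running token count reaches the target,
    so segments are never split and a chunk may overshoot the target.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for segment in segments:
        current.append(segment)
        current_len += count_tokens(segment)
        if current_len >= target_length:
            chunks.append(current)
            current, current_len = [], 0
    if current:
        # Leftover segments that never reached the target form the last chunk.
        chunks.append(current)
    return chunks


segments = [
    "Intro heading",
    "First paragraph with six words here",
    "A table caption",
    "Closing remarks",
]
per_segment = group_segments(segments, 0)  # target 0: one segment per chunk
packed = group_segments(segments, 5)       # target 5: two chunks of two segments
```

With a target of 0 every segment closes its own chunk, which matches the single-segment behavior described above.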
Tokenizer for measuring chunk length. Supports:
Predefined tokenizers:
- Word - Simple whitespace-based tokenization
- Cl100kBase - OpenAI tokenizer (GPT-3.5, GPT-4, text-embedding-ada-002)
- xlm-roberta-base - Multilingual RoBERTa tokenizer
- bert-base-uncased - BERT base uncased tokenizer
Custom tokenizers:
- Any valid HuggingFace model ID (e.g., "Qwen/Qwen-tokenizer", "facebook/bart-large")
Whether to exclude page headers and footers from chunks.
Recommended: Keep this true, as headers/footers often break reading order across pages.
Examples
- Single Segment Chunks
- Word-Based Chunking
- OpenAI Tokenizer
- Custom HuggingFace
Single segment chunks are suited to:
- Fine-grained retrieval
- Segment-level processing
- Maximum precision in search
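The four example cases listed above might be expressed as configuration objects along these lines. This is a sketch only: the key names `target_length`, `tokenizer`, and `ignore_headers_and_footers`, the flat string form of the tokenizer, and the value 512 are assumptions for illustration, so check the API reference for the exact request shape.

```python
import json


def chunk_processing(target_length, tokenizer="Word", ignore_headers_and_footers=True):
    """Build a chunk-processing config dict (hypothetical helper, assumed key names)."""
    if target_length < 0:
        raise ValueError("target_length must be 0 or greater")
    return {
        "target_length": target_length,
        "tokenizer": tokenizer,
        "ignore_headers_and_footers": ignore_headers_and_footers,
    }


examples = {
    # Target 0: each chunk contains exactly one segment.
    "single_segment": chunk_processing(0),
    # Whitespace-based counting.
    "word_based": chunk_processing(512, "Word"),
    # Token counts aligned with OpenAI models.
    "openai": chunk_processing(512, "Cl100kBase"),
    # Any HuggingFace model ID as a custom tokenizer.
    "custom_huggingface": chunk_processing(512, "Qwen/Qwen-tokenizer"),
}

print(json.dumps(examples, indent=2))
```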
Processing Strategies
OCR Strategy
Controls Optical Character Recognition behavior:
- All - Process all pages with OCR (Latency: ~0.5s per page)
- Auto - Selectively apply OCR only to pages with missing or low-quality text
Auto mode uses existing text layers when available, falling back to OCR only when needed.
Segmentation Strategy
Controls document segmentation:
- LayoutAnalysis - Detect layout elements (tables, pictures, formulas) with bounding boxes. Provides fine-grained segmentation.
- Page - Treat each page as a single segment. Faster but without layout element detection.
Error Handling
Controls how errors are handled during processing:
- Fail - Stop processing and fail the task on any error
- Continue - Attempt to continue despite non-critical errors (e.g., LLM refusals, rate limits)
- Fail on Error
- Continue on Error
Use Fail when:
- Complete accuracy is critical
- You want to manually review and fix errors
- Processing can be safely retried
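The difference between the two strategies can be sketched in a few lines. This is an illustration of the semantics described above, not the service's internal code: `NonCriticalError`, `process_all`, and `handler` are hypothetical names, and which errors count as non-critical is an assumption of the sketch.

```python
from typing import Callable, Iterable, List, Tuple


class NonCriticalError(Exception):
    """Recoverable problem such as an LLM refusal or rate limit (hypothetical type)."""


def process_all(
    items: Iterable[str],
    handler: Callable[[str], str],
    strategy: str = "Fail",
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run handler over items, applying the chosen error-handling strategy."""
    results: List[str] = []
    errors: List[Tuple[str, str]] = []
    for item in items:
        try:
            results.append(handler(item))
        except NonCriticalError as exc:
            if strategy == "Fail":
                raise  # Fail: the first error stops the whole task
            errors.append((item, str(exc)))  # Continue: record it and move on
    return results, errors


def handler(item: str) -> str:
    # Toy handler: one input simulates an LLM refusal.
    if item == "refused":
        raise NonCriticalError("LLM refusal")
    return item.upper()


results, errors = process_all(["a", "refused", "b"], handler, strategy="Continue")
```

Under Continue the caller receives partial results plus a list of what went wrong. Under Fail the same input raises on the second item, and errors of any other type propagate in both modes.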
Resolution Settings
Whether to use high-resolution images for cropping and post-processing.
Trade-offs:
- true - Better quality for image segments, tables, and formulas (Latency: ~7s per page)
- false - Faster processing with standard resolution
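For a rough sense of scale, the per-page figures quoted on this page (~0.5s for full OCR, ~7s for high resolution) can be multiplied out. This back-of-envelope sketch assumes the two costs simply add and that pages are handled one after another, which ignores any parallelism, so treat the result as a loose upper bound. The function name is hypothetical.

```python
OCR_ALL_SECONDS_PER_PAGE = 0.5   # figure quoted for the All OCR strategy
HIGH_RES_SECONDS_PER_PAGE = 7.0  # figure quoted for high resolution


def estimated_added_latency(pages: int, ocr_all: bool, high_resolution: bool) -> float:
    """Sequential estimate in seconds (assumes costs add and scale linearly)."""
    per_page = 0.0
    if ocr_all:
        per_page += OCR_ALL_SECONDS_PER_PAGE
    if high_resolution:
        per_page += HIGH_RES_SECONDS_PER_PAGE
    return pages * per_page


# A 40-page document under each combination of settings.
ocr_only = estimated_added_latency(40, ocr_all=True, high_resolution=False)
high_res_only = estimated_added_latency(40, ocr_all=False, high_resolution=True)
both = estimated_added_latency(40, ocr_all=True, high_resolution=True)
```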
Task Expiration
Number of seconds until the task is deleted. Expired tasks cannot be:
- Updated
- Polled
- Accessed via web interface
Related setting: the JOB__EXPIRATION_TIME environment variable.
- 1 Hour
- 24 Hours
- 7 Days
- No Expiration
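Because the expiration is given in seconds, the presets above translate to concrete values as follows. This is a small sketch: the helper name is hypothetical, and representing "No Expiration" as `None` is an assumption about how an unset value is passed.

```python
from datetime import timedelta
from typing import Dict, Optional


def expires_in_seconds(duration: Optional[timedelta]) -> Optional[int]:
    """Convert a duration to whole seconds; None stands for no expiration (assumption)."""
    if duration is None:
        return None
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("expiration must be a positive duration")
    return seconds


presets: Dict[str, Optional[timedelta]] = {
    "1 Hour": timedelta(hours=1),
    "24 Hours": timedelta(hours=24),
    "7 Days": timedelta(days=7),
    "No Expiration": None,
}

values = {label: expires_in_seconds(duration) for label, duration in presets.items()}
```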
LLM Processing
See the LLM Models page for detailed LLM configuration.
ID of the model to use (from your models.yaml). If not provided, the default model is used.
Temperature for LLM generation. Range: 0.0 (deterministic) to 2.0 (creative).
Maximum tokens in LLM responses. Limits output length and cost.
Fallback behavior when primary LLM fails:
- None - No fallback
- Default - Use configured fallback model
- Model("id") - Use specific model
Complete Example
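A full task configuration combining the settings on this page might look like the sketch below. Every key name here (`chunk_processing`, `ocr_strategy`, `segmentation_strategy`, `error_handling`, `high_resolution`, `expires_in`, `llm_processing` and its fields) is an assumption made for illustration, as is the `validate` helper. Only the allowed values and the temperature range come from the descriptions above.

```python
ALLOWED = {
    "ocr_strategy": {"All", "Auto"},
    "segmentation_strategy": {"LayoutAnalysis", "Page"},
    "error_handling": {"Fail", "Continue"},
}

# Assumed key names; the values are illustrative choices, not recommendations.
task_config = {
    "chunk_processing": {
        "target_length": 512,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "ocr_strategy": "Auto",
    "segmentation_strategy": "LayoutAnalysis",
    "error_handling": "Continue",
    "high_resolution": False,
    "expires_in": 86400,  # 24 hours, in seconds
    "llm_processing": {
        "model_id": None,  # None here means: use the default model
        "temperature": 0.0,
        "max_completion_tokens": 1024,
        "fallback_strategy": "Default",
    },
}


def validate(config: dict) -> list:
    """Return a list of problems found in a config (hypothetical local check)."""
    problems = []
    for key, allowed in ALLOWED.items():
        if config.get(key) not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    temperature = config.get("llm_processing", {}).get("temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    if config.get("chunk_processing", {}).get("target_length", 0) < 0:
        problems.append("target_length must be 0 or greater")
    return problems


problems = validate(task_config)
```

Validating locally before submitting a task is a design convenience of this sketch; the service performs its own validation.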
Task Status
Tasks progress through these states:
- Starting - Task queued and initializing
- Processing - Active processing
- Succeeded - Completed successfully
- Failed - Encountered an error
- Cancelled - Manually cancelled
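Client code that polls a task usually needs to know when to stop. The sketch below models the states as an enum and treats Succeeded, Failed, and Cancelled as terminal. That classification is inferred from the state descriptions rather than stated by the API, and the names `TaskStatus` and `is_terminal` are hypothetical.

```python
from enum import Enum


class TaskStatus(str, Enum):
    STARTING = "Starting"
    PROCESSING = "Processing"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Assumed terminal states: no further progress is expected after these.
TERMINAL_STATES = {TaskStatus.SUCCEEDED, TaskStatus.FAILED, TaskStatus.CANCELLED}


def is_terminal(status: str) -> bool:
    """Return True when polling can stop; raises ValueError for unknown statuses."""
    return TaskStatus(status) in TERMINAL_STATES


# Which of the five states would end a polling loop.
pollable = [s.value for s in TaskStatus if not is_terminal(s.value)]
```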
Best Practices
- Match tokenizer to your embedding model
  - Use Cl100kBase for OpenAI embeddings
  - Use model-specific tokenizers for other embeddings
- Set appropriate chunk sizes
  - Smaller chunks (256-512) for precise retrieval
  - Larger chunks (1024+) for more context
- Use Continue error handling for batch processing
  - Prevents individual failures from blocking entire batches
  - Review logs for partial failures
- Configure expiration based on usage
  - Short expiration (1 hour) for temporary processing
  - Long expiration (7+ days) for production results
  - Monitor storage usage
- Enable high resolution selectively
  - Use for documents with important visual elements
  - Disable for text-heavy documents to improve speed