Welcome to Chunkr
Chunkr is a production-ready service for document layout analysis, OCR, and semantic chunking. Convert PDFs, PPTs, Word docs, and images into RAG/LLM-ready chunks with structured output.
Quickstart
Get up and running with your first API request in minutes
Installation
Set up Chunkr locally with Docker and GPU/CPU support
API Reference
Explore the complete API documentation
GitHub
View the open-source repository and contribute
Key Features
Layout Analysis
Automatically detect and segment layout elements including:
- Tables with structure preservation
- Images and figures
- Headers and sections
- Lists and captions
- Formulas and equations
- Text paragraphs
OCR + Bounding Boxes
Optical Character Recognition with precise bounding box coordinates:
- Configurable OCR strategies (All or Auto)
- High-resolution image processing
- Text layer extraction from native PDFs
- Support for scanned documents
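As a rough illustration of how the two OCR strategies might differ, here is a minimal sketch. The semantics are an assumption inferred from the strategy names and the bullets above: All is taken to mean "OCR every page", and Auto is taken to mean "use the native text layer when one exists, otherwise fall back to OCR". The names `OcrStrategy`, `BoundingBox`, `Word`, and `needs_ocr` are hypothetical and are not part of the Chunkr API.

```python
from dataclasses import dataclass
from enum import Enum


class OcrStrategy(Enum):
    ALL = "All"    # assumption: run OCR on every page
    AUTO = "Auto"  # assumption: OCR only pages without a usable text layer


@dataclass
class BoundingBox:
    # Hypothetical layout: top-left corner plus size, in page units.
    left: float
    top: float
    width: float
    height: float


@dataclass
class Word:
    text: str
    bbox: BoundingBox
    source: str  # "text_layer" or "ocr"


def needs_ocr(strategy: OcrStrategy, has_text_layer: bool) -> bool:
    """Decide whether a page should be sent through OCR."""
    if strategy is OcrStrategy.ALL:
        return True
    # AUTO: scanned pages have no text layer, so they still need OCR.
    return not has_text_layer


# A native PDF page under Auto keeps its text layer; a scanned page does not.
native_page = needs_ocr(OcrStrategy.AUTO, has_text_layer=True)
scanned_page = needs_ocr(OcrStrategy.AUTO, has_text_layer=False)
```

Under these assumptions, Auto avoids redundant OCR on native PDFs, while All trades extra compute for uniform treatment of every page.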
Structured Output
Get your documents in multiple formats:
- HTML with semantic markup
- Markdown for documentation
- JSON with coordinates and metadata
- Configurable per segment type
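To make "configurable per segment type" concrete, the sketch below picks an output format for each segment from a lookup table and emits a JSON-friendly record that keeps the coordinates. Everything here is illustrative: the segment type names, the `FORMAT_BY_TYPE` map, the `bbox` field layout, and `render_segment` are assumptions, not Chunkr's actual configuration schema.

```python
import json
from html import escape

# Hypothetical mapping from segment type to output format.
FORMAT_BY_TYPE = {"Title": "markdown", "Text": "html"}
DEFAULT_FORMAT = "markdown"


def render_segment(segment: dict, format_by_type: dict = FORMAT_BY_TYPE) -> dict:
    """Render one segment in the format configured for its type."""
    fmt = format_by_type.get(segment["type"], DEFAULT_FORMAT)
    text = segment["text"]
    if fmt == "html":
        content = f"<p>{escape(text)}</p>"
    elif fmt == "markdown" and segment["type"] == "Title":
        content = f"# {text}"
    else:
        content = text
    # Keep coordinates next to the rendered content.
    return {
        "type": segment["type"],
        "format": fmt,
        "content": content,
        "bbox": segment["bbox"],
    }


title = {
    "type": "Title",
    "text": "Introduction",
    "bbox": {"left": 72, "top": 90, "width": 300, "height": 24},
}
print(json.dumps(render_segment(title), indent=2))
```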
Vision-Language Model Processing
LLM-powered content enhancement:
- Table structure extraction
- Image description generation
- Content summarization
- Semantic chunking for RAG
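One simple way to think about semantic chunking is to group consecutive layout segments until a size budget is reached, while always starting a new chunk at a heading. The sketch below shows that idea only. It is not Chunkr's algorithm, and `Segment`, `chunk_segments`, the heading type names, and the word-count budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical segment types that should open a new chunk.
HEADING_TYPES = {"Title", "Section header"}


@dataclass
class Segment:
    kind: str
    text: str


def chunk_segments(segments: List[Segment], target_words: int = 50) -> List[List[Segment]]:
    """Group segments into chunks, breaking at headings and at the word budget."""
    chunks: List[List[Segment]] = []
    current: List[Segment] = []
    count = 0
    for seg in segments:
        n = len(seg.text.split())
        starts_section = seg.kind in HEADING_TYPES
        # Close the current chunk before a heading or before overflowing the budget.
        if current and (starts_section or count + n > target_words):
            chunks.append(current)
            current, count = [], 0
        current.append(seg)
        count += n
    if current:
        chunks.append(current)
    return chunks


doc = [
    Segment("Title", "Introduction"),
    Segment("Text", "Chunkr converts documents into chunks."),
    Segment("Section header", "Methods"),
    Segment("Text", "Layout detection runs before OCR."),
]
chunks = chunk_segments(doc)
```

With the sample `doc`, the heading rule keeps each section's heading together with its body text, which tends to give retrieval systems more self-contained passages than fixed-size splitting.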
Open Source vs Cloud API
The open-source version uses community models and is perfect for development and testing. For production workloads with higher accuracy and enterprise reliability, check out the Chunkr Cloud API.

| Feature | Open Source | Cloud API |
|---|---|---|
| Layout Analysis | Open-source models | Proprietary in-house models |
| OCR Accuracy | Community OCR engines | Optimized OCR stack |
| VLM Processing | Basic open VLMs | Enhanced proprietary VLMs |
| Excel Support | ❌ | ✅ Native parser |
| Infrastructure | Self-hosted | Fully managed cloud |
| Support | Discord community | Dedicated support |
The open-source release uses the AGPL-3.0 license. For commercial use without AGPL compliance, contact mehul@chunkr.ai.
Document Types Supported
Chunkr processes a wide range of document formats:
- PDF - Native and scanned PDFs with full layout analysis
- PowerPoint - PPT and PPTX presentations
- Word - DOC and DOCX documents
- Images - PNG, JPG, TIFF, and other image formats
Excel support is available exclusively in the Cloud API.
Architecture Overview
Chunkr is built with a modern microservices architecture:
- Server - REST API (Rust/Actix-Web)
- Task Queue - Redis-backed job processing
- Segmentation - YOLO-based layout detection with GPU acceleration
- OCR - DocTR (Document Text Recognition) engine
- Storage - MinIO for object storage, PostgreSQL for metadata
- Web UI - React-based interface for testing and visualization
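The way these components could fit together is sketched below as a toy, in-process simulation: the server stores the upload and enqueues a task, and a worker pulls the task, runs segmentation and then OCR, and records the result. This is an assumption about the flow, not a description of Chunkr's internals. A `deque` stands in for the Redis-backed queue and plain dictionaries stand in for MinIO and PostgreSQL; all function names and status strings are hypothetical.

```python
import uuid
from collections import deque

task_queue = deque()  # stand-in for the Redis-backed task queue
object_store = {}     # stand-in for MinIO (raw files)
metadata_db = {}      # stand-in for PostgreSQL (task metadata)


def submit(file_bytes: bytes) -> str:
    """Server side: store the upload, record the task, enqueue it."""
    task_id = str(uuid.uuid4())
    object_store[task_id] = file_bytes
    metadata_db[task_id] = {"status": "queued"}
    task_queue.append(task_id)
    return task_id


def segment(file_bytes: bytes) -> list:
    # Placeholder for the layout detection model.
    return [{"type": "Text", "page": 1}]


def ocr(segments: list) -> list:
    # Placeholder for the OCR engine: attach (empty) text to each segment.
    return [dict(s, text="") for s in segments]


def work_once():
    """Worker side: process one queued task, or return None if idle."""
    if not task_queue:
        return None
    task_id = task_queue.popleft()
    metadata_db[task_id]["status"] = "processing"
    segments = ocr(segment(object_store[task_id]))
    metadata_db[task_id] = {"status": "succeeded", "segments": segments}
    return task_id


task_id = submit(b"%PDF-1.7 ...")
work_once()
print(metadata_db[task_id]["status"])
```

Decoupling submission from processing through a queue is what lets the API respond immediately while GPU-bound work runs on separate workers.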
Use Cases
RAG Pipelines
Extract and chunk documents for retrieval-augmented generation systems
Document Processing
Automate extraction of structured data from unstructured documents
Content Migration
Convert legacy documents to modern formats (HTML, Markdown)
Search Indexing
Extract text and metadata for full-text search engines
Community & Support
Join our community and get help:
- Discord - Join the community
- GitHub Issues - Report bugs
- Email - mehul@chunkr.ai
- Schedule a call - Book 30 minutes
Next Steps
Quickstart
Follow our quickstart guide to make your first API request
Installation
Set up Chunkr locally with our installation guide
Explore the API
Dive into the API reference to learn about all available features