Welcome to Chunkr

Chunkr is a production-ready service for document layout analysis, OCR, and semantic chunking. Convert PDFs, PPTs, Word docs, and images into RAG/LLM-ready chunks with structured output.

Quickstart

Get up and running with your first API request in minutes

Installation

Set up Chunkr locally with Docker and GPU/CPU support

API Reference

Explore the complete API documentation

GitHub

View the open-source repository and contribute

Key Features

Automatically detect and segment layout elements including:
  • Tables with structure preservation
  • Images and figures
  • Headers and sections
  • Lists and captions
  • Formulas and equations
  • Text paragraphs
Optical Character Recognition with precise bounding box coordinates:
  • Configurable OCR strategies (All or Auto)
  • High-resolution image processing
  • Text layer extraction from native PDFs
  • Support for scanned documents
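The two OCR strategies can be pictured with a small decision helper. This is only a sketch: the exact semantics of `Auto` are an assumption here (OCR is skipped when a page already has a text layer), and `OcrStrategy` and `needs_ocr` are hypothetical names, not part of the Chunkr API.

```python
from enum import Enum


class OcrStrategy(Enum):
    ALL = "All"    # run OCR on every page
    AUTO = "Auto"  # assumed: run OCR only when no text layer is available


def needs_ocr(strategy: OcrStrategy, has_text_layer: bool) -> bool:
    """Decide whether a page goes through OCR (hypothetical helper)."""
    if strategy is OcrStrategy.ALL:
        return True
    # AUTO: reuse the embedded text layer of a native PDF when present.
    return not has_text_layer


pages = [
    {"page": 1, "has_text_layer": True},   # native PDF page
    {"page": 2, "has_text_layer": False},  # scanned page
]
ocr_pages_auto = [p["page"] for p in pages if needs_ocr(OcrStrategy.AUTO, p["has_text_layer"])]
ocr_pages_all = [p["page"] for p in pages if needs_ocr(OcrStrategy.ALL, p["has_text_layer"])]
print(ocr_pages_auto, ocr_pages_all)  # [2] [1, 2]
```

Under this assumption, `Auto` saves work on native PDFs, while `All` is the safer choice when an embedded text layer may be unreliable.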
Get your documents in multiple formats:
  • HTML with semantic markup
  • Markdown for documentation
  • JSON with coordinates and metadata
  • Configurable per segment type
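To illustrate what "configurable per segment type" could look like, here is a minimal sketch that renders one segment as HTML, Markdown, or JSON depending on its type. The `Segment` fields, the bounding-box order, and the `FORMAT_BY_TYPE` mapping are all assumptions made for the example, not the actual Chunkr schema.

```python
import json
from dataclasses import asdict, dataclass
from html import escape


@dataclass
class Segment:
    segment_type: str   # hypothetical labels such as "Title", "Text", "Table"
    text: str
    bbox: tuple         # assumed order: (left, top, width, height)
    page_number: int


# Hypothetical per-segment-type configuration: which format to emit for each type.
FORMAT_BY_TYPE = {"Title": "markdown", "Table": "html"}
DEFAULT_FORMAT = "json"


def render(seg: Segment) -> str:
    """Render a segment in the format configured for its type."""
    fmt = FORMAT_BY_TYPE.get(seg.segment_type, DEFAULT_FORMAT)
    if fmt == "html":
        return f'<div class="{escape(seg.segment_type)}">{escape(seg.text)}</div>'
    if fmt == "markdown":
        return f"# {seg.text}" if seg.segment_type == "Title" else seg.text
    return json.dumps(asdict(seg))  # JSON keeps coordinates and metadata


title = Segment("Title", "Quarterly Report", (72, 40, 300, 24), 1)
body = Segment("Text", "Revenue grew in Q2.", (72, 80, 450, 60), 1)
print(render(title))  # Markdown heading
print(render(body))   # JSON with bbox and page number
```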
LLM-powered content enhancement:
  • Table structure extraction
  • Image description generation
  • Content summarization
  • Semantic chunking for RAG
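As a rough picture of chunking for RAG, the sketch below groups layout segments into chunks with a greedy heuristic: a new chunk starts at every header or when a word budget would be exceeded. This is not Chunkr's algorithm; `chunk_segments`, the `"Header"` label, and `target_words` are hypothetical and chosen only for illustration.

```python
def chunk_segments(segments, target_words=8):
    """Greedily pack (segment_type, text) pairs into chunks.

    A chunk is closed when the next segment is a header or when adding it
    would push the chunk past `target_words`.
    """
    chunks, current, count = [], [], 0
    for seg_type, text in segments:
        words = len(text.split())
        starts_section = seg_type == "Header"
        if current and (starts_section or count + words > target_words):
            chunks.append(current)
            current, count = [], 0
        current.append(text)
        count += words
    if current:
        chunks.append(current)
    return chunks


segments = [
    ("Header", "Introduction"),
    ("Text", "Chunkr converts documents into chunks."),
    ("Header", "Methods"),
    ("Text", "Layout analysis runs first."),
    ("Text", "OCR follows on pages without text."),
]
chunks = chunk_segments(segments, target_words=8)
for chunk in chunks:
    print(chunk)
```

Keeping headers at the start of a chunk means each retrieved chunk carries its own section context, which tends to help retrieval quality.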

Open Source vs Cloud API

The open-source version uses community models and is perfect for development and testing. For production workloads with higher accuracy and enterprise reliability, check out the Chunkr Cloud API.

| Feature | Open Source | Cloud API |
|---|---|---|
| Layout Analysis | Open-source models | Proprietary in-house models |
| OCR Accuracy | Community OCR engines | Optimized OCR stack |
| VLM Processing | Basic open VLMs | Enhanced proprietary VLMs |
| Excel Support | Not supported | ✅ Native parser |
| Infrastructure | Self-hosted | Fully managed cloud |
| Support | Discord community | Dedicated support |

The open-source release uses the AGPL-3.0 license. For commercial use without AGPL compliance, contact mehul@chunkr.ai.

Document Types Supported

Chunkr processes a wide range of document formats:
  • PDF - Native and scanned PDFs with full layout analysis
  • PowerPoint - PPT and PPTX presentations
  • Word - DOC and DOCX documents
  • Images - PNG, JPG, TIFF, and other image formats
Excel support is available exclusively in the Cloud API.

Architecture Overview

Chunkr is built with a modern microservices architecture:
  • Server - REST API written in Rust (Actix-Web)
  • Task Queue - Redis-backed job processing
  • Segmentation - YOLO-based layout detection with GPU acceleration
  • OCR - DocTR (Document Text Recognition) engine
  • Storage - MinIO for object storage, PostgreSQL for metadata
  • Web UI - React-based interface for testing and visualization
All services are containerized and orchestrated with Docker Compose for easy deployment.
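The component list above implies a request flow: the server stores an upload and enqueues a task, and a worker runs segmentation and OCR before saving the result. The toy walk-through below mimics that flow entirely in memory. Everything in it is a stand-in: `queue.Queue` replaces Redis, plain dictionaries replace MinIO and PostgreSQL, the `segment` and `ocr` stubs replace the real models, and the status names are assumptions.

```python
import queue
import uuid

task_queue = queue.Queue()  # stand-in for the Redis-backed task queue
object_store = {}           # stand-in for MinIO (uploaded files)
metadata_db = {}            # stand-in for PostgreSQL (task status and results)


def submit(file_name, data):
    """Server role: store the upload, record the task, enqueue it."""
    task_id = str(uuid.uuid4())
    object_store[task_id] = data
    metadata_db[task_id] = {"file": file_name, "status": "Queued", "segments": None}
    task_queue.put(task_id)
    return task_id


def segment(data):
    """Stub for layout detection: one 'Text' segment per non-empty line."""
    return [{"type": "Text", "raw": line} for line in data.splitlines() if line.strip()]


def ocr(seg):
    """Stub for OCR: decode the raw bytes instead of recognising an image."""
    return seg["raw"].decode("utf-8").strip()


def work_once():
    """Worker role: take one task, run segmentation then OCR, save the result."""
    task_id = task_queue.get()
    metadata_db[task_id]["status"] = "Processing"
    segments = segment(object_store[task_id])
    metadata_db[task_id]["segments"] = [{"type": s["type"], "text": ocr(s)} for s in segments]
    metadata_db[task_id]["status"] = "Succeeded"
    task_queue.task_done()
    return task_id


task_id = submit("report.pdf", b"First line\n\nSecond line\n")
work_once()
result = metadata_db[task_id]
print(result["status"], [s["text"] for s in result["segments"]])
```

Separating submission from processing through a queue is what lets the API respond immediately while long-running layout and OCR work happens in the background.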

Use Cases

RAG Pipelines

Extract and chunk documents for retrieval-augmented generation systems

Document Processing

Automate extraction of structured data from unstructured documents

Content Migration

Convert legacy documents to modern formats (HTML, Markdown)

Search Indexing

Extract text and metadata for full-text search engines

Community & Support

Join our Discord community to ask questions and get help.

Next Steps

1. Quick Start - Follow our quickstart guide to make your first API request
2. Installation - Set up Chunkr locally with our installation guide
3. Explore the API - Dive into the API reference to learn about all available features