
Environment Variable Configuration

OpenRAG provides a large range of environment variables that allow you to customize and configure various aspects of the application. This page serves as a comprehensive reference for all available environment variables, providing their types, default values, and descriptions. As new variables are introduced, this page will be updated to reflect the growing configuration options.

OpenRAG converts all files into a pivot Markdown format before chunking. The following environment variables can be used to customize this pipeline:

| Variable | Type | Default | Description |
|---|---|---|---|
| IMAGE_CAPTIONING | bool | true | If true, an LLM is used to describe images and convert them into text using a specific prompt. Images in files are replaced by their descriptions. |
| SAVE_MARKDOWN | bool | false | If true, the pivot-format Markdown produced during parsing is saved. Useful for debugging and verifying the correctness of the generated Markdown. |
| SAVE_UPLOADED_FILES | bool | false | When true, uploaded files are stored on disk. You must enable this option if you want Chainlit to show sources while chatting. |
| PDFLoader | str | MarkerLoader | Specifies the PDF parsing engine to use. Available options: PyMuPDFLoader, PyMuPDF4LLMLoader, MarkerLoader, and DotsOCRLoader. |
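
For example, a .env fragment that keeps the default PDF engine but persists the intermediate artifacts might look like this (the variable names come from the table above; the values are illustrative, not additional defaults):

```bash
# .env: file loading (illustrative values)
IMAGE_CAPTIONING=true
SAVE_MARKDOWN=true         # keep the pivot Markdown for inspection
SAVE_UPLOADED_FILES=true   # required for Chainlit to show sources while chatting
PDFLoader=MarkerLoader
```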

The MarkerLoader is the default PDF parsing engine. It can be configured using the following environment variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| MARKER_POOL_SIZE | int | 1 | Number of workers (typically 1 worker per cluster node) |
| MARKER_MAX_PROCESSES | int | 2 | Number of subprocesses, i.e. the number of concurrent PDFs per worker (increase depending on your available GPU resources) |
| MARKER_MAX_TASKS_PER_CHILD | int | 10 | Number of tasks a child (PDF worker) processes before it is restarted to clean up memory leaks |
| MARKER_MIN_PROCESSES | int | 1 | Minimum number of subprocesses available before triggering a process pool reset |
| MARKER_TIMEOUT | int | 3600 | Timeout in seconds for Marker processes |
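
For example, on a node with more GPU headroom you might allow more concurrent PDFs per worker. The snippet below is an illustrative sketch, not a recommended setting:

```bash
# .env: MarkerLoader tuning (illustrative values)
MARKER_POOL_SIZE=1            # one worker per cluster node
MARKER_MAX_PROCESSES=4        # number of PDFs parsed concurrently per worker
MARKER_MAX_TASKS_PER_CHILD=10 # restart each PDF worker after 10 tasks to limit memory leaks
MARKER_TIMEOUT=3600
```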

Modern OCR pipelines increasingly rely on VLM-based OCR models (such as DeepSeek OCR, DotsOCR, or LightOn OCR) that convert PDF pages into images and feed them into vision-language models with specialized prompts.
This loader integrates that workflow through an OpenAI-compatible API that accepts PDF page images and returns structured Markdown produced by the OCR VLM.

The parameters below configure how the OCR loader communicates with the model server, handles retries, manages concurrency, and controls model sampling behavior.

| Variable | Type | Default | Description |
|---|---|---|---|
| OPENAI_LOADER_BASE_URL | string | http://openai:8000/v1 | Base URL of the OCR loader (OpenAI-compatible endpoint). |
| OPENAI_LOADER_API_KEY | string | EMPTY | API key used to authenticate with the OCR service. |
| OPENAI_LOADER_MODEL | string | dotsocr-model | OCR VLM model to use (e.g., DotsOCR, DeepSeek OCR, LightOn OCR). |
| OPENAI_LOADER_TEMPERATURE | float | 0.2 | Sampling temperature. Lower values produce more deterministic OCR results. |
| OPENAI_LOADER_TIMEOUT | int | 180 | Maximum request duration (in seconds) before timing out. |
| OPENAI_LOADER_MAX_RETRIES | int | 2 | Number of retry attempts for failed OCR requests. |
| OPENAI_LOADER_TOP_P | float | 0.9 | Nucleus sampling parameter that limits generation to the top-p probability mass. |
| OPENAI_LOADER_CONCURRENCY_LIMIT | int | 20 | Maximum number of OCR requests processed concurrently. Useful for multi-page PDF workloads. |
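
As an example, assuming DotsOCRLoader is the engine that consumes these settings, a configuration pointing at a self-hosted endpoint could look like the following (the URL, hostname, and key are placeholders to replace with your own deployment):

```bash
# .env: OCR VLM loader (placeholder endpoint and key)
PDFLoader=DotsOCRLoader
OPENAI_LOADER_BASE_URL=http://your-ocr-server:8000/v1
OPENAI_LOADER_API_KEY=EMPTY
OPENAI_LOADER_MODEL=dotsocr-model
OPENAI_LOADER_CONCURRENCY_LIMIT=20   # raise for large multi-page PDFs if the server can keep up
```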

The transcriber is an OpenAI-compatible audio transcription service powered by Whisper models deployed via VLLM. It processes audio input by automatically segmenting it into chunks using silence detection, then transcribes these chunks in parallel for optimal speed and accuracy. This loader includes a bundled VLLM service for users who prefer to run Whisper locally.

To enable this service, set the TRANSCRIBER_COMPOSE variable to extern/transcriber.yaml. It is disabled by default.

The following environment variables configure its behavior, performance, and connectivity:

| Variable | Type | Default | Description |
|---|---|---|---|
| TRANSCRIBER_BASE_URL | str | http://transcriber:8000/v1 | Base URL for the transcriber API (OpenAI-compatible endpoint). |
| TRANSCRIBER_API_KEY | str | EMPTY | Authentication key for transcriber service requests. |
| TRANSCRIBER_MODEL | str | openai/whisper-large-v3-turbo | Whisper model identifier served by VLLM for speech-to-text conversion. |
| TRANSCRIBER_MAX_CHUNK_MS | int | 30000 | Maximum duration (milliseconds) for each processed audio segment. Defines the upper limit for chunk length. |
| TRANSCRIBER_SILENCE_THRESH_DB | int | -40 | Silence detection threshold (decibels) for voice activity detection. Audio below this level is classified as silence. |
| TRANSCRIBER_MIN_SILENCE_LEN_MS | int | 500 | Minimum silence duration (milliseconds) needed to trigger audio splitting. Shorter pauses are disregarded. |
| TRANSCRIBER_MAX_CONCURRENT_CHUNKS | int | 20 | Maximum number of audio chunks processed simultaneously. Increasing this value improves throughput when sufficient GPU resources are available. |
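
To enable the bundled service and tune it for long recordings, a .env fragment might look like this (values are illustrative):

```bash
# .env: transcriber (illustrative values)
TRANSCRIBER_COMPOSE=extern/transcriber.yaml   # the service is disabled by default
TRANSCRIBER_MODEL=openai/whisper-large-v3-turbo
TRANSCRIBER_MAX_CHUNK_MS=30000
TRANSCRIBER_MAX_CONCURRENT_CHUNKS=20          # increase only if GPU resources allow
```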
The chunking stage is configured using the following variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| CHUNKER | str | recursive_splitter | Defines the chunking strategy: recursive_splitter, semantic_splitter, or markdown_splitter. |
| CONTEXTUAL_RETRIEVAL | bool | true | Enables contextual retrieval, which adds context to each chunk; a technique introduced by Anthropic to improve retrieval performance (Contextual Retrieval). |
| CHUNK_SIZE | int | 512 | Maximum size (in characters) of each chunk. |
| CHUNK_OVERLAP_RATE | float | 0.2 | Percentage of overlap between consecutive chunks. |

After files are converted to Markdown, only the text content is chunked. Image descriptions and Markdown tables are not chunked.

Chunker strategies:

  • recursive_splitter: Uses hierarchical text structure (sections, paragraphs, sentences). Based on RecursiveCharacterTextSplitter, it preserves natural boundaries whenever possible while ensuring chunks never exceed CHUNK_SIZE.

  • markdown_splitter: Splits text using Markdown headers, then subdivides sections that exceed CHUNK_SIZE.

  • semantic_splitter: Uses embedding-based semantic similarity to create meaning-preserving chunks. Oversized chunks are split further so they stay below CHUNK_SIZE.
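
For example, to switch to the Markdown-aware splitter with larger chunks, you could use something like the following (the numbers are illustrative and should be tuned to your corpus and embedding model):

```bash
# .env: chunking (illustrative values)
CHUNKER=markdown_splitter
CHUNK_SIZE=1024
CHUNK_OVERLAP_RATE=0.2
CONTEXTUAL_RETRIEVAL=true   # prepend generated context to each chunk (Anthropic-style contextual retrieval)
```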

Our embedder is OpenAI-compatible and runs on a VLLM instance configured with the following variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| EMBEDDER_MODEL_NAME | str | jinaai/jina-embeddings-v3 | Hugging Face embedding model served by VLLM, e.g., Qwen/Qwen3-Embedding-0.6B or jinaai/jina-embeddings-v3 |
| EMBEDDER_BASE_URL | str | http://vllm:8000/v1 | Base URL of the embedder (OpenAI-style). |
| EMBEDDER_API_KEY | str | EMPTY | API key for authenticating embedder calls. |

If you prefer to use an external embedding service, simply comment out the embedder service in the docker-compose.yaml and provide the variables above in your environment.
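
For instance, pointing OpenRAG at an external OpenAI-compatible embedding endpoint might look like the sketch below (the URL and key are placeholders for your own provider):

```bash
# .env: external embedder (placeholder values)
EMBEDDER_MODEL_NAME=jinaai/jina-embeddings-v3
EMBEDDER_BASE_URL=https://your-embedding-provider.example/v1
EMBEDDER_API_KEY=your-api-key
```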

Our system uses two databases that work together:

  • Vector Database (VDB)

The vector database stores embeddings and is configured using the following environment variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| VDB_HOST | str | milvus | Hostname of the vector database service |
| VDB_PORT | int | 19530 | Port on which the vector database listens |
| VDB_CONNECTOR_NAME | str | milvus | Connector/driver to use for the vector DB. Currently only milvus is implemented |
| VDB_COLLECTION_NAME | str | vdb_test | Name of the collection storing embeddings |
| VDB_HYBRID_SEARCH | bool | true | Activates hybrid search (semantic similarity + keyword search) |

These variables can be overridden when using an external vector database service.
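
For example, an override for an external Milvus instance could look like this (hostname and collection name are placeholders):

```bash
# .env: external Milvus (placeholder values)
VDB_HOST=milvus.internal.example
VDB_PORT=19530
VDB_COLLECTION_NAME=my_collection
VDB_HYBRID_SEARCH=true
```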

  • Relational Database (RDB)

The vector database implementation relies on an underlying PostgreSQL database that stores metadata about partitions and their owners (users). For more information about the data structure, see the data model.

The PostgreSQL database is configured using the following environment variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| POSTGRES_HOST | str | rdb | Hostname of the PostgreSQL database service |
| POSTGRES_PORT | int | 5432 | Port on which the PostgreSQL database listens |
| POSTGRES_USER | str | root | Username for database authentication |
| POSTGRES_PASSWORD | str | root_password | Password for database authentication |
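
The defaults are intended for local development; for any shared deployment you should at least override the credentials, for example:

```bash
# .env: PostgreSQL (illustrative values; use your own credentials)
POSTGRES_HOST=rdb
POSTGRES_PORT=5432
POSTGRES_USER=openrag
POSTGRES_PASSWORD=change-me
```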

The system uses two types of language models:

  • LLM (Large Language Model): The primary model for text generation and chat interactions
  • VLM (Vision Language Model): Used for describing images (see IMAGE_CAPTIONING) and, to reduce load on the primary LLM, also handles contextualization tasks (see CONTEXTUAL_RETRIEVAL)

These are external services that you must provide:

| Variable | Type | Default | Description |
|---|---|---|---|
| BASE_URL | str |  | Base URL of the LLM API endpoint |
| MODEL | str |  | Model identifier for the LLM |
| API_KEY | str |  | API key for authenticating with the LLM service |
| LLM_SEMAPHORE | int | 10 | Concurrency limit (semaphore) for requests to the LLM |

| Variable | Type | Default | Description |
|---|---|---|---|
| VLM_BASE_URL | str |  | Base URL of the VLM API endpoint |
| VLM_MODEL | str |  | Model identifier for the VLM |
| VLM_API_KEY | str |  | API key for authenticating with the VLM service |
| VLM_SEMAPHORE | int | 10 | Concurrency limit (semaphore) for requests to the VLM |
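
A typical setup pointing both models at external OpenAI-compatible endpoints might look like the following; all URLs, model names, and keys are placeholders:

```bash
# .env: external LLM and VLM (placeholder values)
BASE_URL=https://your-llm-provider.example/v1
MODEL=your-chat-model
API_KEY=your-llm-api-key

VLM_BASE_URL=https://your-vlm-provider.example/v1
VLM_MODEL=your-vision-model
VLM_API_KEY=your-vlm-api-key
```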

The retriever fetches relevant documents from the vector database based on query similarity. Retrieved documents are then optionally reranked to improve relevance.

| Variable | Type | Default | Description |
|---|---|---|---|
| RETRIEVER_TOP_K | int | 50 | Number of documents to retrieve before reranking. |
| SIMILARITY_THRESHOLD | float | 0.6 | Minimum similarity score (0.0-1.0) for document retrieval. Documents below this threshold are filtered out. |
| RETRIEVER_TYPE | str | single | Retrieval strategy to use. Options: single, multiQuery, hyde |

Available retrieval strategies:

| Strategy | Description |
|---|---|
| single | Standard semantic search using the original query. Fast and efficient for most queries. |
| multiQuery | Generates multiple query variations to improve recall. Better coverage for ambiguous or complex questions. |
| hyde | Hypothetical Document Embeddings (HyDE): generates a hypothetical answer, then searches for documents similar to it. |
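
For instance, to retrieve more candidates and use the HyDE strategy (values are illustrative):

```bash
# .env: retriever (illustrative values)
RETRIEVER_TYPE=hyde
RETRIEVER_TOP_K=80
SIMILARITY_THRESHOLD=0.5
```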

The reranker enhances search quality by re-scoring and reordering retrieved documents according to their relevance to the user’s query. Currently, the system uses Infinity server for reranking functionality.

Future Improvements

The current Infinity server interface is not OpenAI-compatible, which limits integration flexibility. We plan to improve this by supporting OpenAI-compatible reranker interfaces in future releases.

| Variable | Type | Default | Description |
|---|---|---|---|
| RERANKER_ENABLED | bool | true | Enable or disable the reranking mechanism |
| RERANKER_MODEL | str | Alibaba-NLP/gte-multilingual-reranker-base | Model used for reranking documents. |
| RERANKER_TOP_K | int | 5 | Number of top documents to return after reranking. Increase to 8 for better results if your LLM has a wider context window |
| RERANKER_BASE_URL | str | http://reranker:7997 | Base URL of the reranker service |
| RERANKER_PORT | int | 7997 | Port on which the reranker service listens |
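
As an example, the fragment below keeps reranking enabled and returns more documents for an LLM with a wide context window (values are illustrative):

```bash
# .env: reranker (illustrative values)
RERANKER_ENABLED=true
RERANKER_MODEL=Alibaba-NLP/gte-multilingual-reranker-base
RERANKER_TOP_K=8
```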

The RAG pipeline comes with preconfigured prompts in ./prompts/example1. The following prompt templates are available in that folder:

| Template File | Purpose |
|---|---|
| sys_prompt_tmpl.txt | System prompt that defines the assistant's behavior and role |
| query_contextualizer_tmpl.txt | Template for adding context to user queries |
| chunk_contextualizer_tmpl.txt | Template for contextualizing document chunks during indexing |
| image_captioning_tmpl.txt | Template for generating image descriptions using the VLM |
| hyde.txt | Hypothetical Document Embeddings (HyDE) query expansion template |
| multi_query_pmpt_tmpl.txt | Template for generating multiple query variations |

To customize prompts:

  1. Duplicate the example folder: Copy the example1 folder from ./prompts/
  2. Create your custom folder: Rename it to something meaningful, e.g., my_prompt
  3. Modify the prompts: Edit any prompt templates within your new folder
  4. Update configuration: Point to your custom prompts directory
```bash
# .env
# Use custom prompts
export PROMPTS_DIR=../prompts/my_prompt
```

| Variable | Type | Default | Description |
|---|---|---|---|
| PROMPTS_DIR | str | ../prompts/example1 | Path to the directory containing your prompt templates |

Our application uses Loguru with custom formatting. Log messages appear in two places:

  • Terminal (stderr): Human-readable formatted output
  • Log file (logs/app.json): JSON format for monitoring tools like Grafana. This file resides in the mounted folder ./logs

Terminal output follows this format:

```
LEVEL | module:function:line - message [context_key=value]
```

There are several logging levels available (TRACE, DEBUG, INFO, SUCCESS, WARNING, ERROR, CRITICAL). Only the levels intended for use in this project are documented here.

| Level | What You'll See in Logs |
|---|---|
| WARNING | Potential issues that don't stop execution: approaching rate limits, deprecated features used, retryable failures, configuration concerns. Review these periodically. |
| DEBUG | Detailed diagnostic information including variable states, intermediate processing steps, and function entry/exit points. Useful during development and troubleshooting. |
| INFO | Standard operational messages showing normal application behavior: server startup, request handling, major workflow stages. This is the typical production level. |

Set the logging level via environment variable:

```bash
# .env
# Show only warnings and errors
LOG_LEVEL=WARNING
# Show detailed debug information (use in dev and pre-prod)
LOG_LEVEL=DEBUG
# Production default (informational messages)
LOG_LEVEL=INFO
```
The JSON log file is managed as follows:

  • Rotation: Files rotate automatically at 10 MB
  • Retention: Logs kept for 10 days
  • Format: JSON for easy parsing and ingestion into monitoring systems
  • Async: Queued writing (enqueue=True) prevents blocking operations

Ray is used for distributed task processing and parallel execution in the RAG pipeline. This configuration controls resource allocation, concurrency limits, and serving options.

| Variable | Type | Default | Description |
|---|---|---|---|
| RAY_POOL_SIZE | int | 1 | Number of serializer actor instances (typically 1 actor per cluster node) |
| RAY_MAX_TASKS_PER_WORKER | int | 8 | Maximum number of concurrent tasks (serialization tasks) per serializer actor instance |
| RAY_DASHBOARD_PORT | int | 8265 | Ray Dashboard port used for monitoring. In production, comment out this line to avoid exposing the port, as it may introduce security vulnerabilities. |

| Variable | Type | Value | Description |
|---|---|---|---|
| RAY_DEDUP_LOGS | number | 0 | Turns off Ray log deduplication that appears across multiple processes. Set to 0 to see all logs from each process. |
| RAY_ENABLE_RECORD_ACTOR_TASK_LOGGING | number | 1 | Enables logs at task level in the Ray dashboard for better debugging and monitoring. |
| RAY_task_retry_delay_ms | number | 3000 | Delay (in milliseconds) before retrying a failed task. Controls the wait time between retry attempts. |
| RAY_ENABLE_UV_RUN_RUNTIME_ENV | number | 0 | Controls UV runtime environment integration. Critical: must be set to 0 when using the newest version of UV to avoid compatibility issues. |

| Variable | Type | Default | Description |
|---|---|---|---|
| RAY_MAX_TASK_RETRIES | int | 2 | Number of retry attempts for failed tasks |
| INDEXER_SERIALIZE_TIMEOUT | int | 36000 | Timeout in seconds for serialization operations (10 hours) |

Controls the maximum number of concurrent operations for different indexer tasks:

| Variable | Type | Default | Description |
|---|---|---|---|
| INDEXER_DEFAULT_CONCURRENCY | int | 1000 | Default concurrency limit for general operations |
| INDEXER_UPDATE_CONCURRENCY | int | 100 | Maximum concurrent document update operations |
| INDEXER_SEARCH_CONCURRENCY | int | 100 | Maximum concurrent search/retrieval operations |
| INDEXER_DELETE_CONCURRENCY | int | 100 | Maximum concurrent document deletion operations |
| INDEXER_CHUNK_CONCURRENCY | int | 1000 | Maximum concurrent document chunking operations |
| INDEXER_INSERT_CONCURRENCY | int | 10 | Maximum concurrent document insertion operations |

| Variable | Type | Default | Description |
|---|---|---|---|
| RAY_SEMAPHORE_CONCURRENCY | int | 100000 | Global concurrency limit for Ray semaphore operations |

Ray Serve enables deploying the FastAPI application as a scalable service. For simple deployments that do not need to scale, you can use the uvicorn deployment mode.

| Variable | Type | Default | Description |
|---|---|---|---|
| ENABLE_RAY_SERVE | bool | false | Enable Ray Serve deployment mode |
| RAY_SERVE_NUM_REPLICAS | int | 1 | Number of service replicas for load balancing |
| RAY_SERVE_HOST | str | 0.0.0.0 | Host address for the Ray Serve deployment |
| RAY_SERVE_PORT | int | 8080 | Port for the Ray Serve FastAPI endpoint |
| CHAINLIT_PORT | int | 8090 | Port for the Chainlit UI when Ray Serve is enabled (ENABLE_RAY_SERVE). Otherwise, the Chainlit UI is served as a subroute (/chainlit, see this) of the FastAPI base URL. |
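
For example, to serve the API through Ray Serve with two replicas (values are illustrative):

```bash
# .env: Ray Serve deployment (illustrative values)
ENABLE_RAY_SERVE=true
RAY_SERVE_NUM_REPLICAS=2
RAY_SERVE_PORT=8080
CHAINLIT_PORT=8090   # Chainlit gets its own port when Ray Serve is enabled
```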

The map & reduce mechanism processes documents by fetching chunks (map phase), filtering out irrelevant ones and summarizing relevant content (reduce phase) with respect to the user’s query. The algorithm works as follows:

  1. Initially fetches a batch of documents for processing
  2. Evaluates relevance and continues expanding the search if needed
  3. Stops expansion when the last MAP_REDUCE_EXPANSION_BATCH_SIZE chunks are all irrelevant
  4. Otherwise, continues fetching additional documents up to MAP_REDUCE_MAX_TOTAL_DOCUMENTS

When MAP_REDUCE_DEBUG is enabled, the mechanism logs detailed information to ./logs/map_reduce.md.
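
With the default values in the table below, this means the first map pass covers 10 chunks; as long as the most recent batch still contains relevant chunks, 5 more are fetched, and the process stops once the last 5 fetched chunks are all judged irrelevant or 20 chunks have been processed in total.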

| Variable | Type | Default | Description |
|---|---|---|---|
| MAP_REDUCE_INITIAL_BATCH_SIZE | int | 10 | Number of documents to process in the initial mapping phase |
| MAP_REDUCE_EXPANSION_BATCH_SIZE | int | 5 | Number of additional documents to fetch when expanding the search (also used as the threshold for stopping) |
| MAP_REDUCE_MAX_TOTAL_DOCUMENTS | int | 20 | Maximum total number of documents (chunks) to process across all iterations |
| MAP_REDUCE_DEBUG | bool | true | Enable debug logging for map & reduce operations. Logs are written to ./logs/map_reduce.md |

By default, our API (FastAPI) uses uvicorn for deployment. You can opt in to Ray Serve for scalability (see the Ray Serve configuration above).

The following environment variables configure the FastAPI server and control access permissions:

| Variable | Type | Default | Description |
|---|---|---|---|
| APP_PORT | number | 8000 | Port number on which the FastAPI application listens for incoming requests. |
| AUTH_TOKEN | string | EMPTY | Authentication token required to access protected API endpoints. By default, this token corresponds to the API key of the created admin (see Admin Bootstrapping). If left empty, authentication is disabled. |
| SUPER_ADMIN_MODE | boolean | false | Enables super admin privileges when set to true, granting unrestricted access to all operations and bypassing standard access controls. Intended for debugging only. |
| API_NUM_WORKERS | int | 1 | Number of uvicorn workers |

| Variable | Type | Default | Description |
|---|---|---|---|
| INCLUDE_CREDENTIALS | boolean | false | If authentication is … |
| INDEXERUI_PORT | number | 8060 | Port number on which the Indexer UI application runs. Default is 8060 (documentation mentions 3042 as another common default). |
| INDEXERUI_URL | string | http://X.X.X.X:INDEXERUI_PORT | Base URL of the Indexer UI. Required to prevent CORS issues. Replace X.X.X.X with localhost (local) or your server IP, and INDEXERUI_PORT with the actual port. |
| API_BASE_URL | string | http://X.X.X.X:APP_PORT | Base URL of your FastAPI backend, used by the frontend to communicate with the API. Replace X.X.X.X with localhost (local) or your server IP, and APP_PORT with your FastAPI port. |
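
For a local deployment, these URLs typically resolve to localhost with the default ports, for example:

```bash
# .env: local frontend/API URLs (example for a local setup)
APP_PORT=8000
INDEXERUI_PORT=8060
INDEXERUI_URL=http://localhost:8060
API_BASE_URL=http://localhost:8000
```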

See this for Chainlit authentication and this for Chainlit data persistence.