Document Relationships & Linked Files
This document explains how document relationships are modeled in OpenRAG, providing a flexible system for connecting related documents and preserving structural hierarchies.
1. Overview
OpenRAG provides a general-purpose relationship system that allows clients to model connections between documents. The system is agnostic to specific use cases and can represent various relationship types, including email threads, chat conversations, folder structures, and more.
The relationship system supports two core patterns:
- Relationship-based grouping: Groups related documents together using a shared identifier
- Hierarchical parent-child links: Tracks parent-child relationships within groups
Relationship Fields
Two fields enable flexible relationship modeling:
- relationship_id: A shared identifier for documents that belong together
  - Can represent any logical grouping (thread ID, folder path, project identifier, etc.)
  - Documents sharing the same relationship_id are considered related
  - Optional and client-defined based on use case
- parent_id: Tracks hierarchical relationships by pointing to the direct parent document's ID
  - References the parent document's file_id
  - Enables recursive traversal from any document back to the root
  - Can be used in combination with relationship_id
  - Optional and client-defined based on use case
Key Principle: The system does not enforce any specific relationship semantics. Clients are responsible for defining what relationships mean and how to use these fields to model their domain-specific structures.
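The recursive traversal that parent_id enables can be sketched as follows. This is a minimal illustration under stated assumptions, not OpenRAG's implementation: the in-memory `PARENTS` mapping and the `ancestors` helper are hypothetical names, and a real client would read parent_id values from wherever it stores file metadata.

```python
from typing import Optional

# Hypothetical in-memory index: file_id -> parent_id (None marks a root).
PARENTS: dict[str, Optional[str]] = {
    "email-a": None,
    "email-b": "email-a",
    "email-c": "email-b",
}


def ancestors(file_id: str, parents: dict[str, Optional[str]]) -> list[str]:
    """Return ancestor IDs from the direct parent up to the root."""
    chain: list[str] = []
    seen = {file_id}
    current = parents.get(file_id)
    while current is not None:
        # Client-supplied links are not validated here, so guard against loops.
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        chain.append(current)
        seen.add(current)
        # A parent_id that is not itself indexed ends the walk.
        current = parents.get(current)
    return chain


print(ancestors("email-c", PARENTS))
```

Because the semantics are client-defined, the cycle check is a defensive choice in this sketch rather than something the source states the system guarantees.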
2. Data Model
File Model Extensions
See the SQL File model extended with relationship_id and parent_id.
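For readers without the model definition at hand, the shape of the two extra fields can be pictured with a small stand-in. The `FileRecord` dataclass below is a hypothetical illustration, not the actual SQL File model; it only assumes what the text states, namely that both fields are optional and that parent_id holds another file's file_id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Illustrative stand-in for a file row; not the actual OpenRAG model."""

    file_id: str
    relationship_id: Optional[str] = None  # shared, client-defined group identifier
    parent_id: Optional[str] = None  # file_id of the direct parent, if any


root = FileRecord(file_id="email-a", relationship_id="thread-abc123")
reply = FileRecord(
    file_id="email-b", relationship_id="thread-abc123", parent_id="email-a"
)
standalone = FileRecord(file_id="report-1")  # neither field is required
```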
Examples
The relationship system is flexible enough to model various real-world scenarios. Here are some common examples:
Email Threads
Use Case:
Email conversations form hierarchical reply chains where each message references the one it's replying to. Preserving this structure enables context-aware retrieval of entire conversations.
Modeling:
- relationship_id = email thread ID (from mail server)
- parent_id = ID of the email being replied to
Example:
Email A (original)
├── relationship_id: "thread-abc123"
├── parent_id: null
├── file_id: "email-a"
└── Email B (reply)
    ├── relationship_id: "thread-abc123"
    ├── parent_id: "email-a"
    ├── file_id: "email-b"
    └── Email C (reply to B)
        ├── relationship_id: "thread-abc123"
        ├── parent_id: "email-b"
        └── file_id: "email-c"

Benefits:
- Retrieve entire conversation thread as a unit
- Navigate from any reply back to the original message
- Context-aware search expands single email results to full threads
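Expanding a single email hit to its full thread can be sketched like this. The `FILES` rows and the `expand_to_groups` helper are hypothetical and operate on in-memory dictionaries; they show the grouping logic only and are not the OpenRAG search API.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical metadata rows; the first three mirror the thread shown above.
FILES: list[dict[str, Optional[str]]] = [
    {"file_id": "email-a", "relationship_id": "thread-abc123", "parent_id": None},
    {"file_id": "email-b", "relationship_id": "thread-abc123", "parent_id": "email-a"},
    {"file_id": "email-c", "relationship_id": "thread-abc123", "parent_id": "email-b"},
    {"file_id": "memo-1", "relationship_id": None, "parent_id": None},
]


def expand_to_groups(
    hit_ids: list[str], files: list[dict[str, Optional[str]]]
) -> list[str]:
    """Replace each hit with every file sharing its relationship_id."""
    by_id = {f["file_id"]: f for f in files}
    groups: dict[str, list[str]] = defaultdict(list)
    for f in files:
        if f["relationship_id"] is not None:
            groups[f["relationship_id"]].append(f["file_id"])

    expanded: list[str] = []
    for hit in hit_ids:
        rel = by_id[hit]["relationship_id"]
        # Files without a relationship_id stay as single results.
        members = groups[rel] if rel is not None else [hit]
        for file_id in members:
            if file_id not in expanded:
                expanded.append(file_id)
    return expanded


print(expand_to_groups(["email-b"], FILES))
```

In this sketch the group members come back in the order they appear in `FILES`; ordering a thread by reply depth would additionally use parent_id.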
Folder-Based Organization
Use Case:
Files stored in the same folder are conceptually related and should be retrievable as a group to enrich the context for RAG.
Modeling Option 1 - Flat Grouping:
- relationship_id = normalized folder path (e.g., documents/projects/2024)
- parent_id = not used (files are peers in the same folder)
Example:
Documents/2024/Q1/
├── Report.pdf
│   ├── relationship_id: "documents/2024/q1"
│   └── parent_id: null
├── Budget.xlsx
│   ├── relationship_id: "documents/2024/q1"
│   └── parent_id: null
└── Notes.md
    ├── relationship_id: "documents/2024/q1"
    └── parent_id: null

Modeling Option 2 - Nested Folder Hierarchy:
- relationship_id = root folder identifier (shared across all nested files)
- parent_id = immediate parent folder's identifier
Example:
FolderA/
├── fileA
│   ├── relationship_id: "folderA"
│   └── parent_id: "folderA"
├── fileB
│   ├── relationship_id: "folderA"
│   └── parent_id: "folderA"
├── folderA.1/
│   └── fileA.1
│       ├── relationship_id: "folderA"
│       └── parent_id: "folderA.1"
└── folderA.2/
    └── fileA.2
        ├── relationship_id: "folderA"
        └── parent_id: "folderA.2"

- All files share relationship_id: "folderA" (grouped under the root folder)
- Each file's parent_id points to its immediate parent folder.
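Both folder options can be derived from a file's relative path on the client side. The sketch below is one possible approach, with assumptions stated plainly: `flat_metadata` and `nested_metadata` are hypothetical helpers, the lowercasing rule is inferred from the Option 1 example rather than specified by OpenRAG, and Option 2 takes folder names verbatim as identifiers.

```python
from pathlib import PurePosixPath
from typing import Optional


def flat_metadata(file_path: str) -> dict[str, Optional[str]]:
    """Option 1: group by normalized folder path; files are peers."""
    folder = PurePosixPath(file_path.replace("\\", "/")).parent
    # Assumed normalization: forward slashes and lowercase.
    return {"relationship_id": str(folder).lower(), "parent_id": None}


def nested_metadata(file_path: str) -> dict[str, Optional[str]]:
    """Option 2: share the root folder as the group, link to the immediate folder."""
    parts = PurePosixPath(file_path.replace("\\", "/")).parts
    if len(parts) < 2:
        raise ValueError("expected a relative path of the form root/.../file")
    # First component is the root folder; the one before the file is its parent.
    return {"relationship_id": parts[0], "parent_id": parts[-2]}


print(flat_metadata("Documents/2024/Q1/Report.pdf"))
print(nested_metadata("folderA/folderA.1/fileA.1"))
```

Note that in Option 2 the parent_id names a folder, so it only resolves to a file_id if the client also indexes folders under those identifiers.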
3. API Endpoints
- Indexation endpoints: refer to the documents indexing section
- For search with related documents, refer to this section
- To fetch documents that share a relationship, or the ancestors of a given file, refer to the Partition and File section
4. Usage with the RAG Pipeline
Document relationships enter the RAG pipeline at an optional expansion step, which runs after the initial search and any reranking:
flowchart TD
A[Chat Message] --> B[Generate Query]
B --> C[Hybrid Search]
C --> D{Reranker Enabled?}
D -->|Yes| E[Rerank Documents]
D -->|No| F{Expansion Enabled?}
E --> F
F -->|Yes| G[Expand with Related Docs]
F -->|No| J[Format Context]
G --> H{Docs Expanded?}
H -->|Yes| I{Reranker Enabled?}
H -->|No| J
I -->|Yes| K[Rerank Again]
I -->|No| J
K --> J
J --> L[LLM Response]
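The control flow between the search and context-formatting steps of the flowchart can be sketched in Python. This is an illustration under assumptions, not OpenRAG's pipeline code: `build_context` and its callable parameters are hypothetical, a step counts as enabled when its callable is provided, and "Docs Expanded?" is interpreted as the expansion having added at least one new document.

```python
from typing import Callable, Optional

Docs = list[str]


def build_context(
    query: str,
    search: Callable[[str], Docs],
    expand: Optional[Callable[[Docs], Docs]] = None,
    rerank: Optional[Callable[[str, Docs], Docs]] = None,
) -> Docs:
    """Return the documents that would be passed on to context formatting."""
    docs = search(query)  # Hybrid Search
    if rerank is not None:  # Reranker Enabled?
        docs = rerank(query, docs)
    if expand is not None:  # Expansion Enabled?
        expanded = expand(docs)
        if set(expanded) - set(docs):  # Docs Expanded?
            docs = expanded
            if rerank is not None:  # Rerank Again
                docs = rerank(query, docs)
    return docs


# Toy stand-ins for the real search, expansion, and reranking components.
result = build_context(
    "budget question",
    search=lambda q: ["email-b"],
    expand=lambda docs: ["email-c", "email-a", "email-b"],
    rerank=lambda q, docs: sorted(docs),
)
print(result)
```

The second rerank only runs when expansion actually added documents, which avoids reranking an unchanged list twice.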