Skip to content

πŸ”— Document Relationships & Linked Files

This document explains how document relationships are modeled in OpenRAG, providing a flexible system for connecting related documents and preserving structural hierarchies.

OpenRAG provides a general-purpose relationship system that allows clients to model connections between documents. This system is agnostic to specific use cases and can represent various relationship types including email threads, chat conversations, folder structures and more.

The relationship system supports two core patterns:

  • Relationship-based grouping: Groups related documents together using a shared identifier
  • Hierarchical parent-child links: Tracks parent-child relationships within groups

Two fields enable flexible relationship modeling:

  • relationship_id: A shared identifier for documents that belong together

    • Can represent any logical grouping (thread ID, folder path, project identifier, etc.)
    • Documents sharing the same relationship_id are considered related
    • Optional and client-defined based on use case
  • parent_id: Tracks hierarchical relationships by pointing to the direct parent document’s ID

    • References the parent document’s file_id
    • Enables recursive traversal from any document back to the root
    • Can be used in combination with relationship_id
    • Optional and client-defined based on use case

Key Principle: The system does not enforce any specific relationship semantics. Clients are responsible for defining what relationships mean and how to use these fields to model their domain-specific structures.

See the SQL File model extended with relationship_id and parent_id.

The relationship system is flexible enough to model various real-world scenarios. Here are some common examples:

Use Case:
Email conversations form hierarchical reply chains where each message references the one it’s replying to. Preserving this structure enables context-aware retrieval of entire conversations.

Modeling:

  • relationship_id = email thread ID (from mail server)
  • parent_id = ID of the email being replied to

Example:

Email A (original)
β”œβ”€β”€ relationship_id: "thread-abc123"
β”œβ”€β”€ parent_id: null
└── file_id: "email-a"
└── Email B (reply)
β”œβ”€β”€ relationship_id: "thread-abc123"
β”œβ”€β”€ parent_id: "email-a"
└── file_id: "email-b"
└── Email C (reply to B)
β”œβ”€β”€ relationship_id: "thread-abc123"
β”œβ”€β”€ parent_id: "email-b"
└── file_id: "email-c"

Benefits:

  • Retrieve entire conversation thread as a unit
  • Navigate from any reply back to the original message
  • Context-aware search expands single email results to full threads

Use Case:
Files stored in the same folder are conceptually related and should be retrievable as a group to enrich the context for RAG.

Modeling Option 1 - Flat Grouping:

  • relationship_id = normalized folder path (e.g., documents/projects/2024)
  • parent_id = not used (files are peers in the same folder)

Example:

Documents/2024/Q1/
β”œβ”€β”€ Report.pdf
β”‚ β”œβ”€β”€ relationship_id: "documents/2024/q1"
β”‚ └── parent_id: null
β”œβ”€β”€ Budget.xlsx
β”‚ β”œβ”€β”€ relationship_id: "documents/2024/q1"
β”‚ └── parent_id: null
└── Notes.md
β”œβ”€β”€ relationship_id: "documents/2024/q1"
└── parent_id: null

Modeling Option 2 - Nested Folder Hierarchy:

  • relationship_id = root folder identifier (shared across all nested files)
  • parent_id = immediate parent folder’s identifier

Example:

FolderA/
β”œβ”€β”€ fileA
β”‚ β”œβ”€β”€ relationship_id: "folderA"
β”‚ └── parent_id: "folderA"
β”œβ”€β”€ fileB
β”‚ β”œβ”€β”€ relationship_id: "folderA"
β”‚ └── parent_id: "folderA"
β”œβ”€β”€ folderA.1/
β”‚ └── fileA.1
β”‚ β”œβ”€β”€ relationship_id: "folderA"
β”‚ └── parent_id: "folderA.1"
└── folderA.2/
└── fileA.2
β”œβ”€β”€ relationship_id: "folderA"
└── parent_id: "folderA.2"
  • All files share relationship_id: "folderA" (grouped under the root folder)
  • Each file’s parent_id points to its immediate parent folder.

Document relationships integrate seamlessly into the RAG pipeline workflow:

flowchart TD
    A[Chat Message] --> B[Generate Query]
    B --> C[Hybrid Search]
    C --> D{Reranker Enabled?}
    D -->|Yes| E[Rerank Documents]
    D -->|No| F{Expansion Enabled?}
    E --> F
    F -->|Yes| G[Expand with Related Docs]
    F -->|No| J[Format Context]
    G --> H{Docs Expanded?}
    H -->|Yes| I{Reranker Enabled?}
    H -->|No| J
    I -->|Yes| K[Rerank Again]
    I -->|No| J
    K --> J
    J --> L[LLM Response]