Skip to content

Data Model

OpenRAG uses a dual-database architecture:

  • PostgreSQL for metadata (users, partitions, files, access control)
  • Milvus for content (document chunks, embeddings, vector search)

Implemented using SQLAlchemy ORM with PostgreSQL as the backend.

erDiagram
    partitions ||--o{ files : contains
    partitions ||--o{ partition_memberships : has
    users ||--o{ partition_memberships : belongs_to

    partitions {
        int id PK
        varchar partition UK
        datetime created_at
    }

    files {
        int id PK
        varchar file_id
        varchar partition_name FK
        json file_metadata
    }

    users {
        int id PK
        varchar external_user_id UK
        varchar display_name
        varchar token UK
        boolean is_admin
        datetime created_at
        int file_quota
        int file_count
    }

    partition_memberships {
        int id PK
        varchar partition_name FK
        int user_id FK
        varchar role
        datetime added_at
    }

Stores information about API users and administrators.

ColumnTypeDescription
idInteger (PK)Unique user identifier
external_user_idString (nullable, unique)Optional external system reference
display_nameStringDisplay name
tokenString (unique, hashed)SHA-256 hash of the user’s API token
is_adminBooleanMarks system administrator users
created_atDateTimeTimestamp of creation
file_quotaInteger (nullable)Max files allowed for that user
file_countInteger (default=0)Number of uploaded files

Relationships: memberships one-to-many → PartitionMembership


Represents a logical workspace or “space” that groups files and users.

ColumnTypeDescription
idInteger (PK)Unique partition identifier
partitionString (unique, indexed)Human-readable name / key
created_atDateTimeTimestamp of creation

Relationships:

  • files one-to-many → File
  • memberships one-to-many → PartitionMembership

Represents an indexed file belonging to a partition.

ColumnTypeDescription
idInteger (PK)Internal file identifier
file_idString (indexed)External file identifier (e.g., hash or ID)
partition_nameString (FK → partitions.partition)Partition that owns the file
file_metadataJSONAdditional metadata (format, size, etc.)
relationship_idString (nullable, indexed)Groups related documents (e.g., email thread ID, folder path)
parent_idString (nullable, indexed)Points to hierarchical parent (e.g., parent email)

Indexes:

  • ix_relationship_partition (relationship_id, partition_name) — enables efficient relationship queries
  • ix_parent_partition (parent_id, partition_name) — enables efficient ancestor traversal

Constraints:

  • UniqueConstraint(file_id, partition_name) → a file can appear only once per partition
  • Composite index ix_partition_file (partition_name, file_id) for efficient queries

Defines the many-to-many relationship between users and partitions with role-based access control.

ColumnTypeDescription
idInteger (PK)Unique row ID
partition_nameString (FK → partitions.partition, CASCADE)Partition identifier
user_idInteger (FK → users.id, CASCADE)Linked user
roleStringRole: owner, editor, or viewer
added_atDateTimeTimestamp of membership creation

Constraints:

  • UniqueConstraint(partition_name, user_id) → a user can appear only once per partition
  • CheckConstraint(role IN ('owner','editor','viewer')) → role validation
  • Composite index ix_user_partition (user_id, partition_name)

Milvus stores document chunks with their vector embeddings. The collection uses dynamic fields for flexible metadata.

erDiagram
    CHUNK {
        int64 _id PK "Auto-generated"
        varchar text "Full-text search enabled"
        float_vector vector "Dense HNSW index"
        sparse_float_vector sparse "BM25 sparse index"
        varchar partition "Partition key"
        varchar file_id "Inverted index"
    }

    DYNAMIC_FIELDS {
        int page "Page number"
        string source "Source path"
        string filename "Display name"
        string chunk_type "text OR table OR image"
        int section_id "Chunk navigation ID"
        int prev_section_id "Previous chunk link"
        int next_section_id "Next chunk link"
    }

    CHUNK ||--o{ DYNAMIC_FIELDS : "has dynamic"

Indexes:

  • HNSW on vector field - Fast approximate nearest neighbor search with cosine similarity
  • BM25 on text field - Keyword-based sparse retrieval
  • Inverted on file_id - Fast filtering by file
  • Partition key on partition - Automatic data isolation by tenant

The two databases are linked by shared identifiers: file_id and partition exist in both systems.

flowchart LR
    subgraph PG["PostgreSQL"]
        F[files]
        PM[partition_memberships]
    end

    subgraph MV["Milvus"]
        C[Chunks]
    end

    F -- "file_id" --> C
    F -- "partition" --> C
    PM -- "partition" --> C
DataPostgreSQLMilvusRationale
Partition metadata-Referential integrity, access control
File inventory-Single source of truth for uploaded files
User accounts & roles-Authentication, ACID compliance
Document chunks-Optimized for vector operations
Dense embeddings-HNSW similarity search
Sparse embeddings-BM25 keyword matching

flowchart LR
    A[Upload] --> B{File exists?}
    B -->|Yes| C[Reject duplicate]
    B -->|No| D[Chunk document]
    D --> E[Generate embeddings]
    E --> F[(Milvus: Insert chunks)]
    F --> G[(PostgreSQL: Record file)]
    G --> H[Done]
flowchart LR
    A[Delete request] --> B[(Milvus: Delete chunks)]
    B --> C[(PostgreSQL: Remove file record)]
    C --> D[Done]
flowchart LR
    A[Query] --> B[(PostgreSQL: Check access)]
    B -->|Authorized| C[(Milvus: Hybrid search)]
    C --> D[Rerank results]
    D --> E[Return chunks]
    B -->|Denied| F[403 Forbidden]

  • Roles (owner, editor, viewer) determine what users can do in each partition
  • is_admin users are privileged globally (admin endpoints, user management)
  • SUPER_ADMIN_MODE=true allows the global admin to bypass all partition-level restrictions

Limits the number of files a user can upload (indexed files + pending tasks).

  • Admins always have unlimited quota and can update quota for a given user
  • DEFAULT_FILE_QUOTA < 0 to disable quota checking (e.g., default -1)
  • DEFAULT_FILE_QUOTA >= 0 to set a default quota for all users (note: 0 means users may upload zero files — quota is still enforced)

The default value DEFAULT_FILE_QUOTA is -1, meaning that file quota checking is bypassed.


  • Tokens are generated at user creation time (or-<random hex>)
  • Only a SHA-256 hash is stored in the database
  • During authentication, the incoming Bearer token is hashed and compared with the stored hash