Skip to content

Latest commit

 

History

History
1074 lines (899 loc) · 51 KB

File metadata and controls

1074 lines (899 loc) · 51 KB

Boost Data Collector - Schema

Overview

The Boost Data Collector is a Django project backed by a single database (boost_dashboard). All sub-apps use this same database: they do not use separate databases or schema-based isolation. Tables from different apps are linked by relationships (e.g. foreign keys across apps), so data can be joined and reused. Reference data (e.g. Language, License) is defined and owned by one app; other apps reference it via foreign keys rather than duplicating it. The diagrams below show these shared base tables and app-specific tables and how they connect.

Entity Relationship Diagrams

Legend: PK = Primary Key, FK = Foreign Key, UK = Unique Key, IX = Index


1. Base tables, Identity, and profiles

erDiagram
    direction TB
    Email }o-- || BaseProfile : "has"
    BaseProfile ||--o| GitHubAccount : "extends"
    BaseProfile ||--o| SlackUser : "extends"
    BaseProfile ||--o| MailingListProfile : "extends"
    BaseProfile ||--o| WG21PaperAuthorProfile : "extends"
    BaseProfile ||--o| YoutubeSpeaker : "extends"
    BaseProfile ||--o| DiscordProfile : "extends"
    Identity }o--|| BaseProfile  : "has"
    TempProfileIdentityRelation ||--o{ BaseProfile  : "has"
    TmpIdentity ||--o{ TempProfileIdentityRelation : "has"


    BaseProfile {
        int id PK "auto-increment"
        int identity_id FK
        enum type
    }

    Email {
        int id PK
        int base_profile_id FK
        string email "IX"
        boolean is_primary
        boolean is_active
        datetime created_at
        datetime updated_at
    }

    GitHubAccount {
        bigint github_account_id "IX"
        string username "IX"
        string display_name "IX"
        string avatar_url
        enum type
        datetime created_at
        datetime updated_at
    }

    SlackUser {
        string slack_user_id "IX"
        string username "IX"
        string display_name "IX"
        string avatar_url
        datetime created_at
        datetime updated_at
    }

    MailingListProfile {
        string display_name "IX"
        datetime created_at
        datetime updated_at
    }

    WG21PaperAuthorProfile {
        string display_name "IX"
        string author_alias "IX"
        datetime created_at
        datetime updated_at
    }

    YoutubeSpeaker {
        string display_name "IX"
        datetime created_at
        datetime updated_at
    }

    DiscordProfile {
        bigint discord_user_id "UK IX"
        string username "IX"
        string display_name "IX"
        string avatar_url
        boolean is_bot
        datetime created_at
        datetime updated_at
    }

    Identity {
        int id PK
        string display_name "IX"
        text description
        datetime created_at
        datetime updated_at
    }

    TmpIdentity {
        int id PK
        string display_name "IX"
        text description
        datetime created_at
        datetime updated_at
    }

    TempProfileIdentityRelation {
        int id PK
        int base_profile_id FK
        int target_identity_id FK
        datetime created_at
        datetime updated_at
    }
Loading

Note: Each extended table has id as primary key and foreign key to BaseProfile.id. The value is the same: one auto-increment in BaseProfile, and that same id is stored in exactly one extended profile row. Other tables (e.g. GitCommit, Issue) reference the profile via this single id. DiscordProfile (in cppa_user_tracker) is the author profile for DiscordMessage rows in discord_activity_tracker (author_idDiscordProfile.id).

Note: The Email table references BaseProfile via base_profile_id (FK to BaseProfile.id). One profile can have multiple email addresses; is_primary marks the primary email; is_active indicates whether the email is currently active. Other tables (e.g. MailingListMessage) can link to a profile via Email. Note: The email field is not unique; the same email address may appear in multiple rows (e.g. for different profiles or over time).

Note: The type field is a PostgreSQL enum (or equivalent) with values: github, slack, mailing_list, wg21, discord, youtube. It identifies which extended table the row belongs to.

Note: In GitHubAccount, the type field is an enum with values: user, organization, enterprise (identifies whether the GitHub account is a user, organization, or enterprise).

Note: BaseProfile references Identity via identity_id (FK to Identity.id). One identity can have multiple BaseProfiles (e.g. one person with GitHub and Slack). Identity, TmpIdentity, and TempProfileIdentityRelation are used by the CPPA User Tracker: Identity holds the canonical user/account; TmpIdentity and TempProfileIdentityRelation stage temporary profile-to-identity relations (e.g. base_profile_id, target_identity_id) before merging.

2. GitHub Activity Tracker

Part 1: GitHub Account and Repository

GitHubRepository is the base table with all repository fields. GitHubAccount owns GitHubRepository; RepoLanguage and RepoLicense reference the repository.

erDiagram
    Direction LR
    GitHubAccount ||--o{ GitHubRepository : "owns"
    GitHubRepository ||--o{ RepoLanguage : "has"
    GitHubRepository ||--o{ RepoLicense : "has"
    RepoLanguage }o--|| Language : "used_in"
    Language ||--o{ CreatedReposByLanguage : "yearly_stats"
    License ||--o{ RepoLicense : "used_in"

    GitHubRepository {
        int id PK
        int owner_account_id FK
        string repo_name "IX"
        int stars
        int forks
        text description
        datetime repo_pushed_at "IX"
        datetime repo_created_at "IX"
        datetime repo_updated_at "IX"
    }

    Language {
        int id PK
        string name UK "IX"
        datetime created_at
    }

    License {
        int id PK
        string name UK "IX"
        string spdx_id "IX"
        string url
        datetime created_at
    }

    RepoLanguage {
        int id PK
        int repo_id FK
        int language_id FK
        int line_count
        datetime created_at
        datetime updated_at
    }

    RepoLicense {
        int id PK
        int repo_id FK
        int license_id FK
        datetime created_at
    }

    CreatedReposByLanguage {
        int id PK
        int language_id FK
        int year "IX"
        int all_repos
        int significant_repos
        datetime created_at
        datetime updated_at
    }
Loading

Note: GitHubRepository is the base table with all repository fields.

Note: Composite unique constraints should be applied on: (owner_account_id, repo_name) in GitHubRepository, (repo_id, language_id) in RepoLanguage, (repo_id, license_id) in RepoLicense, (language_id, year) in CreatedReposByLanguage.

Part 2: Git Commit and Issues

erDiagram
    GitHubAccount ||--o{ GitCommit : "committer"
    GitHubAccount ||--o{ Issue : "creator"
    GitHubAccount ||--o{ IssueComment : "author"
    GitHubAccount ||--o{ IssueAssignee : "assignee"
    GitHubRepositoy ||--o{ GitCommit : "contains"
    GitHubRepositoy ||--o{ Issue : "contains"
    GitHubRepositoy ||--o{ GitHubFile : "has"
    GitCommit ||--o{ GitCommitFileChange : "has"
    GitHubFile ||--o{ GitCommitFileChange : "changed_in"
    Issue ||--o{ IssueComment : "has"
    Issue ||--o{ IssueAssignee : "has"
    Issue ||--o{ IssueLabel : "has"

    GitCommit {
        bigint id PK
        int repo_id FK
        int account_id FK
        string commit_hash UK "IX"
        text comment
        datetime commit_at "IX"
    }

    GitHubFile {
        BigInt id PK
        int repo_id FK
        string filename "IX"
        boolean is_deleted
        datetime created_at
    }

    GitCommitFileChange {
        int id PK
        bigint commit_id FK
        int github_file_id FK "IX"
        enum status "IX"
        int additions
        int deletions
        text patch
        datetime created_at
    }

    Issue {
        int id PK
        int repo_id FK
        int account_id FK
        int issue_number "IX"
        bigint issue_id UK "IX"
        text title
        text body
        enum state "IX"
        enum state_reason
        datetime issue_created_at "IX"
        datetime issue_updated_at "IX"
        datetime issue_closed_at "IX"
    }

    IssueComment {
        int id PK
        int issue_id FK
        int account_id FK
        bigint issue_comment_id UK "IX"
        text body
        datetime issue_comment_created_at
        datetime issue_comment_updated_at
    }

    IssueAssignee {
        int id PK
        int issue_id FK
        int account_id FK
        datetime created_at
    }

    IssueLabel {
        int id PK
        int issue_id FK
        string label_name "IX"
        datetime created_at
    }
Loading

Note: Enum types:

  • GitCommitFileChange.status: file_change_status enum with values: added, modified, removed, renamed, copied, changed
  • Issue.state: issue_state enum with values: open, closed
  • Issue.state_reason: issue_state_reason enum with values: completed, not_planned, reopened, null

Note: Composite unique constraints should be applied on: (repo_id, commit_hash) in GitCommit, (commit_id, github_file_id) in GitCommitFileChange, (repo_id, issue_number) in Issue, (issue_id, account_id) in IssueAssignee, (issue_id, label_name) in IssueLabel.

Part 3: Pull Requests

erDiagram
    GitHubAccount ||--o{ PullRequest : "creator"
    GitHubAccount ||--o{ PullRequestReview : "reviewer"
    GitHubAccount ||--o{ PullRequestComment : "author"
    GitHubAccount ||--o{ PullRequestAssignee : "assignee"
    GitHubRepository ||--o{ PullRequest : "contains"
    PullRequest ||--o{ PullRequestReview : "has"
    PullRequest ||--o{ PullRequestComment : "has"
    PullRequest ||--o{ PullRequestAssignee : "has"
    PullRequest ||--o{ PullRequestLabel : "has"

    PullRequest {
        int id PK
        int repo_id FK
        int account_id FK
        int pr_number "IX"
        bigint pr_id UK "IX"
        text title
        text body
        enum state "IX"
        string head_hash "IX"
        string base_hash "IX"
        datetime pr_created_at "IX"
        datetime pr_updated_at "IX"
        datetime pr_merged_at "IX"
        datetime pr_closed_at "IX"
    }

    PullRequestReview {
        int id PK
        int pr_id FK
        int account_id FK
        bigint pr_review_id UK "IX"
        text body
        bigint in_reply_to_id
        datetime pr_review_created_at "IX"
        datetime pr_review_updated_at "IX"
    }

    PullRequestComment {
        int id PK
        int pr_id FK
        int account_id FK
        bigint pr_comment_id UK "IX"
        text body
        datetime pr_comment_created_at "IX"
        datetime pr_comment_updated_at "IX"
    }

    PullRequestAssignee {
        int id PK
        int pr_id FK
        int account_id FK
        datetime created_at
    }

    PullRequestLabel {
        int id PK
        int pr_id FK
        string label_name "IX"
        datetime created_at
    }
Loading

Note: Enum types:

  • PullRequest.state: pull_request_state enum with values: open, closed, merged

Note: Composite unique constraints should be applied on: (repo_id, pr_number) in PullRequest, (pr_id, account_id) in PullRequestAssignee, (pr_id, label_name) in PullRequestLabel.


2b. Clang GitHub Tracker (clang_github_tracker)

Standalone tables for the llvm/llvm-project (or CLANG_GITHUB_OWNER / CLANG_GITHUB_REPO) mirror. No foreign keys to other apps.

Model Purpose
ClangGithubIssueItem One row per issue or PR number (unique). is_pull_request distinguishes types. github_created_at / github_updated_at mirror GitHub API times; github_updated_at (with Max + 1ms) drives API fetch resume. Django updated_at (auto_now) bumps on every upsert and drives Pinecone incrementality vs PineconeSyncStatus.final_sync_at.
ClangGithubCommit One row per sha (unique, 40-char hex). github_committed_at is the author/committer date used for commit fetch watermarks.

Raw JSON remains under workspace/raw/github_activity_tracker/<owner>/<repo>/ (same layout as other raw GitHub activity).


3. Boost Library Tracker

Part 1: Boost Library, Headers, and Dependencies

erDiagram
    GitHubRepository ||--o| BoostLibraryRepository : "extend"
    GitHubFile ||--o| BoostFile : "extend"
    BoostLibraryRepository ||--o{ BoostLibrary : "has"
    BoostLibrary ||--o{ BoostFile : "has"
    BoostLibrary ||--o{ BoostDependency : "client_library"
    BoostLibrary ||--o{ BoostDependency : "dep_library"
    BoostLibrary ||--o{ BoostLibraryVersion : "has"
    BoostLibrary ||--o{ DependencyChangeLog : "client_library"
    BoostLibrary ||--o{ DependencyChangeLog : "dep_library"
    BoostVersion ||--o{ BoostDependency : "version"
    BoostVersion ||--o{ BoostLibraryVersion : "version"

    BoostLibraryRepository {
        datetime created_at
        datetime updated_at
    }

    BoostLibrary {
        int id PK
        int repo_id FK
        string name "IX"
    }

    BoostFile {
        int library_id FK
    }

    BoostVersion {
        int id PK
        string version UK "IX"
        datetime version_created_at
    }

    BoostLibraryVersion {
        int id PK
        int library_id FK
        int version_id FK
        string cpp_version
        text description
        string documentation
        string key
        datetime created_at
        datetime updated_at
    }

    BoostDependency {
        int id PK
        int client_library_id FK
        int version_id FK
        int dep_library_id FK
        datetime created_at
    }

    DependencyChangeLog {
        int id PK
        int client_library_id FK
        int dep_library_id FK
        boolean is_add
        date created_at "IX"
    }
Loading

Note: BoostLibraryRepository extends GitHubRepository and adds created_at, updated_at (and any other app-specific fields); it inherits all repository fields from GitHubRepository.

Note: BoostFile extends GitHubFile and only adds library_id (inherits id, repo_id, filename, created_at, is_deleted from GitHubFile).

Note: Composite unique constraints should be applied on: (library_id, version_id) in BoostLibraryVersion, (client_library_id, version_id, dep_library_id) in BoostDependency, (client_library_id, dep_library_id, created_at) in DependencyChangeLog.

Part 2: Boost Library Versions, Maintainers, Authors, and Categories

erDiagram
    GitHubAccount ||--o{ BoostLibraryRoleRelationship : "role"
    BoostLibraryVersion ||--o{ BoostLibraryRoleRelationship : "has"
    BoostLibrary ||--o{ BoostLibraryCategoryRelationship : "has"
    BoostLibraryCategory ||--o{ BoostLibraryCategoryRelationship : "category"

    BoostLibraryRoleRelationship {
        int id PK
        int library_version_id FK
        int account_id FK
        boolean is_maintainer
        boolean is_author
        datetime created_at
        datetime updated_at
    }

    BoostLibraryCategory {
        int id PK
        string name UK "IX"
        datetime created_at
        datetime updated_at
    }

    BoostLibraryCategoryRelationship {
        int id PK
        int library_id FK
        int category_id FK
        datetime created_at
        datetime updated_at
    }
Loading

Note: Composite unique constraints should be applied on: (library_version_id, account_id) in BoostLibraryRoleRelationship, (library_id, category_id) in BoostLibraryCategoryRelationship.


4. Boost Usage Tracker

erDiagram
    direction LR
    GitHubRepository ||--o| BoostExternalRepository : "extend"
    BoostExternalRepository ||--o{ BoostUsage : "has"
    BoostUsage }o--|| "BoostFile (defined in Boost Library Tracker)" : "Boost header file"
    BoostUsage }o--|| "GitHubFile (defined in GitHub Activity Tracker)" : "current file path"
    BoostUsage ||--o{ BoostMissingHeaderTmp : "temporary missing header"

    BoostExternalRepository {
        string boost_version "IX"
        boolean is_boost_embedded
        boolean is_boost_used
        datetime created_at
        datetime updated_at
    }

    BoostUsage {
        int id PK
        int repo_id FK
        BigInt boost_header_id FK "Nullable"
        BigInt file_path_id FK
        datetime last_commit_date "IX"
        date excepted_at
        datetime created_at
        datetime updated_at
    }

    BoostMissingHeaderTmp {
        int id PK
        int usage_id FK "references BoostUsage.id"
        string header_name
        datetime created_at
    }

Loading

Note: BoostMissingHeaderTmp temporarily stores usage history when the Boost include path (header_name) does not yet exist in the Boost/GitHub file tables (e.g. BoostFile or GitHubFile). usage_id references BoostUsage.id. Once the header is added to the catalog, these records can be processed (e.g. backfilled into BoostUsage with a resolved boost_header_id) and optionally removed.

Note: BoostExternalRepository extends GitHubRepository and only adds boost_version, is_boost_embedded, is_boost_used, created_at, updated_at. Repository identity and metadata (e.g. owner, repo_name, stars, forks, description, repo_pushed_at, repo_created_at, repo_updated_at) are inherited from GitHubRepository.

Note: BoostUsage links each external repository to a Boost header file and to the file path where it is used: boost_header_id references BoostFile (defined in Boost Library Tracker; extends GitHubFile, only adds library_id) for the Boost header; file_path_id references GitHubFile (defined in GitHub Activity Tracker) for the current file path in that repo. This tracks which external repos use which Boost files and in which files they appear.

Note: A composite unique constraint should be applied on (repo_id, boost_header_id, file_path_id) in BoostUsage.

Note: BoostMissingHeaderTmp.usage_id references BoostUsage.id (FK). Consider an index on usage_id and on header_name for lookups and backfill.


5. Boost Mailing List Tracker

erDiagram
    direction LR
    MailingListProfile ||--o{ MailingListMessage : "sender"

    MailingListMessage {
        int id PK
        int sender_id FK
        string msg_id UK "IX"
        string parent_id "IX"
        string thread_id "IX"
        string subject
        text content
        string list_name "IX"
        datetime sent_at "IX"
        datetime created_at
    }
Loading

Note: MailingListProfile extends BaseProfile (section 1) and represents the mailing list user/account. sender_id in MailingListMessage references this profile.


6. Slack Activity Tracker

erDiagram
    direction LR
    SlackTeam ||--o{ SlackChannel : "has"
    SlackChannel ||--o{ SlackChannelMembershipChangeLog : "has"
    SlackChannel ||--o{ SlackChannelMembership : "has"
    SlackChannel ||--o{ SlackMessage : "contains"
    SlackUser ||--o{ SlackMessage : "author"
    SlackUser ||--o{ SlackChannelMembership : "member"
    SlackUser ||--o{ SlackChannelMembershipChangeLog : "user"
    SlackChannel }o--|| SlackUser : "creator"

    SlackTeam {
        int id PK
        string team_id UK "IX"
        string team_name
        datetime created_at
        datetime updated_at
    }

    SlackChannel {
        int id PK
        int team_id FK
        string channel_id UK "IX"
        string channel_name "IX"
        string channel_type
        text description
        string creator_user_id
        datetime created_at
        datetime updated_at
    }

    SlackChannelMembershipChangeLog {
        int id PK
        int channel_id FK
        string slack_user_id "IX"
        boolean is_joined
        datetime created_at
    }

    SlackChannelMembership {
        int id PK
        int channel_id FK
        string slack_user_id "IX"
        boolean is_restricted
        boolean is_deleted
        datetime created_at
        datetime updated_at
    }

    SlackMessage {
        bigint id PK
        int channel_id FK
        string ts UK "IX"
        string slack_user_id "IX"
        text message
        string thread_ts "IX"
        datetime slack_message_created_at "IX"
        datetime slack_message_updated_at "IX"
    }
Loading

Note: SlackUser extends BaseProfile (section 1). SlackMessage, SlackChannelMembership, and SlackChannelMembershipChangeLog reference users via slack_user_id; SlackChannel references the channel creator (SlackUser) via creator_user_id.

Note: Composite unique constraints should be applied on: (channel_id, ts) in SlackMessage, (channel_id, slack_user_id, created_at) in SlackChannelMembershipChangeLog, (channel_id, slack_user_id) in SlackChannelMembership.


7. WG21 Papers Tracker

erDiagram
    Direction LR
    WG21PaperAuthorProfile ||--o{ WG21PaperAuthor : "author"
    WG21Mailing ||--o{ WG21Paper : "has"
    WG21PaperAuthor }o--|| WG21Paper : "has"

    WG21PaperAuthor {
        int id PK
        int paper_id FK
        int profile_id FK
        int author_order
        datetime created_at
    }

    WG21Mailing {
        int id PK
        string mailing_date UK "IX"
        string title
        datetime created_at
        datetime updated_at
    }

    WG21Paper {
        int id PK
        string paper_id "IX"
        int year "IX"
        string url
        string title "IX"
        date document_date "IX"
        int mailing_id FK "IX"
        string subgroup "IX"
        boolean is_downloaded "IX"
        datetime created_at
        datetime updated_at
    }
Loading

Note: WG21PaperAuthorProfile extends BaseProfile (section 1). profile_id in WG21PaperAuthor references this profile; each paper can have multiple authors.

Note: WG21Mailing stores information about the mailing release, identified by mailing_date (e.g. "2025-03"). mailing_id in WG21Paper references this mailing.

Note: WG21Paper is uniquely identified by the composite (paper_id, year); paper_id is not globally unique. The same paper identifier may appear in different years (e.g. revisions).

Note: Composite unique constraint should be applied on (paper_id, profile_id) in WG21PaperAuthor. author_order is optional and 1-based; it indicates the order of authors on the paper.


8. Boost Website Analytics Tracker

erDiagram
    Website {
        int id PK
        date stat_date UK "IX"
        int website_visit_count
    }

    WebsiteVisitCount {
        int id PK
        date stat_date "IX"
        string country "IX"
        int count
    }

    WebsiteWordCount {
        int id PK
        date stat_date "IX"
        string word "IX"
        int count
    }
Loading

Note: Website stores daily site visit totals by stat_date. WebsiteVisitCount and WebsiteWordCount store per-country and per-word statistics; both reference the same stat_date for aggregation.

Note: Composite unique constraints should be applied on: (stat_date, country) in WebsiteVisitCount, (stat_date, word) in WebsiteWordCount.


9. CPPA Pinecone Sync

erDiagram
    PineconeFailList {
        int id PK
        string failed_id "IX"
        string app_type "IX"
        datetime created_at
    }

    PineconeSyncStatus {
        int id PK
        string app_type UK "IX"
        datetime final_sync_at
        datetime created_at
        datetime updated_at
    }
Loading

Note: PineconeFailList - Records failed sync operations by failed_id and app_type for retry or audit.

Note: PineconeSyncStatus - Tracks the last successful sync per app. One row per app_type. final_sync_at is when the last sync for that type completed; created_at and updated_at are for the row.


10. Boost Library Docs Tracker

erDiagram
    BoostLibraryVersion ||--o{ BoostLibraryDocumentation : "has"
    BoostVersion ||--o{ BoostDocContent : "has"
    BoostDocContent ||--o{ BoostLibraryDocumentation : "used_in"

    BoostDocContent {
        int id PK
        text url "IX"
        string content_hash UK "IX"
        int first_version_id FK
        int last_version_id FK
        boolean is_upserted
        datetime scraped_at
        datetime created_at
    }

    BoostLibraryDocumentation {
        int id PK
        int boost_library_version_id FK
        int boost_doc_content_id FK
        datetime created_at
    }
Loading

Note: BoostDocContent stores one globally unique scraped page per content hash. One row per unique content_hash regardless of version or library. Page content is not stored in the DB; it is kept in workspace files. content_hash (SHA-256 of the page text) is the unique key — the same URL may produce a new row if the content changes. first_version_id / last_version_id track the earliest and latest Boost version in which this page content was observed. is_upserted tracks whether the page has been successfully upserted to Pinecone. scraped_at is updated each time the page is re-fetched.

Note: BoostLibraryDocumentation is the join table between BoostLibraryVersion (section 3) and BoostDocContent. One row per (library-version, page) pair — it records which pages were found under a given (library, version) combination.

Note: Unique constraint on content_hash in BoostDocContent. Composite unique constraint on (boost_library_version_id, boost_doc_content_id) in BoostLibraryDocumentation. Index on boost_library_version_id in BoostLibraryDocumentation for efficient per-library-version queries.


10. CPPA YouTube Script Tracker

Stores YouTube video metadata, VTT transcripts, speaker links, and community tags for C++ conference talks (CppCon, C++Now, Meeting C++, etc.).

  • YouTubeChannel — publisher channel; channel_id is the primary key.
  • YouTubeVideo — video metadata and transcript state; video_id is the primary key.
  • YouTubeVideoSpeaker — M2M join between YouTubeVideo and cppa_user_tracker.YoutubeSpeaker.
  • CppaTags — C++ community tag vocabulary (e.g. concurrency, templates, modules).
  • YouTubeVideoTags — M2M join between YouTubeVideo and CppaTags.

Workspace layout:

workspace/
├── cppa_youtube_script_tracker/
│   └── metadata/{video_id}.json        # short-lived queue; moved to raw after DB persist
└── raw/
    └── cppa_youtube_script_tracker/
        ├── metadata/{video_id}.json    # permanent archive
        └── transcripts/{video_id}.en.vtt  # permanent archive
erDiagram
    direction TB
    YoutubeSpeaker ||--o{ YouTubeVideoSpeaker : "appears_in"
    YouTubeVideo ||--o{ YouTubeVideoSpeaker : "has"
    YouTubeChannel ||--o{ YouTubeVideo : "hosts"
    YouTubeVideo ||--o{ YouTubeVideoTags : "has"
    CppaTags ||--o{ YouTubeVideoTags : "tagged_in"

    YouTubeChannel {
        string channel_id PK
        string channel_title
        datetime created_at
        datetime updated_at
    }

    YouTubeVideo {
        string video_id PK
        string channel_id FK
        string title
        text description
        datetime published_at "IX"
        int duration_seconds
        int view_count
        int like_count
        int comment_count
        string search_term
        bool has_transcript
        string transcript_path
        datetime scraped_at
        datetime created_at
        datetime updated_at
    }

    YouTubeVideoSpeaker {
        int id PK
        string video_id FK
        int speaker_id FK
        datetime created_at
    }

    CppaTags {
        int id PK
        string tag_name "UK IX"
    }

    YouTubeVideoTags {
        int id PK
        string youtube_video_id FK
        int cppa_tag_id FK
    }

    YoutubeSpeaker {
        int baseprofile_ptr_id PK
        string display_name "IX"
    }
Loading

Note: YoutubeSpeaker is defined in cppa_user_tracker (section 1) and extends BaseProfile. It is identified solely by display_name (same pattern as MailingListProfile and WG21PaperAuthorProfile).

Note: YouTubeVideoSpeaker has a unique constraint on (video, speaker).

Note: YouTubeVideoTags has a unique constraint on (youtube_video, cppa_tag). CppaTags.tag_name values are stored lowercase.


11. Discord Activity Tracker (discord_activity_tracker)

Guilds, channels, messages, and reactions ingested from DiscordChatExporter JSON (see service_api/discord_activity_tracker.md). Discord user rows live in cppa_user_tracker.DiscordProfile (extends BaseProfile, section 1); this app only stores server/channel/message/reaction tables.

erDiagram
    direction LR
    DiscordServer ||--o{ DiscordChannel : "has"
    DiscordChannel ||--o{ DiscordMessage : "contains"
    DiscordProfile ||--o{ DiscordMessage : "author"
    DiscordMessage ||--o{ DiscordReaction : "has"

    DiscordServer {
        bigint server_id "UK IX"
        string server_name "IX"
        string icon_url
        datetime created_at
        datetime updated_at
    }

    DiscordChannel {
        int id PK
        int server_id FK
        bigint channel_id "UK IX"
        string channel_name "IX"
        string channel_type
        bigint category_id "IX nullable"
        string category_name
        text topic
        int position
        datetime last_synced_at "IX"
        datetime last_activity_at "IX"
        datetime created_at
        datetime updated_at
    }

    DiscordMessage {
        int id PK
        int channel_id FK
        int author_id FK
        bigint message_id "UK IX"
        text content
        string message_type "IX default Default"
        boolean is_pinned "IX"
        datetime message_created_at "IX"
        datetime message_edited_at
        boolean is_deleted "IX"
        datetime deleted_at
        bigint reply_to_message_id "IX nullable"
        boolean has_attachments
        json attachment_urls
        datetime created_at
        datetime updated_at
    }

    DiscordReaction {
        int id PK
        int message_id FK
        string emoji "IX"
        int count
        datetime created_at
        datetime updated_at
    }

    DiscordProfile {
        int baseprofile_ptr_id PK "FK BaseProfile"
        bigint discord_user_id "UK IX"
        string username "IX"
    }
Loading

Note: DiscordServer is keyed by Discord guild snowflake server_id (unique). DiscordChannel is keyed by channel_id (unique); server_id FK uses db_column="server_id" to the parent server’s PK id (Django default), not the snowflake — join in ORM via channel.server.server_id when you need the guild snowflake.

Note: DiscordMessage is keyed by message_id (Discord snowflake, unique). author_idDiscordProfile (cppa_user_tracker). reply_to_message_id stores the parent message snowflake when the message is a reply (no FK to another row). message_type and is_pinned mirror exporter metadata (migration 0005).

Note: DiscordReaction has a unique constraint on (message, emoji) (discord_activity_tracker_msg_emoji_uniq).


Appendix

Appendix A: Table summary

Table Description Section
BaseProfile Base table for profiles; extended by platform-specific profile tables. Has identity_id FK to Identity. 1
Identity Top-level user/account; one identity can have multiple BaseProfiles. 1
Email Email addresses linked to BaseProfile (one profile, many emails). 1
GitHubAccount Profile for GitHub (user/org/enterprise); extends BaseProfile. 1
SlackUser Profile for Slack; extends BaseProfile. 1
MailingListProfile Profile for mailing list; extends BaseProfile. 1
WG21PaperAuthorProfile Profile for WG21 paper authors; extends BaseProfile. 1
DiscordProfile Discord user profile (cppa_user_tracker); extends BaseProfile. discord_user_id UK; used as DiscordMessage.author. 1, 11
TmpIdentity Temporary identity for staging (CPPA User Tracker). 1
TempProfileIdentityRelation Staging table: base_profile_id -> target_identity_id (CPPA User Tracker). 1
GitHubRepository Repository metadata (owner, repo_name, stars, forks, etc.). Base table for repo subtypes. 2
GitHubFile File in a repo (filename, repo_id, is_deleted). Base for file subtypes. 2
Language Reference: language name. 2
CreatedReposByLanguage Yearly repository counts by language (all_repos, significant_repos; unique on language_id + year). 2
License Reference: license name, spdx_id, url. 2
RepoLanguage Repo-language link with line_count. 2
RepoLicense Repo-license link. 2
GitCommit Commit in a repo (hash, committer, comment, commit_at). 2
GitCommitFileChange Per-commit file change (links commit, GitHubFile, status, additions, deletions, patch). 2
Issue GitHub issue (repo, creator, number, title, body, state, labels, assignees). 2
IssueComment Comment on an issue. 2
IssueAssignee Issue-assignee link. 2
IssueLabel Issue-label name. 2
PullRequest PR (repo, creator, number, title, body, state, head_hash, base_hash, dates). 2
PullRequestReview Review on a PR. 2
PullRequestComment Comment on a PR. 2
PullRequestAssignee PR-assignee link. 2
PullRequestLabel PR-label name. 2
ClangGithubIssueItem Clang mirror: one row per issue/PR number (no FKs); GitHub timestamps + Django updated_at for Pinecone incrementality. 2b
ClangGithubCommit Clang mirror: one row per commit SHA (no FKs); github_committed_at for fetch watermark. 2b
BoostLibraryRepository Extends GitHubRepository; adds created_at, updated_at (Boost repos). 3
BoostLibrary Library within a Boost repo (name). 3
BoostFile Extends GitHubFile; adds library_id (file in a Boost library). 3
BoostVersion Reference: Boost version string. 3
BoostLibraryVersion Library-version link (cpp_version, description). 3
BoostDependency Library dependency (client_library, version, dep_library). 3
DependencyChangeLog Log of dependency add/remove (client_library, dep_library, is_add, created_at). 3
BoostLibraryRoleRelationship Library version-account link (maintainer/author). 3
BoostLibraryCategory Reference: category name. 3
BoostLibraryCategoryRelationship Library-category link. 3
BoostExternalRepository Extends GitHubRepository; adds boost_version, is_boost_embedded, is_boost_used. 4
BoostUsage External repo use of Boost (repo, boost_header_id, file_path_id, last_commit_date). 4
BoostMissingHeaderTmp Temporary usage records when header_name is not yet in BoostFile/GitHubFile (usage_id→BoostUsage.id). 4
MailingListMessage Mailing list message (sender_id->MailingListProfile, msg_id, subject, content, list_name, sent_at). 5
SlackTeam Slack workspace (team_id, team_name). 6
SlackChannel Channel in a team (channel_id, name, type, creator_user_id). 6
SlackMessage Message in a channel (ts, slack_user_id, message, thread_ts). 6
SlackChannelMembership Channel-member link (slack_user_id, is_restricted, is_deleted). 6
SlackChannelMembershipChangeLog Log of join/leave (slack_user_id, is_joined, created_at). 6
WG21Paper WG21 paper (paper_id, url, title, publication_date). 7
WG21PaperAuthor Paper-author link (paper_id, profile_id->WG21PaperAuthorProfile). 7
Website Daily site visit total (stat_date, website_visit_count). 8
WebsiteVisitCount Per-date, per-country visit count. 8
WebsiteWordCount Per-date, per-word count. 8
PineconeFailList Failed sync records (failed_id, type) for retry/audit. 9
PineconeSyncStatus Last sync per type (app_type, final_sync_at, …); includes Discord when PINECONE_DISCORD_APP_TYPE is set. 9
YoutubeSpeaker Profile for YouTube speakers; extends BaseProfile. Identified by display_name. 1, 10
YouTubeChannel Publisher channel; channel_id is PK (no auto-increment id). 10
YouTubeVideo Video metadata, transcript state, and channel FK; video_id is PK (no auto-increment id). 10
YouTubeVideoSpeaker M2M join between YouTubeVideo and YoutubeSpeaker (video_id, speaker_id). 10
CppaTags C++ community tag vocabulary (tag_name, unique/lowercase). 10
YouTubeVideoTags M2M join between YouTubeVideo and CppaTags (youtube_video_id, cppa_tag_id). 10
DiscordServer Discord guild (server_id snowflake UK, name, icon). 11
DiscordChannel Channel in a guild (channel_id UK, type, category, topic, sync/activity timestamps). 11
DiscordMessage Message (message_id UK, content, type, pin, reply_to, attachments JSON, soft-delete flags). 11
DiscordReaction Emoji aggregate per message (unique on message + emoji). 11
BoostDocContent Globally unique scraped page by content hash (url, content_hash UK, first_version_id, last_version_id, is_upserted, scraped_at). One row per unique content hash across all versions. 10
BoostLibraryDocumentation Join table: BoostLibraryVersion × BoostDocContent. Records which pages belong to each (library, version) pair. 10

Appendix B: Relationship summary

From To Relationship
Identity BaseProfile One identity has many profiles
BaseProfile Email One profile has many emails
BaseProfile GitHubAccount, SlackUser, MailingListProfile, WG21PaperAuthorProfile, DiscordProfile, YoutubeSpeaker Extends (1:1 subtype)
TmpIdentity TempProfileIdentityRelation Has many (target)
TempProfileIdentityRelation BaseProfile Has many (base_profile_id)
GitHubAccount GitHubRepository Owns many
GitHubRepository RepoLanguage, RepoLicense Has many
Language CreatedReposByLanguage Has many yearly stats
GitHubRepository BoostLibraryRepository, BoostExternalRepository Extends (1:1 subtype)
GitHubRepository GitCommit, Issue, PullRequest Contains many
GitHubRepository GitHubFile Has many
GitHubFile BoostFile Extends (1:1 subtype)
GitHubFile GitCommitFileChange Changed in (many file changes)
GitCommit GitCommitFileChange Has many
Issue IssueComment, IssueAssignee, IssueLabel Has many
PullRequest PullRequestReview, PullRequestComment, PullRequestAssignee, PullRequestLabel Has many
GitHubAccount GitCommit, Issue, IssueComment, IssueAssignee, PullRequest, PullRequestReview, PullRequestComment, PullRequestAssignee Committer/creator/author/assignee/reviewer
BoostLibraryRepository BoostLibrary Has many
BoostLibrary BoostFile, BoostDependency (client/dep), BoostLibraryVersion, DependencyChangeLog Has many
BoostLibrary BoostLibraryCategoryRelationship Has many
BoostVersion BoostDependency, BoostLibraryVersion Version
BoostLibraryVersion BoostLibraryRoleRelationship Has many
GitHubAccount BoostLibraryRoleRelationship Role (maintainer/author)
BoostLibraryCategory BoostLibraryCategoryRelationship Category
BoostExternalRepository BoostUsage Has many
BoostUsage BoostFile, GitHubFile References (boost header, file path)
BoostUsage BoostMissingHeaderTmp Has many (temporary missing-header records)
MailingListProfile MailingListMessage Sender (has many messages)
SlackTeam SlackChannel Has many
SlackChannel SlackMessage, SlackChannelMembership, SlackChannelMembershipChangeLog Contains / has many
SlackUser SlackMessage, SlackChannelMembership, SlackChannelMembershipChangeLog Author / member / user
SlackChannel SlackUser Creator (many-to-one)
WG21PaperAuthorProfile WG21PaperAuthor Author (has many)
WG21Paper WG21PaperAuthor Has many authors
YoutubeSpeaker YouTubeVideoSpeaker Appears in (many videos)
YouTubeChannel YouTubeVideo Hosts many videos
YouTubeVideo YouTubeVideoSpeaker Has many speakers
YouTubeVideo YouTubeVideoTags Has many tags
CppaTags YouTubeVideoTags Tagged in many videos
DiscordServer DiscordChannel Has many channels
DiscordChannel DiscordMessage Contains many messages
DiscordProfile DiscordMessage Author (has many messages)
DiscordMessage DiscordReaction Has many reactions
BoostLibraryVersion BoostLibraryDocumentation Has many (boost_library_version_id)
BoostDocContent BoostLibraryDocumentation Used in many (boost_doc_content_id)