The Boost Data Collector is a Django project backed by a single database (boost_dashboard). All sub-apps use this same database: they do not use separate databases or schema-based isolation. Tables from different apps are linked by relationships (e.g. foreign keys across apps), so data can be joined and reused. Reference data (e.g. Language, License) is defined and owned by one app; other apps reference it via foreign keys rather than duplicating it. The diagrams below show these shared base tables and app-specific tables and how they connect.
Legend: PK = Primary Key, FK = Foreign Key, UK = Unique Key, IX = Index
erDiagram
direction TB
Email }o-- || BaseProfile : "has"
BaseProfile ||--o| GitHubAccount : "extends"
BaseProfile ||--o| SlackUser : "extends"
BaseProfile ||--o| MailingListProfile : "extends"
BaseProfile ||--o| WG21PaperAuthorProfile : "extends"
BaseProfile ||--o| YoutubeSpeaker : "extends"
BaseProfile ||--o| DiscordProfile : "extends"
Identity }o--|| BaseProfile : "has"
TempProfileIdentityRelation ||--o{ BaseProfile : "has"
TmpIdentity ||--o{ TempProfileIdentityRelation : "has"
BaseProfile {
int id PK "auto-increment"
int identity_id FK
enum type
}
Email {
int id PK
int base_profile_id FK
string email "IX"
boolean is_primary
boolean is_active
datetime created_at
datetime updated_at
}
GitHubAccount {
bigint github_account_id "IX"
string username "IX"
string display_name "IX"
string avatar_url
enum type
datetime created_at
datetime updated_at
}
SlackUser {
string slack_user_id "IX"
string username "IX"
string display_name "IX"
string avatar_url
datetime created_at
datetime updated_at
}
MailingListProfile {
string display_name "IX"
datetime created_at
datetime updated_at
}
WG21PaperAuthorProfile {
string display_name "IX"
string author_alias "IX"
datetime created_at
datetime updated_at
}
YoutubeSpeaker {
string display_name "IX"
datetime created_at
datetime updated_at
}
DiscordProfile {
bigint discord_user_id "UK IX"
string username "IX"
string display_name "IX"
string avatar_url
boolean is_bot
datetime created_at
datetime updated_at
}
Identity {
int id PK
string display_name "IX"
text description
datetime created_at
datetime updated_at
}
TmpIdentity {
int id PK
string display_name "IX"
text description
datetime created_at
datetime updated_at
}
TempProfileIdentityRelation {
int id PK
int base_profile_id FK
int target_identity_id FK
datetime created_at
datetime updated_at
}
Note: Each extended table has id as primary key and foreign key to BaseProfile.id. The value is the same: one auto-increment in BaseProfile, and that same id is stored in exactly one extended profile row. Other tables (e.g. GitCommit, Issue) reference the profile via this single id. DiscordProfile (in cppa_user_tracker) is the author profile for DiscordMessage rows in discord_activity_tracker (author_id → DiscordProfile.id).
Note: The Email table references BaseProfile via base_profile_id (FK to BaseProfile.id). One profile can have multiple email addresses; is_primary marks the primary email; is_active indicates whether the email is currently active. Other tables (e.g. MailingListMessage) can link to a profile via Email. Note: The email field is not unique; the same email address may appear in multiple rows (e.g. for different profiles or over time).
Note: The type field is a PostgreSQL enum (or equivalent) with values: github, slack, mailing_list, wg21, discord, youtube. It identifies which extended table the row belongs to.
Note: In GitHubAccount, the type field is an enum with values: user, organization, enterprise (identifies whether the GitHub account is a user, organization, or enterprise).
Note: BaseProfile references Identity via identity_id (FK to Identity.id). One identity can have multiple BaseProfiles (e.g. one person with GitHub and Slack). Identity, TmpIdentity, and TempProfileIdentityRelation are used by the CPPA User Tracker: Identity holds the canonical user/account; TmpIdentity and TempProfileIdentityRelation stage temporary profile-to-identity relations (e.g. base_profile_id, target_identity_id) before merging.
GitHubRepository is the base table with all repository fields. GitHubAccount owns GitHubRepository; RepoLanguage and RepoLicense reference the repository.
erDiagram
Direction LR
GitHubAccount ||--o{ GitHubRepository : "owns"
GitHubRepository ||--o{ RepoLanguage : "has"
GitHubRepository ||--o{ RepoLicense : "has"
RepoLanguage }o--|| Language : "used_in"
Language ||--o{ CreatedReposByLanguage : "yearly_stats"
License ||--o{ RepoLicense : "used_in"
GitHubRepository {
int id PK
int owner_account_id FK
string repo_name "IX"
int stars
int forks
text description
datetime repo_pushed_at "IX"
datetime repo_created_at "IX"
datetime repo_updated_at "IX"
}
Language {
int id PK
string name UK "IX"
datetime created_at
}
License {
int id PK
string name UK "IX"
string spdx_id "IX"
string url
datetime created_at
}
RepoLanguage {
int id PK
int repo_id FK
int language_id FK
int line_count
datetime created_at
datetime updated_at
}
RepoLicense {
int id PK
int repo_id FK
int license_id FK
datetime created_at
}
CreatedReposByLanguage {
int id PK
int language_id FK
int year "IX"
int all_repos
int significant_repos
datetime created_at
datetime updated_at
}
Note: GitHubRepository is the base table with all repository fields.
Note: Composite unique constraints should be applied on: (owner_account_id, repo_name) in GitHubRepository, (repo_id, language_id) in RepoLanguage, (repo_id, license_id) in RepoLicense, (language_id, year) in CreatedReposByLanguage.
erDiagram
GitHubAccount ||--o{ GitCommit : "committer"
GitHubAccount ||--o{ Issue : "creator"
GitHubAccount ||--o{ IssueComment : "author"
GitHubAccount ||--o{ IssueAssignee : "assignee"
GitHubRepositoy ||--o{ GitCommit : "contains"
GitHubRepositoy ||--o{ Issue : "contains"
GitHubRepositoy ||--o{ GitHubFile : "has"
GitCommit ||--o{ GitCommitFileChange : "has"
GitHubFile ||--o{ GitCommitFileChange : "changed_in"
Issue ||--o{ IssueComment : "has"
Issue ||--o{ IssueAssignee : "has"
Issue ||--o{ IssueLabel : "has"
GitCommit {
bigint id PK
int repo_id FK
int account_id FK
string commit_hash UK "IX"
text comment
datetime commit_at "IX"
}
GitHubFile {
BigInt id PK
int repo_id FK
string filename "IX"
boolean is_deleted
datetime created_at
}
GitCommitFileChange {
int id PK
bigint commit_id FK
int github_file_id FK "IX"
enum status "IX"
int additions
int deletions
text patch
datetime created_at
}
Issue {
int id PK
int repo_id FK
int account_id FK
int issue_number "IX"
bigint issue_id UK "IX"
text title
text body
enum state "IX"
enum state_reason
datetime issue_created_at "IX"
datetime issue_updated_at "IX"
datetime issue_closed_at "IX"
}
IssueComment {
int id PK
int issue_id FK
int account_id FK
bigint issue_comment_id UK "IX"
text body
datetime issue_comment_created_at
datetime issue_comment_updated_at
}
IssueAssignee {
int id PK
int issue_id FK
int account_id FK
datetime created_at
}
IssueLabel {
int id PK
int issue_id FK
string label_name "IX"
datetime created_at
}
Note: Enum types:
GitCommitFileChange.status:file_change_statusenum with values:added,modified,removed,renamed,copied,changedIssue.state:issue_stateenum with values:open,closedIssue.state_reason:issue_state_reasonenum with values:completed,not_planned,reopened,null
Note: Composite unique constraints should be applied on: (repo_id, commit_hash) in GitCommit, (commit_id, github_file_id) in GitCommitFileChange, (repo_id, issue_number) in Issue, (issue_id, account_id) in IssueAssignee, (issue_id, label_name) in IssueLabel.
erDiagram
GitHubAccount ||--o{ PullRequest : "creator"
GitHubAccount ||--o{ PullRequestReview : "reviewer"
GitHubAccount ||--o{ PullRequestComment : "author"
GitHubAccount ||--o{ PullRequestAssignee : "assignee"
GitHubRepository ||--o{ PullRequest : "contains"
PullRequest ||--o{ PullRequestReview : "has"
PullRequest ||--o{ PullRequestComment : "has"
PullRequest ||--o{ PullRequestAssignee : "has"
PullRequest ||--o{ PullRequestLabel : "has"
PullRequest {
int id PK
int repo_id FK
int account_id FK
int pr_number "IX"
bigint pr_id UK "IX"
text title
text body
enum state "IX"
string head_hash "IX"
string base_hash "IX"
datetime pr_created_at "IX"
datetime pr_updated_at "IX"
datetime pr_merged_at "IX"
datetime pr_closed_at "IX"
}
PullRequestReview {
int id PK
int pr_id FK
int account_id FK
bigint pr_review_id UK "IX"
text body
bigint in_reply_to_id
datetime pr_review_created_at "IX"
datetime pr_review_updated_at "IX"
}
PullRequestComment {
int id PK
int pr_id FK
int account_id FK
bigint pr_comment_id UK "IX"
text body
datetime pr_comment_created_at "IX"
datetime pr_comment_updated_at "IX"
}
PullRequestAssignee {
int id PK
int pr_id FK
int account_id FK
datetime created_at
}
PullRequestLabel {
int id PK
int pr_id FK
string label_name "IX"
datetime created_at
}
Note: Enum types:
PullRequest.state:pull_request_stateenum with values:open,closed,merged
Note: Composite unique constraints should be applied on: (repo_id, pr_number) in PullRequest, (pr_id, account_id) in PullRequestAssignee, (pr_id, label_name) in PullRequestLabel.
Standalone tables for the llvm/llvm-project (or CLANG_GITHUB_OWNER / CLANG_GITHUB_REPO) mirror. No foreign keys to other apps.
| Model | Purpose |
|---|---|
| ClangGithubIssueItem | One row per issue or PR number (unique). is_pull_request distinguishes types. github_created_at / github_updated_at mirror GitHub API times; github_updated_at (with Max + 1ms) drives API fetch resume. Django updated_at (auto_now) bumps on every upsert and drives Pinecone incrementality vs PineconeSyncStatus.final_sync_at. |
| ClangGithubCommit | One row per sha (unique, 40-char hex). github_committed_at is the author/committer date used for commit fetch watermarks. |
Raw JSON remains under workspace/raw/github_activity_tracker/<owner>/<repo>/ (same layout as other raw GitHub activity).
erDiagram
GitHubRepository ||--o| BoostLibraryRepository : "extend"
GitHubFile ||--o| BoostFile : "extend"
BoostLibraryRepository ||--o{ BoostLibrary : "has"
BoostLibrary ||--o{ BoostFile : "has"
BoostLibrary ||--o{ BoostDependency : "client_library"
BoostLibrary ||--o{ BoostDependency : "dep_library"
BoostLibrary ||--o{ BoostLibraryVersion : "has"
BoostLibrary ||--o{ DependencyChangeLog : "client_library"
BoostLibrary ||--o{ DependencyChangeLog : "dep_library"
BoostVersion ||--o{ BoostDependency : "version"
BoostVersion ||--o{ BoostLibraryVersion : "version"
BoostLibraryRepository {
datetime created_at
datetime updated_at
}
BoostLibrary {
int id PK
int repo_id FK
string name "IX"
}
BoostFile {
int library_id FK
}
BoostVersion {
int id PK
string version UK "IX"
datetime version_created_at
}
BoostLibraryVersion {
int id PK
int library_id FK
int version_id FK
string cpp_version
text description
string documentation
string key
datetime created_at
datetime updated_at
}
BoostDependency {
int id PK
int client_library_id FK
int version_id FK
int dep_library_id FK
datetime created_at
}
DependencyChangeLog {
int id PK
int client_library_id FK
int dep_library_id FK
boolean is_add
date created_at "IX"
}
Note: BoostLibraryRepository extends GitHubRepository and adds created_at, updated_at (and any other app-specific fields); it inherits all repository fields from GitHubRepository.
Note: BoostFile extends GitHubFile and only adds library_id (inherits id, repo_id, filename, created_at, is_deleted from GitHubFile).
Note: Composite unique constraints should be applied on: (library_id, version_id) in BoostLibraryVersion, (client_library_id, version_id, dep_library_id) in BoostDependency, (client_library_id, dep_library_id, created_at) in DependencyChangeLog.
erDiagram
GitHubAccount ||--o{ BoostLibraryRoleRelationship : "role"
BoostLibraryVersion ||--o{ BoostLibraryRoleRelationship : "has"
BoostLibrary ||--o{ BoostLibraryCategoryRelationship : "has"
BoostLibraryCategory ||--o{ BoostLibraryCategoryRelationship : "category"
BoostLibraryRoleRelationship {
int id PK
int library_version_id FK
int account_id FK
boolean is_maintainer
boolean is_author
datetime created_at
datetime updated_at
}
BoostLibraryCategory {
int id PK
string name UK "IX"
datetime created_at
datetime updated_at
}
BoostLibraryCategoryRelationship {
int id PK
int library_id FK
int category_id FK
datetime created_at
datetime updated_at
}
Note: Composite unique constraints should be applied on: (library_version_id, account_id) in BoostLibraryRoleRelationship, (library_id, category_id) in BoostLibraryCategoryRelationship.
erDiagram
direction LR
GitHubRepository ||--o| BoostExternalRepository : "extend"
BoostExternalRepository ||--o{ BoostUsage : "has"
BoostUsage }o--|| "BoostFile (defined in Boost Library Tracker)" : "Boost header file"
BoostUsage }o--|| "GitHubFile (defined in GitHub Activity Tracker)" : "current file path"
BoostUsage ||--o{ BoostMissingHeaderTmp : "temporary missing header"
BoostExternalRepository {
string boost_version "IX"
boolean is_boost_embedded
boolean is_boost_used
datetime created_at
datetime updated_at
}
BoostUsage {
int id PK
int repo_id FK
BigInt boost_header_id FK "Nullable"
BigInt file_path_id FK
datetime last_commit_date "IX"
date excepted_at
datetime created_at
datetime updated_at
}
BoostMissingHeaderTmp {
int id PK
int usage_id FK "references BoostUsage.id"
string header_name
datetime created_at
}
Note: BoostMissingHeaderTmp temporarily stores usage history when the Boost include path (header_name) does not yet exist in the Boost/GitHub file tables (e.g. BoostFile or GitHubFile). usage_id references BoostUsage.id. Once the header is added to the catalog, these records can be processed (e.g. backfilled into BoostUsage with a resolved boost_header_id) and optionally removed.
Note: BoostExternalRepository extends GitHubRepository and only adds boost_version, is_boost_embedded, is_boost_used, created_at, updated_at. Repository identity and metadata (e.g. owner, repo_name, stars, forks, description, repo_pushed_at, repo_created_at, repo_updated_at) are inherited from GitHubRepository.
Note: BoostUsage links each external repository to a Boost header file and to the file path where it is used: boost_header_id references BoostFile (defined in Boost Library Tracker; extends GitHubFile, only adds library_id) for the Boost header; file_path_id references GitHubFile (defined in GitHub Activity Tracker) for the current file path in that repo. This tracks which external repos use which Boost files and in which files they appear.
Note: A composite unique constraint should be applied on (repo_id, boost_header_id, file_path_id) in BoostUsage.
Note: BoostMissingHeaderTmp.usage_id references BoostUsage.id (FK). Consider an index on usage_id and on header_name for lookups and backfill.
erDiagram
direction LR
MailingListProfile ||--o{ MailingListMessage : "sender"
MailingListMessage {
int id PK
int sender_id FK
string msg_id UK "IX"
string parent_id "IX"
string thread_id "IX"
string subject
text content
string list_name "IX"
datetime sent_at "IX"
datetime created_at
}
Note: MailingListProfile extends BaseProfile (section 1) and represents the mailing list user/account. sender_id in MailingListMessage references this profile.
erDiagram
direction LR
SlackTeam ||--o{ SlackChannel : "has"
SlackChannel ||--o{ SlackChannelMembershipChangeLog : "has"
SlackChannel ||--o{ SlackChannelMembership : "has"
SlackChannel ||--o{ SlackMessage : "contains"
SlackUser ||--o{ SlackMessage : "author"
SlackUser ||--o{ SlackChannelMembership : "member"
SlackUser ||--o{ SlackChannelMembershipChangeLog : "user"
SlackChannel }o--|| SlackUser : "creator"
SlackTeam {
int id PK
string team_id UK "IX"
string team_name
datetime created_at
datetime updated_at
}
SlackChannel {
int id PK
int team_id FK
string channel_id UK "IX"
string channel_name "IX"
string channel_type
text description
string creator_user_id
datetime created_at
datetime updated_at
}
SlackChannelMembershipChangeLog {
int id PK
int channel_id FK
string slack_user_id "IX"
boolean is_joined
datetime created_at
}
SlackChannelMembership {
int id PK
int channel_id FK
string slack_user_id "IX"
boolean is_restricted
boolean is_deleted
datetime created_at
datetime updated_at
}
SlackMessage {
bigint id PK
int channel_id FK
string ts UK "IX"
string slack_user_id "IX"
text message
string thread_ts "IX"
datetime slack_message_created_at "IX"
datetime slack_message_updated_at "IX"
}
Note: SlackUser extends BaseProfile (section 1). SlackMessage, SlackChannelMembership, and SlackChannelMembershipChangeLog reference users via slack_user_id; SlackChannel references the channel creator (SlackUser) via creator_user_id.
Note: Composite unique constraints should be applied on: (channel_id, ts) in SlackMessage, (channel_id, slack_user_id, created_at) in SlackChannelMembershipChangeLog, (channel_id, slack_user_id) in SlackChannelMembership.
erDiagram
Direction LR
WG21PaperAuthorProfile ||--o{ WG21PaperAuthor : "author"
WG21Mailing ||--o{ WG21Paper : "has"
WG21PaperAuthor }o--|| WG21Paper : "has"
WG21PaperAuthor {
int id PK
int paper_id FK
int profile_id FK
int author_order
datetime created_at
}
WG21Mailing {
int id PK
string mailing_date UK "IX"
string title
datetime created_at
datetime updated_at
}
WG21Paper {
int id PK
string paper_id "IX"
int year "IX"
string url
string title "IX"
date document_date "IX"
int mailing_id FK "IX"
string subgroup "IX"
boolean is_downloaded "IX"
datetime created_at
datetime updated_at
}
Note: WG21PaperAuthorProfile extends BaseProfile (section 1). profile_id in WG21PaperAuthor references this profile; each paper can have multiple authors.
Note: WG21Mailing stores information about the mailing release, identified by mailing_date (e.g. "2025-03"). mailing_id in WG21Paper references this mailing.
Note: WG21Paper is uniquely identified by the composite (paper_id, year); paper_id is not globally unique. The same paper identifier may appear in different years (e.g. revisions).
Note: Composite unique constraint should be applied on (paper_id, profile_id) in WG21PaperAuthor. author_order is optional and 1-based; it indicates the order of authors on the paper.
erDiagram
Website {
int id PK
date stat_date UK "IX"
int website_visit_count
}
WebsiteVisitCount {
int id PK
date stat_date "IX"
string country "IX"
int count
}
WebsiteWordCount {
int id PK
date stat_date "IX"
string word "IX"
int count
}
Note: Website stores daily site visit totals by stat_date. WebsiteVisitCount and WebsiteWordCount store per-country and per-word statistics; both reference the same stat_date for aggregation.
Note: Composite unique constraints should be applied on: (stat_date, country) in WebsiteVisitCount, (stat_date, word) in WebsiteWordCount.
erDiagram
PineconeFailList {
int id PK
string failed_id "IX"
string app_type "IX"
datetime created_at
}
PineconeSyncStatus {
int id PK
string app_type UK "IX"
datetime final_sync_at
datetime created_at
datetime updated_at
}
Note: PineconeFailList - Records failed sync operations by failed_id and app_type for retry or audit.
Note: PineconeSyncStatus - Tracks the last successful sync per app. One row per app_type. final_sync_at is when the last sync for that type completed; created_at and updated_at are for the row.
erDiagram
BoostLibraryVersion ||--o{ BoostLibraryDocumentation : "has"
BoostVersion ||--o{ BoostDocContent : "has"
BoostDocContent ||--o{ BoostLibraryDocumentation : "used_in"
BoostDocContent {
int id PK
text url "IX"
string content_hash UK "IX"
int first_version_id FK
int last_version_id FK
boolean is_upserted
datetime scraped_at
datetime created_at
}
BoostLibraryDocumentation {
int id PK
int boost_library_version_id FK
int boost_doc_content_id FK
datetime created_at
}
Note: BoostDocContent stores one globally unique scraped page per content hash. One row per unique content_hash regardless of version or library. Page content is not stored in the DB; it is kept in workspace files. content_hash (SHA-256 of the page text) is the unique key — the same URL may produce a new row if the content changes. first_version_id / last_version_id track the earliest and latest Boost version in which this page content was observed. is_upserted tracks whether the page has been successfully upserted to Pinecone. scraped_at is updated each time the page is re-fetched.
Note: BoostLibraryDocumentation is the join table between BoostLibraryVersion (section 3) and BoostDocContent. One row per (library-version, page) pair — it records which pages were found under a given (library, version) combination.
Note: Unique constraint on content_hash in BoostDocContent. Composite unique constraint on (boost_library_version_id, boost_doc_content_id) in BoostLibraryDocumentation. Index on boost_library_version_id in BoostLibraryDocumentation for efficient per-library-version queries.
Stores YouTube video metadata, VTT transcripts, speaker links, and community tags for C++ conference talks (CppCon, C++Now, Meeting C++, etc.).
YouTubeChannel— publisher channel;channel_idis the primary key.YouTubeVideo— video metadata and transcript state;video_idis the primary key.YouTubeVideoSpeaker— M2M join betweenYouTubeVideoandcppa_user_tracker.YoutubeSpeaker.CppaTags— C++ community tag vocabulary (e.g.concurrency,templates,modules).YouTubeVideoTags— M2M join betweenYouTubeVideoandCppaTags.
Workspace layout:
workspace/
├── cppa_youtube_script_tracker/
│ └── metadata/{video_id}.json # short-lived queue; moved to raw after DB persist
└── raw/
└── cppa_youtube_script_tracker/
├── metadata/{video_id}.json # permanent archive
└── transcripts/{video_id}.en.vtt # permanent archive
erDiagram
direction TB
YoutubeSpeaker ||--o{ YouTubeVideoSpeaker : "appears_in"
YouTubeVideo ||--o{ YouTubeVideoSpeaker : "has"
YouTubeChannel ||--o{ YouTubeVideo : "hosts"
YouTubeVideo ||--o{ YouTubeVideoTags : "has"
CppaTags ||--o{ YouTubeVideoTags : "tagged_in"
YouTubeChannel {
string channel_id PK
string channel_title
datetime created_at
datetime updated_at
}
YouTubeVideo {
string video_id PK
string channel_id FK
string title
text description
datetime published_at "IX"
int duration_seconds
int view_count
int like_count
int comment_count
string search_term
bool has_transcript
string transcript_path
datetime scraped_at
datetime created_at
datetime updated_at
}
YouTubeVideoSpeaker {
int id PK
string video_id FK
int speaker_id FK
datetime created_at
}
CppaTags {
int id PK
string tag_name "UK IX"
}
YouTubeVideoTags {
int id PK
string youtube_video_id FK
int cppa_tag_id FK
}
YoutubeSpeaker {
int baseprofile_ptr_id PK
string display_name "IX"
}
Note: YoutubeSpeaker is defined in cppa_user_tracker (section 1) and extends BaseProfile. It is identified solely by display_name (same pattern as MailingListProfile and WG21PaperAuthorProfile).
Note: YouTubeVideoSpeaker has a unique constraint on (video, speaker).
Note: YouTubeVideoTags has a unique constraint on (youtube_video, cppa_tag). CppaTags.tag_name values are stored lowercase.
Guilds, channels, messages, and reactions ingested from DiscordChatExporter JSON (see service_api/discord_activity_tracker.md). Discord user rows live in cppa_user_tracker.DiscordProfile (extends BaseProfile, section 1); this app only stores server/channel/message/reaction tables.
erDiagram
direction LR
DiscordServer ||--o{ DiscordChannel : "has"
DiscordChannel ||--o{ DiscordMessage : "contains"
DiscordProfile ||--o{ DiscordMessage : "author"
DiscordMessage ||--o{ DiscordReaction : "has"
DiscordServer {
bigint server_id "UK IX"
string server_name "IX"
string icon_url
datetime created_at
datetime updated_at
}
DiscordChannel {
int id PK
int server_id FK
bigint channel_id "UK IX"
string channel_name "IX"
string channel_type
bigint category_id "IX nullable"
string category_name
text topic
int position
datetime last_synced_at "IX"
datetime last_activity_at "IX"
datetime created_at
datetime updated_at
}
DiscordMessage {
int id PK
int channel_id FK
int author_id FK
bigint message_id "UK IX"
text content
string message_type "IX default Default"
boolean is_pinned "IX"
datetime message_created_at "IX"
datetime message_edited_at
boolean is_deleted "IX"
datetime deleted_at
bigint reply_to_message_id "IX nullable"
boolean has_attachments
json attachment_urls
datetime created_at
datetime updated_at
}
DiscordReaction {
int id PK
int message_id FK
string emoji "IX"
int count
datetime created_at
datetime updated_at
}
DiscordProfile {
int baseprofile_ptr_id PK "FK BaseProfile"
bigint discord_user_id "UK IX"
string username "IX"
}
Note: DiscordServer is keyed by Discord guild snowflake server_id (unique). DiscordChannel is keyed by channel_id (unique); server_id FK uses db_column="server_id" to the parent server’s PK id (Django default), not the snowflake — join in ORM via channel.server.server_id when you need the guild snowflake.
Note: DiscordMessage is keyed by message_id (Discord snowflake, unique). author_id → DiscordProfile (cppa_user_tracker). reply_to_message_id stores the parent message snowflake when the message is a reply (no FK to another row). message_type and is_pinned mirror exporter metadata (migration 0005).
Note: DiscordReaction has a unique constraint on (message, emoji) (discord_activity_tracker_msg_emoji_uniq).
| Table | Description | Section |
|---|---|---|
| BaseProfile | Base table for profiles; extended by platform-specific profile tables. Has identity_id FK to Identity. |
1 |
| Identity | Top-level user/account; one identity can have multiple BaseProfiles. | 1 |
| Email addresses linked to BaseProfile (one profile, many emails). | 1 | |
| GitHubAccount | Profile for GitHub (user/org/enterprise); extends BaseProfile. | 1 |
| SlackUser | Profile for Slack; extends BaseProfile. | 1 |
| MailingListProfile | Profile for mailing list; extends BaseProfile. | 1 |
| WG21PaperAuthorProfile | Profile for WG21 paper authors; extends BaseProfile. | 1 |
| DiscordProfile | Discord user profile (cppa_user_tracker); extends BaseProfile. discord_user_id UK; used as DiscordMessage.author. |
1, 11 |
| TmpIdentity | Temporary identity for staging (CPPA User Tracker). | 1 |
| TempProfileIdentityRelation | Staging table: base_profile_id -> target_identity_id (CPPA User Tracker). | 1 |
| GitHubRepository | Repository metadata (owner, repo_name, stars, forks, etc.). Base table for repo subtypes. | 2 |
| GitHubFile | File in a repo (filename, repo_id, is_deleted). Base for file subtypes. | 2 |
| Language | Reference: language name. | 2 |
| CreatedReposByLanguage | Yearly repository counts by language (all_repos, significant_repos; unique on language_id + year). |
2 |
| License | Reference: license name, spdx_id, url. | 2 |
| RepoLanguage | Repo-language link with line_count. | 2 |
| RepoLicense | Repo-license link. | 2 |
| GitCommit | Commit in a repo (hash, committer, comment, commit_at). | 2 |
| GitCommitFileChange | Per-commit file change (links commit, GitHubFile, status, additions, deletions, patch). | 2 |
| Issue | GitHub issue (repo, creator, number, title, body, state, labels, assignees). | 2 |
| IssueComment | Comment on an issue. | 2 |
| IssueAssignee | Issue-assignee link. | 2 |
| IssueLabel | Issue-label name. | 2 |
| PullRequest | PR (repo, creator, number, title, body, state, head_hash, base_hash, dates). | 2 |
| PullRequestReview | Review on a PR. | 2 |
| PullRequestComment | Comment on a PR. | 2 |
| PullRequestAssignee | PR-assignee link. | 2 |
| PullRequestLabel | PR-label name. | 2 |
| ClangGithubIssueItem | Clang mirror: one row per issue/PR number (no FKs); GitHub timestamps + Django updated_at for Pinecone incrementality. |
2b |
| ClangGithubCommit | Clang mirror: one row per commit SHA (no FKs); github_committed_at for fetch watermark. |
2b |
| BoostLibraryRepository | Extends GitHubRepository; adds created_at, updated_at (Boost repos). | 3 |
| BoostLibrary | Library within a Boost repo (name). | 3 |
| BoostFile | Extends GitHubFile; adds library_id (file in a Boost library). | 3 |
| BoostVersion | Reference: Boost version string. | 3 |
| BoostLibraryVersion | Library-version link (cpp_version, description). | 3 |
| BoostDependency | Library dependency (client_library, version, dep_library). | 3 |
| DependencyChangeLog | Log of dependency add/remove (client_library, dep_library, is_add, created_at). | 3 |
| BoostLibraryRoleRelationship | Library version-account link (maintainer/author). | 3 |
| BoostLibraryCategory | Reference: category name. | 3 |
| BoostLibraryCategoryRelationship | Library-category link. | 3 |
| BoostExternalRepository | Extends GitHubRepository; adds boost_version, is_boost_embedded, is_boost_used. | 4 |
| BoostUsage | External repo use of Boost (repo, boost_header_id, file_path_id, last_commit_date). | 4 |
| BoostMissingHeaderTmp | Temporary usage records when header_name is not yet in BoostFile/GitHubFile (usage_id→BoostUsage.id). | 4 |
| MailingListMessage | Mailing list message (sender_id->MailingListProfile, msg_id, subject, content, list_name, sent_at). | 5 |
| SlackTeam | Slack workspace (team_id, team_name). | 6 |
| SlackChannel | Channel in a team (channel_id, name, type, creator_user_id). | 6 |
| SlackMessage | Message in a channel (ts, slack_user_id, message, thread_ts). | 6 |
| SlackChannelMembership | Channel-member link (slack_user_id, is_restricted, is_deleted). | 6 |
| SlackChannelMembershipChangeLog | Log of join/leave (slack_user_id, is_joined, created_at). | 6 |
| WG21Paper | WG21 paper (paper_id, url, title, publication_date). | 7 |
| WG21PaperAuthor | Paper-author link (paper_id, profile_id->WG21PaperAuthorProfile). | 7 |
| Website | Daily site visit total (stat_date, website_visit_count). | 8 |
| WebsiteVisitCount | Per-date, per-country visit count. | 8 |
| WebsiteWordCount | Per-date, per-word count. | 8 |
| PineconeFailList | Failed sync records (failed_id, type) for retry/audit. | 9 |
| PineconeSyncStatus | Last sync per type (app_type, final_sync_at, …); includes Discord when PINECONE_DISCORD_APP_TYPE is set. |
9 |
| YoutubeSpeaker | Profile for YouTube speakers; extends BaseProfile. Identified by display_name. |
1, 10 |
| YouTubeChannel | Publisher channel; channel_id is PK (no auto-increment id). |
10 |
| YouTubeVideo | Video metadata, transcript state, and channel FK; video_id is PK (no auto-increment id). |
10 |
| YouTubeVideoSpeaker | M2M join between YouTubeVideo and YoutubeSpeaker (video_id, speaker_id). | 10 |
| CppaTags | C++ community tag vocabulary (tag_name, unique/lowercase). | 10 |
| YouTubeVideoTags | M2M join between YouTubeVideo and CppaTags (youtube_video_id, cppa_tag_id). | 10 |
| DiscordServer | Discord guild (server_id snowflake UK, name, icon). |
11 |
| DiscordChannel | Channel in a guild (channel_id UK, type, category, topic, sync/activity timestamps). | 11 |
| DiscordMessage | Message (message_id UK, content, type, pin, reply_to, attachments JSON, soft-delete flags). |
11 |
| DiscordReaction | Emoji aggregate per message (unique on message + emoji). | 11 |
| BoostDocContent | Globally unique scraped page by content hash (url, content_hash UK, first_version_id, last_version_id, is_upserted, scraped_at). One row per unique content hash across all versions. | 10 |
| BoostLibraryDocumentation | Join table: BoostLibraryVersion × BoostDocContent. Records which pages belong to each (library, version) pair. | 10 |
| From | To | Relationship |
|---|---|---|
| Identity | BaseProfile | One identity has many profiles |
| BaseProfile | One profile has many emails | |
| BaseProfile | GitHubAccount, SlackUser, MailingListProfile, WG21PaperAuthorProfile, DiscordProfile, YoutubeSpeaker | Extends (1:1 subtype) |
| TmpIdentity | TempProfileIdentityRelation | Has many (target) |
| TempProfileIdentityRelation | BaseProfile | Has many (base_profile_id) |
| GitHubAccount | GitHubRepository | Owns many |
| GitHubRepository | RepoLanguage, RepoLicense | Has many |
| Language | CreatedReposByLanguage | Has many yearly stats |
| GitHubRepository | BoostLibraryRepository, BoostExternalRepository | Extends (1:1 subtype) |
| GitHubRepository | GitCommit, Issue, PullRequest | Contains many |
| GitHubRepository | GitHubFile | Has many |
| GitHubFile | BoostFile | Extends (1:1 subtype) |
| GitHubFile | GitCommitFileChange | Changed in (many file changes) |
| GitCommit | GitCommitFileChange | Has many |
| Issue | IssueComment, IssueAssignee, IssueLabel | Has many |
| PullRequest | PullRequestReview, PullRequestComment, PullRequestAssignee, PullRequestLabel | Has many |
| GitHubAccount | GitCommit, Issue, IssueComment, IssueAssignee, PullRequest, PullRequestReview, PullRequestComment, PullRequestAssignee | Committer/creator/author/assignee/reviewer |
| BoostLibraryRepository | BoostLibrary | Has many |
| BoostLibrary | BoostFile, BoostDependency (client/dep), BoostLibraryVersion, DependencyChangeLog | Has many |
| BoostLibrary | BoostLibraryCategoryRelationship | Has many |
| BoostVersion | BoostDependency, BoostLibraryVersion | Version |
| BoostLibraryVersion | BoostLibraryRoleRelationship | Has many |
| GitHubAccount | BoostLibraryRoleRelationship | Role (maintainer/author) |
| BoostLibraryCategory | BoostLibraryCategoryRelationship | Category |
| BoostExternalRepository | BoostUsage | Has many |
| BoostUsage | BoostFile, GitHubFile | References (boost header, file path) |
| BoostUsage | BoostMissingHeaderTmp | Has many (temporary missing-header records) |
| MailingListProfile | MailingListMessage | Sender (has many messages) |
| SlackTeam | SlackChannel | Has many |
| SlackChannel | SlackMessage, SlackChannelMembership, SlackChannelMembershipChangeLog | Contains / has many |
| SlackUser | SlackMessage, SlackChannelMembership, SlackChannelMembershipChangeLog | Author / member / user |
| SlackChannel | SlackUser | Creator (many-to-one) |
| WG21PaperAuthorProfile | WG21PaperAuthor | Author (has many) |
| WG21Paper | WG21PaperAuthor | Has many authors |
| YoutubeSpeaker | YouTubeVideoSpeaker | Appears in (many videos) |
| YouTubeChannel | YouTubeVideo | Hosts many videos |
| YouTubeVideo | YouTubeVideoSpeaker | Has many speakers |
| YouTubeVideo | YouTubeVideoTags | Has many tags |
| CppaTags | YouTubeVideoTags | Tagged in many videos |
| DiscordServer | DiscordChannel | Has many channels |
| DiscordChannel | DiscordMessage | Contains many messages |
| DiscordProfile | DiscordMessage | Author (has many messages) |
| DiscordMessage | DiscordReaction | Has many reactions |
| BoostLibraryVersion | BoostLibraryDocumentation | Has many (boost_library_version_id) |
| BoostDocContent | BoostLibraryDocumentation | Used in many (boost_doc_content_id) |