Add MIT ML Data Guide and US Data.gov sources #1


Closed

yujiawei wants to merge 4 commits into MLT-OSS:main from yujiawei:main

firstdata/sources/academic/ai-ml/README.md

@@ -0,0 +1,135 @@
+    # AI & Machine Learning
+    **Total**: 15+ data sources
+    **Completed**: 12
+    **Progress**: 80%
+    ---
+    ## 📊 Overall Progress
+    ```
+    Target: 15+ high-quality AI/ML data sources
+    Completed so far: 12
+    Completion: ████████░░ 80%
+    ```
+    ---
+    ## 📚 Included Data Sources
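+    ### 🗂️ Dataset Platforms (4)
+    #### Hugging Face Datasets
+    - **File**: [huggingface-datasets.json](huggingface-datasets.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: dataset platform, model training data, benchmarks
+    - **Coverage**: global, 2020-present, 100,000+ datasets
+    - **Update frequency**: continuous
+    - **Highlights**: de facto standard of the AI/ML community; supports NLP, CV, audio, and multimodal data
+    #### Kaggle Datasets
+    - **File**: [kaggle-datasets.json](kaggle-datasets.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: datasets, competitions, notebooks, models
+    - **Coverage**: global, 2010-present, 200,000+ datasets
+    - **Update frequency**: continuous
+    - **Highlights**: owned by Google, 15M+ users, a standard of the data science community
+    #### UCI Machine Learning Repository
+    - **File**: [uci-ml-repository.json](uci-ml-repository.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: datasets, benchmarks
+    - **Coverage**: global, 1987-present, 600+ datasets
+    - **Update frequency**: monthly
+    - **Highlights**: 30+ years of history, 100,000+ paper citations, gold standard for ML benchmarks
+    #### OpenML
+    - **File**: [openml.json](openml.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: datasets, benchmarks, experiments, ML pipelines
+    - **Coverage**: global, 2013-present, 5,000+ datasets, 10M+ experiments
+    - **Update frequency**: continuous
+    - **Highlights**: reproducible research, scikit-learn integration, EU-funded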
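+    ### 🔍 Search & Discovery (2)
+    #### Google Dataset Search
+    - **File**: [google-dataset-search.json](google-dataset-search.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: search engine, datasets, metadata
+    - **Coverage**: global, 2018-present, 25M+ indexed datasets
+    - **Update frequency**: continuous
+    - **Highlights**: Google Research product, cross-platform discovery
+    #### Papers With Code
+    - **File**: [papers-with-code.json](papers-with-code.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: papers, code, datasets, benchmarks, SOTA results
+    - **Coverage**: global, 2012-present, 300,000+ papers, 6,000+ benchmarks
+    - **Update frequency**: daily
+    - **Highlights**: tracks SOTA for each task, acquired by Meta AI, standard reference for the research community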
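+    ### ☁️ Large-Scale Cloud Data (2)
+    #### AWS Registry of Open Data
+    - **File**: [aws-open-data.json](aws-open-data.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: datasets, satellite, genomics, climate, geospatial
+    - **Coverage**: global, 2017-present, 400+ datasets, PB scale
+    - **Update frequency**: continuous
+    - **Highlights**: partnered with NASA, NOAA, and NIH; free cloud access
+    #### Microsoft Research Open Data
+    - **File**: [microsoft-research-open-data.json](microsoft-research-open-data.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: datasets, NLP, computer vision, social networks
+    - **Coverage**: global, 2018-present, 100+ datasets
+    - **Update frequency**: quarterly
+    - **Highlights**: official Microsoft Research release, widely cited benchmark datasets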
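+    ### 📄 Academic Papers & Preprints (2)
+    #### arXiv Computer Science
+    - **File**: [arxiv-cs.json](arxiv-cs.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: preprint papers, research
+    - **Coverage**: global, 1991-present, 2.4M+ papers
+    - **Update frequency**: daily
+    - **Highlights**: first-release platform for AI/ML research, operated by Cornell University
+    #### ACL Anthology
+    - **File**: [acl-anthology.json](acl-anthology.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: papers, proceedings, datasets
+    - **Coverage**: global, 1965-present, 90,000+ papers
+    - **Update frequency**: continuous
+    - **Highlights**: most authoritative archive in NLP; papers from top venues such as ACL, EMNLP, and NAACL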
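+    ### 🏆 Benchmarks & Standards (1)
+    #### MLCommons
+    - **File**: [mlcommons.json](mlcommons.json) ⭐💎
+    - **Authority level**: industry
+    - **Type**: benchmarks, datasets, standards
+    - **Coverage**: global, 2018-present, 10+ datasets, 5+ benchmark suites
+    - **Update frequency**: semiannual
+    - **Highlights**: MLPerf industry standard, joint effort of Google/NVIDIA/Intel/Meta
+    ### 🗄️ Data Archives (1)
+    #### Zenodo (Machine Learning)
+    - **File**: [zenodo-ml.json](zenodo-ml.json) ⭐💎
+    - **Authority level**: research
+    - **Type**: datasets, code, papers, models
+    - **Coverage**: global, 2013-present, 3M+ records
+    - **Update frequency**: continuous
+    - **Highlights**: operated by CERN, permanent DOIs, standard for academic archiving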
+    ---
+    ## 🎯 To Be Added
+    - [ ] ImageNet (computer vision benchmark)
+    - [ ] Common Crawl (web corpus)
+    - [ ] The Pile (large language model pretraining)
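
The progress figures in this README (12 of 15 sources, 80%, eight filled cells out of ten) are entered by hand, so they can drift as sources are added. A minimal sketch of how the bar could be derived from the two counts instead; the function name `render_progress` and the ten-cell width are assumptions for illustration, not part of this PR:

```python
def render_progress(done: int, target: int, width: int = 10) -> str:
    """Render a text progress bar such as '████████░░ 80%'."""
    if target <= 0:
        raise ValueError("target must be positive")
    done = max(0, min(done, target))       # clamp so the bar never overflows
    filled = round(done * width / target)  # number of filled cells
    percent = round(done * 100 / target)
    return "█" * filled + "░" * (width - filled) + f" {percent}%"


print(render_progress(12, 15))
```

With the counts stated in the README (12 completed, target 15) this reproduces the bar shown in the file.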

firstdata/sources/academic/ai-ml/acl-anthology.json

@@ -0,0 +1,34 @@
+    {
+      "id": "acl-anthology",
+      "name": {
+        "zh": "ACL Anthology",
+        "en": "ACL Anthology"
+      },
+      "description": {
+        "zh": "计算语言学协会论文库，NLP 领域最权威的学术论文档案，包含 ACL、EMNLP、NAACL 等顶会论文",
+        "en": "Paper archive of the Association for Computational Linguistics, the most authoritative collection of NLP papers, including those from top venues such as ACL, EMNLP, and NAACL"
+      },
+      "url": "https://aclanthology.org/",
+      "authority_level": "research",
+      "authority_justification": {
+        "zh": "计算语言学协会官方维护，NLP 领域所有顶级会议和期刊的唯一权威档案",
+        "en": "Officially maintained by ACL, the only authoritative archive for all top NLP conferences and journals"
+      },
+      "data_type": ["papers", "proceedings", "datasets"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "1965-present",
+        "indicators": "90,000+ papers"
+      },
+      "update_frequency": "continuous",
+      "access_method": {
+        "web": "https://aclanthology.org/",
+        "api": "https://aclanthology.org/anthology+abstracts.bib.gz",
+        "github": "https://github.com/acl-org/acl-anthology"
+      },
+      "license": "CC BY 4.0",
+      "languages": ["en"],
+      "tags": ["nlp", "computational-linguistics", "papers", "conferences", "emnlp", "acl"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }
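
Every record in this PR shares the same top-level keys, with bilingual `zh`/`en` objects for `name`, `description`, and `authority_justification`. The sketch below checks a record against that shape. The key list is inferred from the records shown in this diff and is an assumption, not a schema confirmed by the repository; `validate_source` is a hypothetical helper name.

```python
from datetime import date

# Keys observed in every record added by this PR (inferred, not an official schema).
REQUIRED_KEYS = {
    "id", "name", "description", "url", "authority_level",
    "authority_justification", "data_type", "coverage", "update_frequency",
    "access_method", "license", "languages", "tags", "verified", "last_verified",
}
BILINGUAL_KEYS = ("name", "description", "authority_justification")


def validate_source(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for key in BILINGUAL_KEYS:
        if key in record:
            value = record[key]
            # Bilingual fields must be objects carrying both translations.
            if not (isinstance(value, dict) and {"zh", "en"} <= set(value)):
                problems.append(f"{key} must have both 'zh' and 'en' entries")
    if "last_verified" in record:
        try:
            date.fromisoformat(str(record["last_verified"]))
        except ValueError:
            problems.append("last_verified must be an ISO date (YYYY-MM-DD)")
    return problems
```

A check like this could run in CI over each `*.json` file after `json.load`, failing the build when the returned list is non-empty.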

firstdata/sources/academic/ai-ml/arxiv-cs.json

@@ -0,0 +1,34 @@
+    {
+      "id": "arxiv-cs",
+      "name": {
+        "zh": "arXiv 计算机科学",
+        "en": "arXiv Computer Science"
+      },
+      "description": {
+        "zh": "康奈尔大学运营的开放获取预印本服务器，是 AI/ML 研究的首发平台，每天数百篇新论文",
+        "en": "Open-access preprint server operated by Cornell University, the primary venue for AI/ML research with hundreds of new papers daily"
+      },
+      "url": "https://arxiv.org/list/cs.AI/recent",
+      "authority_level": "research",
+      "authority_justification": {
+        "zh": "康奈尔大学运营，获得西蒙斯基金会等资助，是全球物理学和计算机科学的标准预印本平台",
+        "en": "Operated by Cornell University, funded by Simons Foundation, the standard preprint platform for physics and CS globally"
+      },
+      "data_type": ["papers", "preprints", "research"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "1991-present",
+        "indicators": "2.4M+ papers, 50,000+ CS.AI papers"
+      },
+      "update_frequency": "daily",
+      "access_method": {
+        "web": "https://arxiv.org",
+        "api": "https://info.arxiv.org/help/api/index.html",
+        "bulk": "https://info.arxiv.org/help/bulk_data.html"
+      },
+      "license": "varies by paper (mostly CC BY)",
+      "languages": ["en"],
+      "tags": ["artificial-intelligence", "machine-learning", "deep-learning", "nlp", "computer-vision", "preprints"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }
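
The `api` entry above links to the arXiv API documentation rather than to an endpoint. As a rough illustration of what a query for recent `cs.AI` papers might look like, the sketch below builds a request URL and extracts titles from the Atom response. The endpoint `http://export.arxiv.org/api/query` and its `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters come from the arXiv API documentation, not from this PR; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # from arXiv's API docs, not this PR
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category: str = "cs.AI", max_results: int = 5) -> str:
    """Build a URL requesting the most recently submitted papers in a category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_titles(atom_xml: str) -> list[str]:
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(atom_xml)
    return [
        " ".join(entry.findtext("atom:title", default="", namespaces=ATOM_NS).split())
        for entry in root.findall("atom:entry", ATOM_NS)
    ]
```

Fetching is left out on purpose: the response body from `urllib.request.urlopen(build_query_url())` can be decoded and passed to `parse_titles`, subject to arXiv's rate-limit guidance.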

firstdata/sources/academic/ai-ml/aws-open-data.json

@@ -0,0 +1,34 @@
+    {
+      "id": "aws-open-data",
+      "name": {
+        "zh": "AWS 开放数据注册表",
+        "en": "AWS Registry of Open Data"
+      },
+      "description": {
+        "zh": "Amazon 托管的大规模公开数据集，包括卫星图像、基因组、气候等 PB 级数据",
+        "en": "Amazon-hosted large-scale public datasets including satellite imagery, genomics, climate data at PB scale"
+      },
+      "url": "https://registry.opendata.aws/",
+      "authority_level": "industry",
+      "authority_justification": {
+        "zh": "AWS 官方维护，与 NASA、NOAA、NIH 等机构合作，提供免费云端访问",
+        "en": "Officially maintained by AWS in partnership with agencies such as NASA, NOAA, and NIH; provides free cloud access"
+      },
+      "data_type": ["datasets", "satellite", "genomics", "climate", "geospatial"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "2017-present",
+        "indicators": "400+ datasets, PB scale"
+      },
+      "update_frequency": "continuous",
+      "access_method": {
+        "web": "https://registry.opendata.aws/",
+        "aws_cli": "aws s3 ls s3://[dataset-bucket]",
+        "api": "S3 API"
+      },
+      "license": "varies by dataset",
+      "languages": ["en"],
+      "tags": ["cloud", "satellite", "genomics", "climate", "geospatial", "big-data"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }
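
The `aws_cli` entry leaves the bucket as a placeholder. The sketch below assembles that `aws s3 ls` command for a given bucket. The `--no-sign-request` flag is a standard AWS CLI option for unauthenticated access; whether a particular dataset permits it is an assumption to check per dataset. The bucket name is made up, and `s3_ls_command` is a hypothetical helper.

```python
import shlex


def s3_ls_command(bucket: str, prefix: str = "", anonymous: bool = True) -> list[str]:
    """Build the argument list for listing a public dataset bucket."""
    cmd = ["aws", "s3", "ls", f"s3://{bucket}/{prefix}"]
    if anonymous:
        # Skip credential signing; only works for buckets that allow public reads.
        cmd.append("--no-sign-request")
    return cmd


# "example-open-data-bucket" is a placeholder, not a real registry entry.
print(shlex.join(s3_ls_command("example-open-data-bucket")))
```

Returning a list instead of a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.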

firstdata/sources/academic/ai-ml/google-dataset-search.json

@@ -0,0 +1,32 @@
+    {
+      "id": "google-dataset-search",
+      "name": {
+        "zh": "Google 数据集搜索",
+        "en": "Google Dataset Search"
+      },
+      "description": {
+        "zh": "Google 提供的数据集搜索引擎，索引全网 2500万+ 数据集，支持跨平台发现",
+        "en": "Google's dataset search engine, indexing 25M+ datasets across the web for cross-platform discovery"
+      },
+      "url": "https://datasetsearch.research.google.com/",
+      "authority_level": "industry",
+      "authority_justification": {
+        "zh": "Google Research 产品，基于 schema.org/Dataset 标准，覆盖主流数据平台",
+        "en": "Google Research product, based on schema.org/Dataset standard, covers major data platforms"
+      },
+      "data_type": ["search_engine", "datasets", "metadata"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "2018-present",
+        "indicators": "25,000,000+ indexed datasets"
+      },
+      "update_frequency": "continuous",
+      "access_method": {
+        "web": "https://datasetsearch.research.google.com/"
+      },
+      "license": "N/A (search engine)",
+      "languages": ["en", "zh", "multilingual"],
+      "tags": ["search-engine", "datasets", "discovery", "metadata", "cross-platform"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }

firstdata/sources/academic/ai-ml/huggingface-datasets.json

@@ -0,0 +1,34 @@
+    {
+      "id": "huggingface-datasets",
+      "name": {
+        "zh": "Hugging Face Datasets",
+        "en": "Hugging Face Datasets"
+      },
+      "description": {
+        "zh": "全球最大的开源机器学习数据集平台，托管超过100,000个数据集，涵盖NLP、计算机视觉、音频、多模态等领域",
+        "en": "The world's largest open-source ML dataset platform, hosting 100,000+ datasets covering NLP, computer vision, audio, and multimodal domains"
+      },
+      "url": "https://huggingface.co/datasets",
+      "authority_level": "industry",
+      "authority_justification": {
+        "zh": "Hugging Face 是 AI/ML 社区的事实标准平台，获得超过 $400M 融资，与 Google、Amazon、Microsoft 等合作",
+        "en": "Hugging Face is the de facto standard platform for the AI/ML community; it has raised $400M+ and partners with Google, Amazon, and Microsoft"
+      },
+      "data_type": ["datasets", "model_training_data", "benchmarks"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "2020-present",
+        "indicators": "100,000+ datasets"
+      },
+      "update_frequency": "continuous",
+      "access_method": {
+        "web": "https://huggingface.co/datasets",
+        "api": "https://huggingface.co/docs/datasets/",
+        "python": "pip install datasets; from datasets import load_dataset"
+      },
+      "license": "varies by dataset",
+      "languages": ["en", "zh", "multilingual"],
+      "tags": ["machine-learning", "nlp", "computer-vision", "audio", "multimodal", "open-source"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }

firstdata/sources/academic/ai-ml/kaggle-datasets.json

@@ -0,0 +1,34 @@
+    {
+      "id": "kaggle-datasets",
+      "name": {
+        "zh": "Kaggle Datasets",
+        "en": "Kaggle Datasets"
+      },
+      "description": {
+        "zh": "全球最大的数据科学竞赛平台，托管 200,000+ 公开数据集，涵盖各行业和研究领域",
+        "en": "World's largest data science competition platform, hosting 200,000+ public datasets across industries and research domains"
+      },
+      "url": "https://www.kaggle.com/datasets",
+      "authority_level": "industry",
+      "authority_justification": {
+        "zh": "Google 旗下平台，1500万+ 注册用户，数据科学社区事实标准",
+        "en": "Owned by Google, 15M+ registered users, de facto standard for data science community"
+      },
+      "data_type": ["datasets", "competitions", "notebooks", "models"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "2010-present",
+        "indicators": "200,000+ datasets, 50,000+ notebooks"
+      },
+      "update_frequency": "continuous",
+      "access_method": {
+        "web": "https://www.kaggle.com/datasets",
+        "api": "https://www.kaggle.com/docs/api",
+        "cli": "pip install kaggle; kaggle datasets download"
+      },
+      "license": "varies by dataset",
+      "languages": ["en"],
+      "tags": ["machine-learning", "data-science", "competitions", "notebooks", "tabular"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }

firstdata/sources/academic/ai-ml/microsoft-research-open-data.json

@@ -0,0 +1,33 @@
+    {
+      "id": "microsoft-research-open-data",
+      "name": {
+        "zh": "微软研究院开放数据",
+        "en": "Microsoft Research Open Data"
+      },
+      "description": {
+        "zh": "微软研究院发布的研究数据集，涵盖 NLP、计算机视觉、社交网络等领域",
+        "en": "Research datasets released by Microsoft Research, covering NLP, computer vision, social networks and more"
+      },
+      "url": "https://msropendata.com/",
+      "authority_level": "industry",
+      "authority_justification": {
+        "zh": "微软研究院官方发布，包含多个被广泛引用的基准数据集",
+        "en": "Officially released by Microsoft Research, includes widely-cited benchmark datasets"
+      },
+      "data_type": ["datasets", "nlp", "computer_vision", "social_networks"],
+      "coverage": {
+        "geographic": "global",
+        "temporal": "2018-present",
+        "indicators": "100+ datasets"
+      },
+      "update_frequency": "quarterly",
+      "access_method": {
+        "web": "https://msropendata.com/",
+        "azure": "Azure Blob Storage"
+      },
+      "license": "varies by dataset (mostly research use)",
+      "languages": ["en"],
+      "tags": ["nlp", "computer-vision", "social-networks", "research", "microsoft"],
+      "verified": true,
+      "last_verified": "2026-02-02"
+    }
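
Because each source lives in its own JSON file with a `tags` array, the directory can be treated as a small catalogue. A minimal sketch of loading the records and grouping their `id` values by tag follows; the function names are hypothetical, and the directory layout is assumed to match the paths shown in this diff.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_sources(directory: str) -> list[dict]:
    """Load every *.json record in a directory, in filename order."""
    sources = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:  # records contain Chinese text
            sources.append(json.load(fh))
    return sources


def index_by_tag(sources: list[dict]) -> dict[str, list[str]]:
    """Map each tag to the ids of the sources that carry it."""
    index = defaultdict(list)
    for source in sources:
        for tag in source.get("tags", []):
            index[tag].append(source["id"])
    return dict(index)
```

For the records in this diff, `index_by_tag(load_sources("firstdata/sources/academic/ai-ml"))["nlp"]` would be expected to include `acl-anthology`, `arxiv-cs`, `huggingface-datasets`, and `microsoft-research-open-data`, since all four list the `nlp` tag.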
