Add datasets: query_imdb, query_krama, and query_civic_unstructured #46

Open
Ruiying-Ma wants to merge 23 commits into main from ruiying-datasets-only

Ruiying-Ma commented May 5, 2026

Datasets Added

imdb

  • Domain: Movie/entertainment industry (IMDB JOB benchmark)
  • Databases: PostgreSQL (movies_db) + SQLite (people.sqlite)
  • Properties:
    • Multi-DB integration: queries span PostgreSQL and SQLite
    • Ill-formatted join keys: identifier columns use non-standard string encodings (e.g. tt0000042, nm001, InfT~~7); agents must strip prefixes and leading zeros before joining
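
A minimal sketch of the key cleanup described above, assuming the join key is the trailing integer of each encoded identifier. The helper name `normalize_id` is hypothetical, and the real columns may need different rules.

```python
import re


def normalize_id(raw: str) -> int:
    """Return the trailing integer of an encoded identifier.

    Assumption: the digits at the end of the string are the join key and
    everything before them (tt, nm, InfT~~, ...) is a prefix to discard.
    """
    match = re.search(r"(\d+)\s*$", raw)
    if match is None:
        raise ValueError(f"no numeric suffix in {raw!r}")
    # int() drops any leading zeros.
    return int(match.group(1))


examples = ["tt0000042", "nm001", "InfT~~7"]
normalized = [normalize_id(value) for value in examples]
print(normalized)  # [42, 1, 7]
```

Normalizing both sides to integers before comparing avoids mismatches such as '001' vs '1'.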

krama

  • Domain: Multi-domain scientific data (wildfire, environment, legal, biomedical, astronomy, archeology)
  • Databases: MongoDB (domain_docs.bson for csv/txt files) + SQLite (us_geo.db for gpkg files and a synthetic beach_water_temperature table) + SQLite (domain_assets.db for other files)
  • Properties:
    • Multi-DB integration: queries span MongoDB, SQLite, and heterogeneous flat files
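    • Ill-formatted join keys: beach codes appear as plain integers in one source and with a 'COM-XXX' prefix in another; UCEC patient IDs appear as 'S001' in one source and 's_1' in another
    • Unstructured-text-transformation: GACC, location, and document info are merged into a natural-language description field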
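
The sketch below illustrates how a cross-database join over these keys might look. It is not the dataset's actual schema: a list of dicts stands in for MongoDB documents, the column names and sample rows are invented, and which side carries the 'COM-' prefix is an assumption. Only the table name beach_water_temperature comes from the description above.

```python
import re
import sqlite3


def beach_key(code) -> int:
    """Map 7 or 'COM-007' to 7 (assumes the XXX part is numeric)."""
    match = re.search(r"\d+", str(code))
    if match is None:
        raise ValueError(f"no digits in beach code {code!r}")
    return int(match.group())


def patient_key(pid: str) -> int:
    """Map 'S001' or 's_1' to 1."""
    match = re.fullmatch(r"[sS]_?(\d+)", pid.strip())
    if match is None:
        raise ValueError(f"unexpected patient id {pid!r}")
    return int(match.group(1))


# Hypothetical stand-in for documents read from MongoDB.
docs = [
    {"beach_code": "COM-007", "name": "North Beach"},
    {"beach_code": "COM-099", "name": "South Beach"},
]

# Hypothetical stand-in for the SQLite side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE beach_water_temperature (beach_code INTEGER, temp_c REAL)")
conn.executemany(
    "INSERT INTO beach_water_temperature VALUES (?, ?)",
    [(7, 18.5), (12, 21.0)],
)

# Index the SQLite rows by normalized key, then join in Python.
temps = {
    beach_key(code): temp
    for code, temp in conn.execute(
        "SELECT beach_code, temp_c FROM beach_water_temperature"
    )
}
joined = [
    (doc["name"], temps[beach_key(doc["beach_code"])])
    for doc in docs
    if beach_key(doc["beach_code"]) in temps
]
print(joined)  # [('North Beach', 18.5)]
```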

civic_unstructured

  • Domain: Civic agenda reports
  • Databases: MongoDB (reports) + SQLite (funding data)
  • Properties:
    • Multi-DB integration: queries span MongoDB, SQLite, and heterogeneous flat files
    • Ill-formatted join keys: report names
    • Unstructured-text-transformation: parse meeting dates and project statuses and types from reports
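
As a rough illustration of the text-transformation step, the sketch below pulls a meeting date out of free text. The report wording and the "Month D, YYYY" date format are assumptions made for the example; the actual reports may phrase dates differently and would need additional patterns.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed format: an English month name, a day, a comma, a four-digit year.
DATE_RE = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def parse_meeting_date(text: str) -> Optional[date]:
    """Return the first parseable 'Month D, YYYY' date in text, or None."""
    for candidate in DATE_RE.findall(text):
        try:
            return datetime.strptime(candidate, "%B %d, %Y").date()
        except ValueError:
            # Matched the shape but not a real month or day; keep looking.
            continue
    return None


# Hypothetical report excerpt, not taken from the dataset.
report = "City Council Agenda Report\nMeeting Date: May 5, 2026\nStatus: approved"
print(parse_meeting_date(report))  # 2026-05-05
```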

Each dataset has 10 queries with ground-truth CSVs and validate.py scripts.
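
The validate.py scripts themselves are not shown in this description. One plausible shape for such a check, offered only as a sketch, is an order-insensitive comparison of a result CSV against the ground-truth CSV. The function names and the choice to ignore the header row are assumptions.

```python
import csv
import io
from collections import Counter


def load_rows(csv_text: str) -> Counter:
    """Parse CSV text into a multiset of rows, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumption: the first line is a header
    return Counter(tuple(cell.strip() for cell in row) for row in reader if row)


def results_match(result_csv: str, truth_csv: str) -> bool:
    """True when both CSVs contain the same rows, in any order."""
    return load_rows(result_csv) == load_rows(truth_csv)


truth = "id,title\n1,A\n2,B\n"
print(results_match("id,title\n2,B\n1,A\n", truth))  # True
print(results_match("id,title\n1,A\n", truth))       # False
```

Comparing multisets rather than lists tolerates different row orderings while still catching missing or duplicated rows.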

Ruiying-Ma changed the title from "Ruiying datasets only" to "Add datasets query_imdb and query_krama" on May 6, 2026
Ruiying-Ma changed the title from "Add datasets query_imdb and query_krama" to "Add datasets: query_imdb, query_krama, and query_civic_unstructured" on May 18, 2026