betydata R data package with BETYdb public data export #12
divine7022 wants to merge 24 commits into main
Pull request overview
This PR delivers the initial release (v0.1.0) of betydata, an R data package providing offline access to public data from the BETYdb (Biofuel Ecophysiological Traits and Yields) database. The package enables reproducible analyses of plant traits and crop yields without requiring database connectivity.
Changes:
- Complete R package structure with 16 datasets (traitsview + 15 support tables) totaling 43,532+ trait and yield records
- Multiple data formats: lazy-loaded .rda files, Parquet alternatives, and Frictionless metadata (datapackage.json)
- Comprehensive documentation: roxygen2 docs for all datasets, 4 vignettes (orientation, sql-analogs, pfts-priors, manuscript), and GitHub issue templates
- Quality controls: excludes checked=-1 records, public data only (access_level >= 4), full test coverage
- CI/CD infrastructure: GitHub Actions R-CMD-check workflow, testthat 3.0 test suite
Reviewed changes
Copilot reviewed 38 out of 71 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| DESCRIPTION | Package metadata and dependencies; minor email format issue |
| CITATION.cff | Citation metadata; email and missing preferred-citation issues |
| LICENSE | BSD-3-Clause license file |
| README.md | Comprehensive package documentation; table formatting issue |
| NEWS.md | Release notes documenting v0.1.0 |
| R/betydata-package.R | Package-level documentation |
| R/data.R | Roxygen2 documentation for all 16 datasets |
| man/*.Rd | Generated documentation files for datasets |
| vignettes/*.Rmd | Four tutorial vignettes; minor issues in manuscript.Rmd and pfts-priors.Rmd |
| tests/testthat/*.R | Test suite for data and metadata validation; deprecated context() calls |
| data-raw/make-data.R | Data build script for generating .rda and Parquet files |
| inst/metadata/datapackage.json | Frictionless Data package metadata |
| inst/extdata/parquet/*.parquet | Sample Parquet data files |
| data/*.rda | Binary R data files (compressed with xz) |
| .github/workflows/*.yaml | GitHub Actions CI configuration |
| .github/ISSUE_TEMPLATE/*.md | Issue templates for data corrections and verifications |
| .gitignore, .Rbuildignore | Build and version control configuration; CSV exclusion concern |
Comments suppressed due to low confidence (2)
tests/testthat/test-metadata.R:3
- The `context()` function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses `testthat (>= 3.0.0)` and has `Config/testthat/edition: 3`. The `context()` calls should be removed, as they are no longer needed and will generate warnings.
tests/testthat/test-data.R:3
- The `context()` function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses `testthat (>= 3.0.0)` and has `Config/testthat/edition: 3`. The `context()` calls should be removed, as they are no longer needed and will generate warnings.

    1. betydata excludes `checked = -1` (failed QA/QC records)
    2. Snapshot date: betydata was exported on `r format(Sys.Date(), "%Y-%m-%d")`; the manuscript used 2017 data
    3. Access level filtering: betydata includes only public data (`access_level < 4`)

The access level comparison note contains an error. The text states "access_level < 4" but according to the README and elsewhere in the code, the package includes only public data where "access_level >= 4" (not less than 4). This is the opposite condition and needs to be corrected to ">= 4".

    | Dataset       | Rows   | Columns | Description                                  |
    |---------------|--------|---------|----------------------------------------------|
    | `traitsview`  | 43,532 | 36      | Denormalized view of plant traits and yields |
    | Dataset       | Description                                                   |
    |---------------|---------------------------------------------------------------|
    | `species`     | Plant taxonomy (genus, species, common names)                 |
    | `sites`       | Research site locations with coordinates and climate data     |
    | `variables`   | Trait/variable definitions, units, and valid ranges           |
    | `citations`   | Literature references (author, year, title, DOI)              |
    | `cultivars`   | Plant cultivar and variety information                        |
    | `treatments`  | Experimental treatment definitions                            |
    | `managements` | Management events (planting, harvest, fertilization)          |
    | `methods`     | Measurement method descriptions                               |
    | `pfts`        | Plant Functional Type definitions for ecological modeling     |
    | `priors`      | Prior probability distributions for Bayesian analysis         |
    | `entities`    | Entity identifiers for repeated measures                      |

The README contains a malformed table structure. Lines 31-33 show a table header for the Primary Dataset, but then lines 34-46 continue with a different table that has incompatible headers (missing "Rows" and "Columns" columns). This creates a broken table rendering. The support tables section should have its own separate table header.

    if (length(sla_data) > 10 && exists("x") && exists("y")) {
      # Create plot comparing prior to histogram of data
      ggplot() +
        geom_histogram(
          data = data.frame(sla = sla_data),
          aes(x = sla, y = after_stat(density)),
          bins = 30, fill = "steelblue", alpha = 0.6
        ) +
        geom_line(
          data = data.frame(x = x, y = y),
          aes(x, y),
          color = "red", linewidth = 1, linetype = "dashed"
        ) +
        labs(
          title = "SLA: Prior Distribution vs. Observed Data",
          subtitle = "Red dashed = prior, Blue = observed data (Miscanthus + Panicum)",
          x = "SLA (m2/kg)",
          y = "Density"
        ) +
        xlim(0, 80)
    }

The code at line 200 checks for exists("x") && exists("y") but these variables (x and y) are created within a previous code chunk that only executes conditionally (if (nrow(sla_priors) > 0)). This creates a fragile dependency where the plot will only render if both the SLA priors exist AND the earlier chunk successfully created x and y variables. This code should either store x and y in a way that persists across chunks or restructure the logic to avoid this cross-chunk dependency.
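One possible way to remove the cross-chunk dependency is to compute the prior curve in a small helper that returns `NULL` when no prior is available, so the plotting chunk needs one explicit check instead of `exists()`. The sketch below is an illustration only: `prior_curve()` is a hypothetical helper, the toy `sla_priors` and `sla_data` objects stand in for the vignette's real ones, and the column names `distn`, `parama`, and `paramb` are assumptions about the priors table.

```r
# Toy stand-ins for objects the vignette builds earlier (values are made up).
sla_priors <- data.frame(distn = "gamma", parama = 4, paramb = 0.2)
set.seed(1)
sla_data <- rgamma(50, shape = 4, rate = 0.2)

# Hypothetical helper: returns a data frame with the prior density on a grid,
# or NULL when there is no prior row to work with.
prior_curve <- function(priors, grid = seq(0, 80, length.out = 200)) {
  if (nrow(priors) == 0) {
    return(NULL)
  }
  p <- priors[1, ]
  dens <- switch(
    as.character(p$distn),
    gamma = dgamma(grid, shape = p$parama, rate = p$paramb),
    norm  = dnorm(grid, mean = p$parama, sd = p$paramb),
    lnorm = dlnorm(grid, meanlog = p$parama, sdlog = p$paramb),
    stop("unsupported distribution: ", p$distn)
  )
  data.frame(x = grid, y = dens)
}

# Everything the plot needs is derived here, in one place.
curve_df <- prior_curve(sla_priors)
can_plot <- length(sla_data) > 10 && !is.null(curve_df)
```

The plotting code could then be wrapped in `if (can_plot) { ... }` and pass `curve_df` to `geom_line()`, which keeps the condition and the data it depends on in the same chunk.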
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dlebauer left a comment
I've done a quick first review. On a future review I will go through all of the vignettes and explore the tables as they exist.
I am now wondering (1) whether we should store the data in CSV files to allow text-based version control, and (2) whether we can reconstruct traitsview on the fly from the component datasets (i.e., traitsview should not be in data-raw).
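As a rough illustration of the second idea, a denormalized view can be rebuilt by joining the core table to its lookup tables. This is a minimal sketch on toy data, not the package's build script: the table names, column names, and values below are all hypothetical, and the real traitsview involves more tables and columns.

```r
library(dplyr)

# Toy component tables (hypothetical columns and values).
traits <- tibble(
  id = 1:3,
  species_id = c(10L, 10L, 11L),
  site_id = c(100L, 101L, 100L),
  variable_id = c(5L, 5L, 6L),
  mean = c(12.5, 14.1, 0.9)
)
species <- tibble(
  id = c(10L, 11L),
  scientificname = c("Miscanthus sinensis", "Panicum virgatum")
)
sites <- tibble(
  id = c(100L, 101L),
  sitename = c("Site A", "Site B")
)
variables <- tibble(
  id = c(5L, 6L),
  trait = c("SLA", "leafN"),
  units = c("m2 kg-1", "percent")
)

# Rebuild the wide view: one row per trait record, lookups joined by id.
traitsview_rebuilt <- traits |>
  left_join(species, by = c("species_id" = "id")) |>
  left_join(sites, by = c("site_id" = "id")) |>
  left_join(variables, by = c("variable_id" = "id"))
```

Because every join is a `left_join()` against a lookup table with unique ids, the row count of the rebuilt view matches the core table, which is an easy property to assert in a test.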

          "path": "https://doi.org/10.1111/gcbb.12420"
        }
      ],
      "resources": [

Here it looks like only the `traitsview` dataset has enumerated fields. Is that intentional?
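If the intent is to enumerate fields for every table, the field lists could be derived from the data frames rather than written by hand. The sketch below is one possible approach under stated assumptions: `frictionless_type()` and `table_fields()` are hypothetical helpers, the toy tables are made up, and the class-to-type mapping covers only a few common Table Schema types.

```r
# Hypothetical helper: map an R column to a Table Schema type name.
frictionless_type <- function(x) {
  if (inherits(x, "Date")) return("date")
  if (inherits(x, "POSIXct")) return("datetime")
  if (is.logical(x)) return("boolean")
  if (is.integer(x)) return("integer")
  if (is.numeric(x)) return("number")
  "string"
}

# Hypothetical helper: build a list of field descriptors for one data frame.
table_fields <- function(df) {
  unname(Map(
    function(name, col) list(name = name, type = frictionless_type(col)),
    names(df), df
  ))
}

# Toy tables standing in for the package's real ones.
tables <- list(
  species = data.frame(id = 1:2, genus = c("Miscanthus", "Panicum")),
  sites = data.frame(id = 1:2, lat = c(40.1, 35.9), created = Sys.Date())
)

# One resource entry per table, each with an enumerated schema.
resources <- lapply(names(tables), function(nm) {
  list(name = nm, schema = list(fields = table_fields(tables[[nm]])))
})
```

The resulting list could be serialized with a JSON library such as jsonlite and merged into `datapackage.json`; how that fits into `data-raw/make-data.R` is left open here.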

    ## Available Datasets

    The package exports 16 datasets. List them all:

It is more correct to call betydata a dataset with multiple tables, rather than referring to each table as a 'dataset'.

    names(traitsview)
    ```

    ### Key Columns

I propose that we re-organize traitsview a bit so that the key cols are first, and the ids are all to the right. The goal is to make it easier on end users.
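A reordering like this could be done with `dplyr::relocate()`. The sketch below uses a toy tibble; the choice of `key_cols` and the rule that id columns are `id` plus anything ending in `_id` are assumptions for illustration, not the package's actual column list.

```r
library(dplyr)

# Toy table with ids first, as in a typical database export.
tv <- tibble(
  id = 1:2,
  site_id = c(100L, 101L),
  species_id = c(10L, 11L),
  trait = c("SLA", "Vcmax"),
  mean = c(12.5, 60),
  units = c("m2 kg-1", "umol m-2 s-1"),
  scientificname = c("Miscanthus sinensis", "Panicum virgatum")
)

# Assumed set of columns that end users look at first.
key_cols <- c("trait", "mean", "units", "scientificname")

tv_reordered <- tv |>
  relocate(all_of(key_cols)) |>                          # key columns to the front
  relocate(id, ends_with("_id"), .after = last_col())    # ids to the far right
```

With the key columns first, printing the tibble shows the most useful fields without any manual column selection.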

    ```{r basic-exploration}
    # Preview
    head(traitsview[, c("trait", "mean", "units", "scientificname", "author")])

If we put the key cols first and use tibbles, then the preview could simply be `traitsview`.

    table(traitsview$checked, useNA = "ifany")

    # Work with verified records only
    verified <- traitsview[traitsview$checked == 1, ]

Let's consistently use dplyr verbs; they are easier to read, e.g. `verified <- traitsview |> filter(checked == 1)`.

    ### Relationship Tables

    | Dataset | Description |

Suggested change: `| Dataset | Description |` becomes `| Table | Description |`.

    | `priors`      | Prior probability distributions for Bayesian analysis         |
    | `entities`    | Entity identifiers for repeated measures                      |

    ### Relationship Tables

Briefly explain - how are these used?

    library(dplyr)
    traitsview |> count(trait, sort = TRUE)

    # Count by genus (top bioenergy crops)

These won't be limited to bioenergy crops, since the data are not filtered.
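If the comment in the vignette is meant literally, the count would need an explicit filter first. A minimal sketch on toy data follows; the `bioenergy_genera` vector is an assumption made for illustration, and the vignette could equally keep the unfiltered count and reword the comment instead.

```r
library(dplyr)

# Toy stand-in for traitsview with a mix of genera.
traitsview_toy <- tibble(
  genus = c("Miscanthus", "Panicum", "Miscanthus", "Quercus",
            "Pinus", "Panicum", "Miscanthus"),
  mean = c(12.5, 14.1, 11.8, 9.3, 4.2, 15.0, 13.3)
)

# Assumed list of genera to treat as bioenergy crops.
bioenergy_genera <- c("Miscanthus", "Panicum")

genus_counts <- traitsview_toy |>
  filter(genus %in% bioenergy_genera) |>
  count(genus, sort = TRUE)
```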

    traitsview |> count(trait, sort = TRUE)

    # Count by genus (top bioenergy crops)
    traitsview |> count(genus, sort = TRUE) |> head(10)

I prefer to have a new line after each `|>`. Suggested change:

    - traitsview |> count(genus, sort = TRUE) |> head(10)
    + traitsview |>
    +   count(genus, sort = TRUE)

And then rely on the default printing behavior of tibbles to summarize the tables.

    **Note:** This package exports only `checked >= 0` data. Flagged records (`checked = -1`) are excluded during data preparation. For research requiring unchecked data, access the BETYdb PostgreSQL database directly.

    ### Access Levels

I think we can remove the access_level columns and all references to the 'access_level' other than to say once that this package includes all public data from BETYdb
Summary
Initial release of `betydata`, an R data package providing offline access to public data from BETYdb
- `traitsview` (43,532 rows) + 15 reference tables
- `access_level = 4`, `checked >= 0`
Vignettes
- `orientation`: Package overview and data relationships
- `sql-analogs`: Migrate BETYdb SQL queries to dplyr
- `pfts-priors`: Working with PFTs and Bayesian priors
- `manuscript`: Reproduce LeBauer et al. (2018) analyses
Datasets
Implements #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11