betydata R data package with BETYdb public data export by divine7022 · Pull Request #12 · PecanProject/betydata

divine7022 · 2026-02-11T20:24:15Z

Summary

Initial release of betydata, an R data package providing offline access to
public data from BETYdb

16 datasets: traitsview (43,532 rows) + 15 reference tables
Multiple formats: .rda (lazy-loaded), Parquet, Frictionless datapackage.json
Filtered to public data only (access_level = 4, checked >= 0)
Complete roxygen2 documentation for all datasets
Package-level documentation with BETYdb context
Data quality policy in README (checked column, access levels)

Vignettes

orientation: Package overview and data relationships
sql-analogs: Migrate BETYdb SQL queries to dplyr
pfts-priors: Working with PFTs and Bayesian priors
manuscript: Reproduce LeBauer et al. (2018) analyses

Datasets

Dataset	Description
traitsview	Primary trait/yield observations (43,532 × 36)
species	Plant taxonomy
sites	Research site locations
variables	Trait definitions and units
citations	Literature references
pfts	Plant functional types
priors	Bayesian prior distributions
+ 9 more	Support and relationship tables

implements #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11

Copilot

Pull request overview

This PR delivers the initial release (v0.1.0) of betydata, an R data package providing offline access to public data from the BETYdb (Biofuel Ecophysiological Traits and Yields) database. The package enables reproducible analyses of plant traits and crop yields without requiring database connectivity.

Changes:

Complete R package structure with 16 datasets (traitsview + 15 support tables) totaling 43,532+ trait and yield records
Multiple data formats: lazy-loaded .rda files, Parquet alternatives, and Frictionless metadata (datapackage.json)
Comprehensive documentation: roxygen2 docs for all datasets, 4 vignettes (orientation, sql-analogs, pfts-priors, manuscript), and GitHub issue templates
Quality controls: excludes checked=-1 records, public data only (access_level >= 4), full test coverage
CI/CD infrastructure: GitHub Actions R-CMD-check workflow, testthat 3.0 test suite

Reviewed changes

Copilot reviewed 38 out of 71 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
DESCRIPTION	Package metadata and dependencies; minor email format issue
CITATION.cff	Citation metadata; email and missing preferred-citation issues
LICENSE	BSD-3-Clause license file
README.md	Comprehensive package documentation; table formatting issue
NEWS.md	Release notes documenting v0.1.0
R/betydata-package.R	Package-level documentation
R/data.R	Roxygen2 documentation for all 16 datasets
man/*.Rd	Generated documentation files for datasets
vignettes/*.Rmd	Four tutorial vignettes; minor issues in manuscript.Rmd and pfts-priors.Rmd
tests/testthat/*.R	Test suite for data and metadata validation; deprecated context() calls
data-raw/make-data.R	Data build script for generating .rda and Parquet files
inst/metadata/datapackage.json	Frictionless Data package metadata
inst/extdata/parquet/*.parquet	Sample Parquet data files
data/*.rda	Binary R data files (compressed with xz)
.github/workflows/*.yaml	GitHub Actions CI configuration
.github/ISSUE_TEMPLATE/*.md	Issue templates for data corrections and verifications
.gitignore, .Rbuildignore	Build and version control configuration; CSV exclusion concern

Comments suppressed due to low confidence (2)

tests/testthat/test-metadata.R:3

The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.

tests/testthat/test-data.R:3

The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.
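
For illustration, a minimal sketch of a third-edition test file with no context() call. The `toy` table and its columns are hypothetical stand-ins, not the package's actual tests or data.

```r
library(testthat)

# Hypothetical stand-in for a packaged table; the real tests would load betydata objects.
toy <- data.frame(
  trait = c("SLA", "Vcmax"),
  mean = c(15.2, 48.0),
  checked = c(1L, 0L)
)

# No context() call: in the third edition the file name identifies the group of tests.
test_that("toy table has the expected columns and no flagged records", {
  expect_true(all(c("trait", "mean", "checked") %in% names(toy)))
  expect_true(all(toy$checked >= 0))
})
```

Dropping the call is usually the whole fix; the test_that() descriptions already say what each block checks.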


Copilot · 2026-02-19T18:47:47Z

vignettes/manuscript.Rmd

+
+1. betydata excludes `checked = -1` (failed QA/QC records)
+2. Snapshot date: betydata was exported on `r format(Sys.Date(), "%Y-%m-%d")`; the manuscript used 2017 data
+3. Access level filtering: betydata includes only public data (`access_level < 4`)


The access level comparison note contains an error. The text states "access_level < 4" but according to the README and elsewhere in the code, the package includes only public data where "access_level >= 4" (not less than 4). This is the opposite condition and needs to be corrected to ">= 4".
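
A small sketch of why the direction of the comparison matters, assuming the convention stated in this PR that access_level 4 marks public data. The `records` table is hypothetical.

```r
library(dplyr)

# Hypothetical rows; under this PR's convention access_level 4 marks public data.
records <- data.frame(
  id = 1:5,
  access_level = c(1L, 2L, 3L, 4L, 4L),
  checked = c(1L, 0L, 1L, -1L, 1L)
)

# Filter as described in the README: public and not flagged.
public <- records |>
  filter(access_level >= 4, checked >= 0)

# Condition as written in the vignette note: it selects the restricted rows instead.
wrong <- records |>
  filter(access_level < 4, checked >= 0)

public$id  # 5
wrong$id   # 1 2 3
```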

CITATION.cff

DESCRIPTION

Copilot · 2026-02-19T18:47:48Z

README.md

+| Dataset       | Rows   | Columns | Description                                  |
+|---------------|--------|---------|----------------------------------------------|
+| `traitsview`  | 43,532 | 36      | Denormalized view of plant traits and yields |
+| Dataset       | Description                                                   |
+|---------------|---------------------------------------------------------------|
+| `species`     | Plant taxonomy (genus, species, common names)                 |
+| `sites`       | Research site locations with coordinates and climate data     |
+| `variables`   | Trait/variable definitions, units, and valid ranges           |
+| `citations`   | Literature references (author, year, title, DOI)              |
+| `cultivars`   | Plant cultivar and variety information                        |
+| `treatments`  | Experimental treatment definitions                            |
+| `managements` | Management events (planting, harvest, fertilization)          |
+| `methods`     | Measurement method descriptions                               |
+| `pfts`        | Plant Functional Type definitions for ecological modeling     |
+| `priors`      | Prior probability distributions for Bayesian analysis         |
+| `entities`    | Entity identifiers for repeated measures                      |


The README contains a malformed table structure. Lines 31-33 show a table header for the Primary Dataset, but then lines 34-46 continue with a different table that has incompatible headers (missing "Rows" and "Columns" columns). This creates a broken table rendering. The support tables section should have its own separate table header.

Copilot · 2026-02-19T18:47:48Z

vignettes/pfts-priors.Rmd

+if (length(sla_data) > 10 && exists("x") && exists("y")) {
+  # Create plot comparing prior to histogram of data
+  ggplot() +
+    geom_histogram(
+      data = data.frame(sla = sla_data),
+      aes(x = sla, y = after_stat(density)),
+      bins = 30, fill = "steelblue", alpha = 0.6
+    ) +
+    geom_line(
+      data = data.frame(x = x, y = y),
+      aes(x, y),
+      color = "red", linewidth = 1, linetype = "dashed"
+    ) +
+    labs(
+      title = "SLA: Prior Distribution vs. Observed Data",
+      subtitle = "Red dashed = prior, Blue = observed data (Miscanthus + Panicum)",
+      x = "SLA (m2/kg)",
+      y = "Density"
+    ) +
+    xlim(0, 80)
+}


The code at line 200 checks for exists("x") && exists("y") but these variables (x and y) are created within a previous code chunk that only executes conditionally (if (nrow(sla_priors) > 0)). This creates a fragile dependency where the plot will only render if both the SLA priors exist AND the earlier chunk successfully created x and y variables. This code should either store x and y in a way that persists across chunks or restructure the logic to avoid this cross-chunk dependency.
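
One possible restructuring, sketched under assumptions: the prior curve lives in a single object that is always defined (NULL when no prior is found), so the plotting chunk can test it directly instead of calling exists(). The column names `distn`, `parama`, `paramb` and all values here are assumed for illustration.

```r
# Hypothetical stand-ins for the vignette's objects; column names and values are assumed.
sla_priors <- data.frame(distn = "gamma", parama = 2.5, paramb = 0.15)
set.seed(1)
sla_data <- rgamma(50, shape = 2.5, rate = 0.15)

# Always define the object, so later chunks never depend on stray globals x and y.
prior_curve <- NULL
if (nrow(sla_priors) > 0) {
  x <- seq(0, 80, length.out = 200)
  prior_curve <- data.frame(
    x = x,
    y = dgamma(x, shape = sla_priors$parama[1], rate = sla_priors$paramb[1])
  )
}

# Plotting chunk: one explicit condition on objects that are guaranteed to exist.
can_plot <- length(sla_data) > 10 && !is.null(prior_curve)
if (can_plot) {
  # The existing ggplot() call would go here, with geom_line(data = prior_curve, aes(x, y)).
  message("Prior curve available with ", nrow(prior_curve), " points")
}
```

The same guard could also be folded into the chunk option (for example `eval = can_plot`), but that is a stylistic choice rather than a requirement.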

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

dlebauer

I've done a quick first review. On a future review I will go through all of the vignettes and explore the tables as they exist.

I am now wondering 1) whether we should store the data in CSV files to allow text-based version control, and 2) whether we can reconstruct traitsview on the fly from the component datasets (i.e., traitsview should not be in data-raw).
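
A rough sketch of both ideas, using heavily reduced, hypothetical component tables. The real column names and join keys may differ, so treat this as an outline rather than the actual build script.

```r
library(dplyr)

# Hypothetical component tables; real column names and keys may differ.
traits <- data.frame(
  id = 1:3,
  specie_id = c(10L, 10L, 11L),
  variable_id = c(100L, 101L, 100L),
  mean = c(15.2, 48.0, 12.7)
)
species <- data.frame(
  specie_id = c(10L, 11L),
  scientificname = c("Miscanthus giganteus", "Panicum virgatum")
)
variables <- data.frame(
  variable_id = c(100L, 101L),
  trait = c("SLA", "Vcmax"),
  units = c("m2 kg-1", "umol m-2 s-1")
)

# Rebuild a denormalized view by joining the lookup tables onto the observations.
traitsview_rebuilt <- traits |>
  left_join(species, by = "specie_id") |>
  left_join(variables, by = "variable_id") |>
  select(trait, mean, units, scientificname, everything())

# Plain-text storage for version control: one CSV per component table.
csv_path <- file.path(tempdir(), "species.csv")
write.csv(species, csv_path, row.names = FALSE)
```

If the view is rebuilt this way, only the component tables need to be stored, and diffs on the CSV files show exactly which records changed between snapshots.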

dlebauer · 2026-02-21T05:28:29Z

inst/metadata/datapackage.json

+      "path": "https://doi.org/10.1111/gcbb.12420"
+    }
+  ],
+  "resources": [


Here it looks like only the traitsview dataset has enumerated fields. Is that intentional?
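
One way to check this quickly, sketched against a trimmed, hypothetical descriptor rather than the real inst/metadata/datapackage.json. It assumes the usual Frictionless layout, where each resource may carry a `schema` with a `fields` array.

```r
library(jsonlite)

# Hypothetical, trimmed descriptor; resource names mirror the PR, fields are illustrative.
descriptor <- fromJSON('{
  "resources": [
    {"name": "traitsview",
     "schema": {"fields": [{"name": "trait", "type": "string"},
                           {"name": "mean", "type": "number"}]}},
    {"name": "species"},
    {"name": "sites"}
  ]
}', simplifyVector = FALSE)

# Count the fields declared in each resource's schema (0 when none are listed).
n_fields <- vapply(
  descriptor$resources,
  function(r) length(r$schema$fields),
  integer(1)
)
names(n_fields) <- vapply(descriptor$resources, function(r) r$name, character(1))

n_fields
names(n_fields)[n_fields == 0]  # resources without enumerated fields
```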

dlebauer · 2026-02-21T05:30:21Z

vignettes/orientation.Rmd

+
+## Available Datasets
+
+The package exports 16 datasets. List them all:


It is more correct to call betydata a dataset with multiple tables than to refer to each table as a 'dataset'.

dlebauer · 2026-02-21T05:33:29Z

vignettes/orientation.Rmd

+names(traitsview)
+```
+
+### Key Columns


I propose that we re-organize traitsview a bit so that the key cols are first, and the ids are all to the right. The goal is to make it easier on end users.
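
A sketch of what that reordering could look like with dplyr, on a hypothetical slice of a traitsview-like table. Which columns count as "key" is an assumption here.

```r
library(dplyr)

# Hypothetical slice of a traitsview-like table with ids mixed in among key columns.
tv <- tibble(
  id = 1:2,
  site_id = c(5L, 6L),
  trait = c("SLA", "Vcmax"),
  specie_id = c(10L, 11L),
  mean = c(15.2, 48.0),
  units = c("m2 kg-1", "umol m-2 s-1"),
  scientificname = c("Miscanthus giganteus", "Panicum virgatum")
)

# Key columns first, then identifier columns moved to the far right.
tv_reordered <- tv |>
  relocate(trait, mean, units, scientificname) |>
  relocate(id, ends_with("_id"), .after = last_col())

names(tv_reordered)
```

Doing this once in the data build script would keep the column order stable for every vignette and README example.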

dlebauer · 2026-02-21T05:35:03Z

vignettes/orientation.Rmd

+
+```{r basic-exploration}
+# Preview
+head(traitsview[, c("trait", "mean", "units", "scientificname", "author")])


If we put the key cols first and use tibbles, then the preview could simply be:

traitsview

dlebauer · 2026-02-21T05:36:00Z

vignettes/orientation.Rmd

+table(traitsview$checked, useNA = "ifany")
+
+# Work with verified records only
+verified <- traitsview[traitsview$checked == 1, ]


Let's consistently use dplyr verbs. They are easier to read:

verified <- traitsview |> filter(checked == 1)
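
Beyond readability, the two forms can differ when `checked` contains missing values. A small sketch with hypothetical records:

```r
library(dplyr)

# Hypothetical records, including one with a missing `checked` value.
tv <- data.frame(
  trait = c("SLA", "Vcmax", "SLA", "LMA"),
  checked = c(1L, 0L, NA, 1L)
)

# Base subsetting returns an all-NA row wherever the condition evaluates to NA.
verified_base <- tv[tv$checked == 1, ]

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped.
verified_dplyr <- tv |>
  filter(checked == 1)

nrow(verified_base)   # 3
nrow(verified_dplyr)  # 2
```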

dlebauer · 2026-02-21T05:52:59Z

README.md

+
+### Relationship Tables
+
+| Dataset                    | Description                    |


dlebauer · 2026-02-21T05:53:24Z

README.md

+| `priors`      | Prior probability distributions for Bayesian analysis         |
+| `entities`    | Entity identifiers for repeated measures                      |
+
+### Relationship Tables


Briefly explain - how are these used?

dlebauer · 2026-02-21T05:54:04Z

README.md

+library(dplyr)
+traitsview |> count(trait, sort = TRUE)
+
+# Count by genus (top bioenergy crops)


These won't be limited to bioenergy crops, since they are not filtered.

dlebauer · 2026-02-21T05:55:31Z

README.md

+traitsview |> count(trait, sort = TRUE)
+
+# Count by genus (top bioenergy crops)
+traitsview |> count(genus, sort = TRUE) |> head(10)


I prefer to have a new line after each |>

Suggested change

-traitsview |> count(genus, sort = TRUE) |> head(10)
+traitsview |>
+  count(genus, sort = TRUE)

And then rely on the default printing behavior of tibbles to summarize the tables.
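
For example, with a hypothetical genus column stored as a tibble, the result of count() is itself a tibble, so printing it gives a compact summary without head():

```r
library(dplyr)

# Hypothetical genus column; the real table has many more rows and columns.
tv <- tibble(
  genus = c("Miscanthus", "Panicum", "Miscanthus", "Populus", "Miscanthus", "Panicum")
)

# One verb per line; the result is a tibble, so printing already truncates long output.
genus_counts <- tv |>
  count(genus, sort = TRUE)

genus_counts
```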

dlebauer · 2026-02-21T05:56:27Z

README.md

+
+**Note:** This package exports only `checked >= 0` data. Flagged records (`checked = -1`) are excluded during data preparation. For research requiring unchecked data, access the BETYdb PostgreSQL database directly.
+
+### Access Levels


I think we can remove the access_level columns and all references to the 'access_level' other than to say once that this package includes all public data from BETYdb

divine7022 added 18 commits February 11, 2026 14:22

update DESCRIPTION

1866c45

add .Rbuildignore

ff1485f

add git workflows

4c1bada

ad gitignore

c8d41b5

add citation

86bee56

add NEWS.md

dce94e3

add dataset doc

0a1b59b

add readme

164df97

add csv data

6d69f73

add .rda files

b3f17cb

add inst/

f632581

add .Rd files

29e0fd0

add unit tests

41362a5

add vignettes

cdd8d18

add .html and csv to ignore

69ce06a

update r cmd check

afa3163

add PULL_REQUEST_TEMPLATE.md

dbdf052

add maps to DESCRIPTION

f4444ac

divine7022 requested a review from dlebauer February 11, 2026 21:13

divine7022 added 5 commits February 11, 2026 16:48

add license

b6f9ada

udpate .Rd

11bc4ec

add citation to .Rbuildignore

25653dc

add jsonlit to DESCRIPTION

3faa298

update roxy

0265a5c

dlebauer requested a review from Copilot February 19, 2026 18:42

Copilot started reviewing on behalf of dlebauer February 19, 2026 18:42

Copilot AI reviewed Feb 19, 2026


clarify citation

4e0d685

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

dlebauer requested changes Feb 21, 2026



3 participants