Ideas for eesyscreener should first be raised as a GitHub issue, after which anyone is free to write the code and create a pull request for review.
Please also read and follow our Code of Conduct when participating in the project.
## Introduction
Explore education statistics (EES) is our bespoke statistics publishing platform. For it to work, we rely on standardising the structure of our data files.
This package contains the checks used to enforce our open data standards.
Before contributing, read this guide, skim the package vignettes (especially `assumptions_in_checks.Rmd`), and open up a handful of existing `check_*()` functions with their test files side by side. The checks are small, repetitive and well-exampled — the fastest way to understand how a new one should look is to read an existing one.
## Setting up for development
You’ll need R >= 4.2.0 (see `DESCRIPTION`) and RStudio or a similar R-aware editor.
- Fork the repo on GitHub and clone your fork locally.
- Install the package’s development dependencies:

  ```r
  install.packages(c("devtools", "pak"))
  pak::local_install_dev_deps()
  ```

- Install the `air` formatter — it is used in the pre-PR checklist and CI expects formatted code. Follow the install instructions on the `air` site for your IDE.
- Sanity-check your setup by loading the package and running the test suite:

  ```r
  devtools::load_all()
  devtools::test()
  ```

  The full test run can take a few minutes (integration tests screen full CSVs). See "Running and skipping tests" for how and when to skip the longer tests.
## Before opening a PR

Run through this checklist for every contribution, regardless of what you changed:

1. Regenerate docs — `devtools::document()` (updates `NAMESPACE` and roxygen `.Rd` files).
2. Format — `air format .` from the terminal (or automatically on save in your IDE).
3. Lint — `devtools::load_all(); lintr::lint_package()`.
4. Run the full test suite — `devtools::test()`. Do not merge with tests skipped.
5. Regenerate example output if you added or changed a check in the pipeline, or updated any `data-raw/` source:

   ```r
   devtools::load_all()
   source("data-raw/example_output.R")
   ```

6. Update `NEWS.md` with a bullet under the current development version heading for any user-facing change (new check, new argument, changed behaviour, bug fix). Internal refactors usually do not need an entry.
## Running and skipping tests
By default, all tests run via `devtools::test()` (full package checks run with `devtools::check()`, but those skip the integration tests for speed).
We mix unit tests (quick function tests, a few seconds) with integration tests (full CSV screening, a few minutes). The integration tests are essential for end-to-end coverage on realistic files but slow down iteration. Skip them locally with the `SKIP_INTEGRATION_TESTS` environment variable:
```r
withr::with_envvar(
  c(SKIP_INTEGRATION_TESTS = "true"),
  devtools::test()
)
```

This skips `test-zzz_integration.R`, `test-ees-robot-tests.R`, `test-screen_csv.R` and `test-screen_dfs.R`, bringing the run down from minutes to ~30 seconds. The gating logic lives in `tests/testthat/helper-integration.R`.
Integration tests are skipped on CRAN and in R-CMD-check, but have their own GitHub Action so every PR still covers them.
Always run the full suite (no skip flag) before merging.
## Other kinds of contribution
The “How to add a new check” section below is the most detailed recipe because new checks are the most common contribution, but not the only one:
- Bug fixes — follow the "Before opening a PR" checklist. Add a new test in the existing `tests/testthat/test-<function>.R` file for the function you’re fixing, or a new example CSV pair in one of the `tests/testthat/` folders as appropriate. If adding a new CSV test pair, keep the files as small as possible to minimise the size of the repo.
- Documentation — roxygen changes still require `devtools::document()`. Review the full documentation site locally with `devtools::build_site()`.
- Reference data (e.g. new geographic lookups, updated acceptable values) — edit the matching `data-raw/*.R` script and regenerate the `.rda` via `source("data-raw/<script>.R")`. Also regenerate example output afterwards.
## Use the unit tests as documentation
Every `check_*()` and `precheck_*()` function has a matching `tests/testthat/test-<name>.R`. These tests are the canonical spec for how a check is supposed to behave — they cover the happy path, the singular and plural failure messages, and any edge cases (NAs, empty strings, `"x"` codes, etc.).
If you are unsure what a function does or what counts as a valid / invalid input, read its test file first. It is usually quicker than reading the implementation.
Good reference test files to learn from:

- `tests/testthat/test-check_geog_lad_combos.R` – canonical structure for check tests (PASS / singular FAIL / plural FAIL / edge cases)
- `tests/testthat/test-check_meta_dupe_label.R` – typical metadata-only check
- `tests/testthat/test-check_meta_fil_grp_match.R` – data + meta check
## Package structure
The `screen_*()` functions are the key user-facing exports of the package. `screen_csv()` is expected to be the primary entry point: it takes a pair of CSV files and screens them.
- One script per exported function (except for data objects, or internal helpers in `R/utils.R`).
- All individual checks should be named in accordance with the naming conventions (see "Naming conventions" below).
- Because "check" and "test" are already heavily used in this package, internal functions handling argument validation should follow the `validate_arg_*()` convention.
- All `check_*()` functions must return a consistent list structure (see "Return value" below).
- All `precheck_*()` and `check_*()` functions must use a consistent argument order: `data`/`meta` inputs first, then `verbose = FALSE`, then `stop_on_error = FALSE`, then any function-specific optional parameters. This enables `screen_dfs()` to call checks with positional arguments.
- Do NOT validate arguments inside `check_*()` or `precheck_*()` — validation belongs in the top-level `screen_*()` functions only. See `assumptions_in_checks.Rmd` for why.
- `R/utils.R` contains all internal helpers. Read it in full before writing any filtering, extraction, or transformation logic — many common operations already have helpers (e.g. `get_filters()`, `get_geo_code_cols()`, `remove_nas_blanks()`, `render_url()`).
- `data-raw/` contains the source code for example data and hardcoded reference values. Each `data-raw/*.R` script regenerates a matching `data/*.rda`.
- Use RDS as the main format for permanent test data (beware: it automatically does some cleaning!); make temp CSV files or create a data frame in code if needed.
- Think about dependencies between functions — document any in `assumptions_in_checks.Rmd`.
## Return value
Every check returns `test_output(check_name, result, message, verbose, stop_on_error)` — a single-row data frame with columns `check`, `result` (`"PASS"` / `"FAIL"` / `"WARNING"`), `message` and `guidance_url`. The `stage` column is added by `run_and_log_check()` when the check runs inside `screen_dfs()`.
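For orientation, a passing check's return value has roughly this shape (the values below are illustrative, and the URL is a placeholder; the real object is built by `test_output()`):

```r
# Illustrative sketch of the one-row data frame a check returns
result <- data.frame(
  check = "check_meta_dupe_label",
  result = "PASS",
  message = "All labels are unique.",
  guidance_url = "https://example.org/guidance" # placeholder
)

nrow(result)  # always exactly 1 row per check
```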
## Message patterns
- PASS: `"All labels are unique."` / `"The geographic level values are all valid."`
- FAIL (singular/plural): use `cli::pluralize()` so the same message covers one or many failing values: `"The following label {?is/are} duplicated: 'X'."`
- Format failing values: `paste0("'", paste0(values, collapse = "', '"), "'")`
- Plain text only. Do not include HTML (`<br>`, `<b>`, etc.) — messages are consumed by CLI, API, and Shiny contexts.
## Dependency rules
- Use the `.data$column_name` pronoun (rlang) in dplyr package code.
- Use the base R pipe `|>` (not `%>%`).
- Use `cli::cli_abort()` for errors and `cli::cli_warn()` for warnings (not `stop()` / `warning()`).
- All internal helpers go in `R/utils.R` (`@keywords internal`, `@noRd`).
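A short sketch tying these rules together. `get_blank_labels()` is a hypothetical helper invented for illustration, not one of the real `R/utils.R` functions:

```r
#' Find metadata rows with blank labels (hypothetical example helper)
#' @keywords internal
#' @noRd
get_blank_labels <- function(meta) {
  if (!"label" %in% names(meta)) {
    cli::cli_abort("{.arg meta} must contain a {.field label} column.")
  }
  meta |>                               # base R pipe, not %>%
    dplyr::filter(.data$label == "") |> # .data pronoun avoids lintr notes
    dplyr::pull("label")
}

get_blank_labels(data.frame(label = c("Ethnicity", "", "Gender"))) # returns ""
```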
## Naming conventions
Follow these patterns when naming new check functions. Consistency is crucial, as names are used in `_pkgdown.yml` to group checks for documentation.
### General pattern
```
check_<area>_<what>()    # Content validation (can produce warnings)
precheck_<area>_<what>() # Early validation (blocks on failure)
```
### Area prefixes
| Area | Prefix | Notes |
|---|---|---|
| Columns | `col_` | Generic column properties |
| Metadata | `meta_` | Metadata file validation |
| Data | `data_` | Data file content validation |
| Filters | `filter_` | Filter group logic |
| Indicators | `ind_` | Indicator-specific rules |
| Geography | `geog_` | Geographic hierarchy validation |
| Time | `time_` | Time period validation |
| API | `api_` | API-specific constraints |
| Filename | `filename_` | File naming conventions |
### Abbreviations for brevity
Use these standard abbreviations in function names to keep them concise while remaining readable:
| Full term | Abbreviation | Context |
|---|---|---|
| column | `col` | `check_col_names_spaces` |
| metadata | `meta` | `check_meta_label` |
| filter_group | `fil_grp` | `check_meta_fil_grp` |
| duplicate | `dupe` | `check_meta_dupe_label` |
| is_filter | `is_fil` | `check_meta_fil_grp_is_fil` |
| indicator | `ind` | `check_ind_invalid_entry` |
| decimal_places | `dp` | `check_meta_ind_dp_values` |
| geography | `geog` | `precheck_geog_level` |
| observation_unit | `ob` | `precheck_meta_ob_unit` |
| location | `loc` | `check_api_char_loc_code` |
### Examples
Good naming:

- `check_meta_dupe_label()` – finds duplicate indicator labels in metadata
- `check_meta_fil_grp_is_fil()` – validates that filter group items are valid filters
- `check_api_char_col_name()` – validates column name length for API publishing
- `precheck_geog_level()` – ensures geographic level column exists
- `check_filter_defaults()` – validates default filter selections

Anti-patterns to avoid:

- `check_metadata_duplicate_indicator_label()` – too verbose
- `check_something_validation()` – redundant (checks are validation by nature)
- `check_foo_and_bar()` – combines too many concepts
### Adding to `_pkgdown.yml`
When adding a new check:

1. Add a new `check_*()` or `precheck_*()` function following the naming pattern
2. Update `_pkgdown.yml` to include it in the appropriate category section
3. The documentation site will automatically group it with related checks
Example in `_pkgdown.yml`:
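A sketch of what such a section can look like (the section title is illustrative; copy the exact style from the existing file):

```yaml
reference:
  - title: Metadata checks
    contents:
      - starts_with("check_meta_")
```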
This automatically picks up all `check_meta_*` functions.
## How to add a new check

1. Create the R file – `R/check_<area>_<what>.R` with the standard signature:

   ```r
   #' @export
   check_my_validation <- function(data, meta, verbose = FALSE, stop_on_error = FALSE) {
     check_name <- get_check_name() # never hardcode the name string

     # validation logic

     test_output(
       check_name,
       "PASS", # or "FAIL" / "WARNING"
       "message...",
       verbose = verbose,
       stop_on_error = stop_on_error
     )
   }
   ```

2. Write the tests first or alongside the function – see "How to add a new test" below.

3. Wire it into the pipeline – add a call in `R/screen_dfs.R` in the appropriate stage (`filename`, `precheck_col`, `precheck_meta`, `check_meta`, `precheck_time`, `check_time`, `precheck_geog`, `check_geog`, `check_filter`, `check_api`) via the existing `run_and_log_check()` / `rbind()` pattern.

4. Regenerate example output – required whenever a check is added to the pipeline:

   ```r
   devtools::load_all()
   source("data-raw/example_output.R")
   ```

5. Document – roxygen2 with `@export`, `@examples`, and the appropriate `@inheritParams` and `@inherit ... return` tags to avoid duplicating parameter documentation. Pick the nearest donor from the table below:

   | Params needed | Inherit from |
   |---|---|
   | `meta`, `verbose`, `stop_on_error` | `precheck_meta_col_type` |
   | `data`, `meta`, `verbose`, `stop_on_error` | `precheck_col_to_rows` |
   | `data`, `verbose`, `stop_on_error` | `check_col_names_spaces` |

6. Use `@family` tags that match the `_pkgdown.yml` groupings: `filename`, `precheck_col`, `precheck_meta`, `check_meta`, `precheck_time`, `check_time`, `precheck_geog`, `check_geog`, `check_filter`, `check_api`.

7. Update `_pkgdown.yml` only if a new section is needed (existing sections pick up new functions automatically via `starts_with()`).

8. Run through the "Before opening a PR" checklist. The `test-example_output_coverage.R` test will fail if you forgot to wire the check into `screen_dfs()`.
## How to add a new test
Every check has a test file at `tests/testthat/test-<function_name>.R`. Use `tests/testthat/test-check_geog_lad_combos.R` as the template — it covers the full shape.
A new test file must cover:
- PASS with package example data – usually `example_data` / `example_meta` straight from `R/example_datasets.R`.
- FAIL with a single problem – asserts the singular message form and checks `guidance_url` where relevant.
- FAIL with multiple problems – asserts the plural message form (`cli::pluralize` output) and exercises `stop_on_error = TRUE` with `expect_error(...)`.
- Edge cases – NAs, empty strings, single values, `"x"` not-available codes, `"z"` universal codes, absent optional columns.
### Building test data
- Prefer package example datasets as the base — use `rbind()` or `dplyr::mutate()` on `example_data`, `example_meta`, `example_filter_group_wrow`, etc. (see `R/example_datasets.R`) to introduce the failing condition. Only construct a full inline `data.frame()` when no example dataset has the right schema.
- Extract multi-line construction to a local variable – assign to `bad_data` / `bad_meta` before asserting, then reuse the same variable for the result check and the `stop_on_error` check. Never construct the same data frame twice in one `test_that()` block.
- Edge case tests must actually test the edge case – don’t re-run an existing assertion under a different label; construct data that would fail if the edge case were not handled (e.g. set `filter_grouping_column = NA_character_` to verify NAs are ignored, not just re-run the standard PASS case).
### What not to test
- Do not test argument validation — that is the orchestrator’s job, already covered by `test-screen_dfs.R` / `test-screen_csv.R`.
- Do not test `verbose` / `stop_on_error` plumbing — covered generically by `test-test_output.R`.
- Do not test pipeline integration — covered by `test-example_output_coverage.R` and the integration tests.
## Working with geography
Geography is the most branching area of the package — there are ~18 levels (National, Regional, Local authority, etc.) each with their own code / name columns, lookups and per-level checks. When adding, renaming, or tweaking a geographic level, touch these places:
| File | Role |
|---|---|
| `data-raw/geography_df.R` | The master table of levels and their code / name / secondary-code columns. Edit here first. |
| `data-raw/acceptable_geog_combos.R` | Per-level lookups of valid code / name combinations (e.g. `acceptable_lads`, `acceptable_pcons`). |
| `R/utils.R` | `get_geo_code_cols()` / `get_geo_name_cols()` return the full list of geography code / name columns. These drive several generic checks — update when adding a level. |
| `R/check_geog_combos.R` | Houses `.check_geog_combos()` (shared implementation) plus the thin per-level wrappers (`check_geog_lad_combos()`, `check_geog_pcon_combos()`, etc.). To add a level, add another wrapper that calls `.check_geog_combos()` with the right `code_col`, `name_col`, `acceptable_data`, and `restricted_level`. |
| `R/screen_dfs.R` | Wire the new per-level combos check into the `check_geog` stage. |
| `tests/testthat/test-check_geog_<level>_combos.R` | Copy the closest existing level’s test (e.g. `test-check_geog_lad_combos.R`) and adapt. |
| `_pkgdown.yml` | Only needs editing if you rename the section heading; new `check_geog_*` functions are picked up by `starts_with()`. |
After editing `data-raw/`, regenerate the `.rda` files:
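The exact commands depend on which script you edited; for example (script names here follow the table above):

```r
devtools::load_all()
source("data-raw/geography_df.R")
source("data-raw/acceptable_geog_combos.R")
```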
Then regenerate example output if any check or lookup changed:

```r
source("data-raw/example_output.R")
```

## Stylistic preferences
Efficiency is a big priority for this package — we need to keep it fast so the Shiny app and API endpoint stay responsive on large files.
The `screen_csv()` function runs checks lazily on any data file above 5 MB, using duckplyr methods that overwrite the dplyr ones. Files under 5 MB are materialised immediately and use the plain dplyr methods. This gives the ‘Hovis Best of Both’: simple, low-overhead processing for small files, while still leveraging DuckDB’s lazy power for larger datasets.
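A hedged sketch of that size-based branch. The function name and threshold handling are illustrative, not the real `screen_csv()` internals, and this assumes `duckplyr::read_csv_duckdb()` as the lazy DuckDB-backed reader:

```r
# Illustrative sketch only -- not the real screen_csv() source
read_for_screening <- function(path, lazy_threshold = 5e6) {
  if (file.size(path) > lazy_threshold) {
    # Large file: lazy DuckDB-backed frame; checks become DuckDB queries
    duckplyr::read_csv_duckdb(path)
  } else {
    # Small file: materialise immediately; plain dplyr methods apply
    utils::read.csv(path)
  }
}
```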
- Profile performance and use the fastest available approach
- Test on large files (5 million rows and above), and prioritise large-file performance over small-file performance
- Avoid duplication between functions — lift shared logic into `R/utils.R`
- Use `dplyr` verbs that `duckplyr` can translate to DuckDB. `data.table`, another traditionally fast R framework for data processing, is not necessary and would force data.frame ↔ data.table switching.
If linting flags dplyr column names with “no visible binding” notes, follow the guide to using dplyr in packages.
You can use `tests/utils/profiling.R` as a starting point for experiments on large tables.
### duckplyr fallbacks and silent materialisation
duckplyr will fall back to plain dplyr (often verbosely) when it cannot translate an operation to DuckDB, and some operations will quietly materialise a lazy table and defeat the performance work. Both categories are worth keeping an eye on.
See the "Diagnosing duckplyr fallbacks" vignette for the common triggering patterns, patterns that work fine, diagnostic recipes, and how to use `prudence = "stingy"` to catch unintended materialisation. `test-avoid_materialisation.R` runs the materialisation check in CI.
### Avoid per-column query loops
The costliest anti-pattern we found in this codebase was iterating over columns and firing a separate DuckDB query per column:
```r
# Anti-pattern: 1 query per column
for (col in data_cols) {
  vals <- data |>
    dplyr::select(dplyr::all_of(col)) |>
    dplyr::distinct() |>
    dplyr::pull(1)
  ...
}
```

On a 6 M-row file with 55 columns, this pattern cost ~135 s for `check_general_null` alone. The fix is `summarise(across(...))`, which DuckDB executes as a single aggregation pass in around 9 s.
### Boolean presence check (does any row match?)
```r
# Good: 1 query, all character columns at once
char_cols <- names(dplyr::select(data, dplyr::where(is.character)))

result_row <- data |>
  dplyr::summarise(dplyr::across(
    dplyr::all_of(char_cols),
    ~ sum(. %in% target_values) > 0
  )) |>
  dplyr::collect()

cols_that_match <- names(result_row)[unlist(result_row, use.names = FALSE)]
```

Why restrict to character columns? DuckDB cannot compare a BIGINT column to a string literal without an explicit cast, and `as.character()` inside `across()` has no SQL translation in duckplyr. String target values can only appear in VARCHAR columns anyway, so skipping numeric columns is both safe and necessary. See `check_general_null.R` and `check_ind_invalid_entry.R`.
Why pre-compute `char_cols` outside the pipeline? `where(is.character)` works fine in `select()` (schema inspection only, no data query). But inside `summarise(across(where(is.character), ...))` duckplyr tries to translate the predicate to SQL and fails. Always extract column names first, then pass a plain character vector via `all_of()`.
### Distinct-count per column
```r
# Good: 1 query, COUNT(DISTINCT col) for every column
counts_row <- data |>
  dplyr::summarise(dplyr::across(
    dplyr::all_of(cols),
    ~ dplyr::n_distinct(.)
  )) |>
  dplyr::collect()

counts <- setNames(unlist(counts_row, use.names = FALSE), cols)
```

Note `~ dplyr::n_distinct(.)` — the `dplyr::` prefix is required. Bare `n_distinct` is not found in the `across()` evaluation environment. See `check_filter_item_limit.R` and `check_filter_group_level.R`.
### Stingy safety
Any new multi-column aggregation must remain lazy until `collect()`. Run the stingy check locally before opening a PR:

```r
devtools::test(filter = "avoid_materialisation")
```

If it fails, the operation materialised before `collect()`. The usual fix is the `char_cols` pre-computation above or removing an untranslatable function from the SQL pipeline.
