An overview
Explore education statistics (EES) is our bespoke statistics publishing platform. For it to work, we rely on standardising the structure of our data files.
This package contains the checks used to enforce our open data standards.
The package contains:
- A core screen_csv() function, built from screen_dfs() and screen_filenames(), plus the constituent individual checks
- Functions to generate test data
- Example datasets to aid testing and demonstration
- Additional functions to aid in screening / preparing data for EES
If you’ve come here because your data is failing the screening and
you want help in figuring out why, start off with our
Common file failures article.
screen_csv()
screen_csv() is the core function that reads the CSV
files and then runs all checks from screen_filenames() and
screen_dfs(), providing full screening for a given pair of
CSV files.
It is the key function used in both the plumber API and
the R Shiny app. It returns a structured list containing a data frame of
results, the stage the screening reached, and booleans for overall pass
and API suitability. With verbose = TRUE it also prints
messages to the console, and with stop_on_error = TRUE it
throws on the first FAIL — useful inside analyst pipelines as a hard QA
gate.
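As a rough illustration of the hard-gate pattern described above, the sketch below wraps a call that throws on failure so that a pipeline can decide what to do next. `run_qa_gate()` and the two stub functions are hypothetical names made up for this example. In real use, the function passed in would be something like `function() eesyscreener::screen_csv(data_file, meta_file, stop_on_error = TRUE)`.

```r
# Hypothetical wrapper: turns a thrown error into a logged message and a
# FALSE return value, so the calling pipeline can branch on the outcome.
run_qa_gate <- function(screen_fun) {
  tryCatch(
    {
      screen_fun()
      TRUE
    },
    error = function(e) {
      message("Screening failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Stand-ins for a screening call (assumptions, not package functions):
# one that throws as if a check had failed, and one that completes quietly.
failing_stub <- function() stop("hypothetical check failed")
passing_stub <- function() invisible(NULL)

gate_fail <- run_qa_gate(failing_stub)
gate_pass <- run_qa_gate(passing_stub)
```

Leaving out the wrapper and letting the error propagate is the simpler choice when the pipeline should halt outright.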
screen_filenames()
Runs all of the filename screening checks in one go. Exported so it
can be used in its own right to check file name pairings. Used within
screen_csv().
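To make "file name pairings" concrete, here is a minimal base R sketch of one such pairing rule. It is an illustration only and not the package's implementation: it assumes the convention, suggested by the examples in this article, that the metadata file carries the data file's name with `.meta.csv` in place of `.csv`. `filenames_pair_up()` is a hypothetical helper.

```r
# Illustration only: not the checks eesyscreener runs internally.
# Assumes the metadata file shares the data file's name, with ".meta.csv"
# in place of ".csv".
filenames_pair_up <- function(datafilename, metafilename) {
  expected_meta <- sub("\\.csv$", ".meta.csv", datafilename)
  grepl("\\.csv$", datafilename) && identical(expected_meta, metafilename)
}

pair_ok <- filenames_pair_up("data.csv", "data.meta.csv")
pair_bad <- filenames_pair_up("data.csv", "other.meta.csv")
```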
screen_dfs()
Runs all of the checks against the data and metadata data frames,
once read in from the CSV files. Used within
screen_csv().
Ordering matters — later checks carry assumptions that earlier ones
passed. This prevents duplication and simplifies the logic in later
checks, at the cost of some complexity for maintainers and anyone
wanting to use individual checks in isolation. The
assumptions_in_checks.Rmd vignette walks through those
assumptions.
Checks are grouped into stages, which are mirrored in the
documentation for the individual functions. _pkgdown.yml
shows the stages in the order they are run. Stages are grouped by area
of the data they relate to, and pre-checks are run first because the
main checks for each area depend on them passing.
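The staged, order-dependent design described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: `make_result()`, `run_stages()`, the column names, the check logic and the "Check columns" stage name are all hypothetical, and each check is assumed to return a one-row data frame.

```r
# Hypothetical helper mimicking a one-row check result.
make_result <- function(check, result, stage) {
  data.frame(check = check, result = result, stage = stage,
             stringsAsFactors = FALSE)
}

# Two made-up stages: a pre-check, then a main check that relies on it.
stages <- list(
  "Precheck columns" = list(
    function(df) {
      ok <- "time_period" %in% names(df)
      make_result("has_time_period", if (ok) "PASS" else "FAIL",
                  "Precheck columns")
    }
  ),
  "Check columns" = list(
    function(df) {
      # Relies on the pre-check having confirmed the column exists
      ok <- all(nchar(df$time_period) == 4)
      make_result("time_period_length", if (ok) "PASS" else "FAIL",
                  "Check columns")
    }
  )
)

# Run stages in order, stopping after the first stage containing a FAIL.
run_stages <- function(df, stages) {
  results <- NULL
  reached <- NA_character_
  for (stage_name in names(stages)) {
    reached <- stage_name
    stage_results <- do.call(
      rbind, lapply(stages[[stage_name]], function(check) check(df))
    )
    results <- rbind(results, stage_results)
    if (any(stage_results$result == "FAIL")) break
  }
  list(
    results = results,
    overall_stage = reached,
    passed = !any(results$result == "FAIL")
  )
}

res_good <- run_stages(data.frame(time_period = c(2013, 2014)), stages)
res_bad <- run_stages(data.frame(year = 2013), stages)
```

Stopping at the first failing stage is what lets the later check skip its own existence test, which is the trade-off between simpler checks and harder standalone use noted above.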
Each individual precheck / check returns a single-row data frame of results, with optional console output and error triggers. The standard pattern is:
test_output(
get_check_name(), # name of check function
"PASS", # result of check, one of 'PASS', 'FAIL', 'WARNING'
paste0("'", filename, "' does not contain any special characters."), # feedback message, plain text, no HTML
"https://dfe-analytical-services.github.io/analysts-guide/", # optional URL for guidance (defaults to NA if omitted)
verbose = FALSE, # whether to print messages to console
stop_on_error = FALSE # whether to stop execution and throw on FAIL / warn on WARNING
)
Generating test files
generate_test_dfs() creates matching data and metadata
frames for any number of time periods, locations, filters and
indicators.
files <- eesyscreener::generate_test_dfs(
years = 2013:2015,
pcon_names = "Sheffield Central",
pcon_codes = "E14000919",
num_filters = 2,
num_indicators = 3
)
df <- files$data
df_meta <- files$meta
Going bigger
To stress-test on realistic volumes, combine
generate_test_dfs() with the dfeR package
to pass in vectors of Parliamentary Constituencies. Row count is
length(years) * length(pcon_codes) * (5 ^ num_filters).
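Before generating a large frame, it can help to estimate its size from the formula above. The helper below is a small sketch of that arithmetic; `expected_rows()` is a hypothetical name, not a package function.

```r
# Expected number of data rows, using the formula quoted above
expected_rows <- function(years, pcon_codes, num_filters) {
  length(years) * length(pcon_codes) * (5^num_filters)
}

# The small example earlier: 3 years, 1 constituency, 2 filters
small_rows <- expected_rows(2013:2015, "E14000919", 2)
small_rows # 75
```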
pcons <- dfeR::fetch_pcons(countries = "England")
beefy <- eesyscreener::generate_test_dfs(
years = c(1980:2025),
pcon_codes = pcons$pcon_code,
pcon_names = pcons$pcon_name,
num_filters = 3,
num_indicators = 45,
verbose = TRUE
)
# duckplyr is dramatically faster than base R for writing large CSVs
# (~20 seconds vs ~6 minutes for the frame above)
duckplyr::compute_csv(beefy$data, "beefy_data.csv")
duckplyr::compute_csv(beefy$meta, "beefy_data.meta.csv")
Reading the screen_csv() output
The minimal example in the README shows the output structure for a file that runs the whole pipeline. The three scenarios below cover the shapes you are most likely to see when something goes wrong — a file that fails a check, a file that cannot be read at all, and a file that passes screening but is not suitable for publishing through the API.
File that fails a check
Drop a required metadata column to trigger a FAIL in the
Precheck columns stage. passed will be
FALSE and overall_stage will be the failing
stage.
data_file <- tempfile(fileext = ".csv")
meta_file <- tempfile(fileext = ".meta.csv")
duckplyr::compute_csv(eesyscreener::example_data, data_file)
duckplyr::compute_csv(eesyscreener::example_meta[, -1], meta_file)
eesyscreener::screen_csv(data_file, meta_file, "data.csv", "data.meta.csv")
File that cannot be read
If either path does not exist (or the file is not a readable CSV),
screening stops in the File read stage without throwing an
error. passed will be FALSE and
overall_stage will be "File read".
eesyscreener::screen_csv(
"does_not_exist.csv",
"does_not_exist.meta.csv"
)
File that passes but is not API suitable
The API checks only ever emit warnings — they do not stop screening —
but any warning in the API stage prevents the data from being published
through the API. The returned api_suitable boolean flags
this.
data_file <- tempfile(fileext = ".csv")
meta_file <- tempfile(fileext = ".meta.csv")
duckplyr::compute_csv(eesyscreener::example_api_long, data_file)
duckplyr::compute_csv(eesyscreener::example_api_long_meta, meta_file)
eesyscreener::screen_csv(data_file, meta_file, "data.csv", "data.meta.csv")
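The three scenarios can be told apart from the returned booleans and stage alone. The sketch below shows one way to do that, using mock lists in place of real screen_csv() output. `summarise_screening()` is a hypothetical helper, and the mock values other than "File read" are placeholders rather than values the package is known to return.

```r
# Hypothetical helper: map a screening result to a one-line summary
summarise_screening <- function(res) {
  if (!isTRUE(res$passed)) {
    paste0("Failed at stage: ", res$overall_stage)
  } else if (!isTRUE(res$api_suitable)) {
    "Passed screening, but not suitable for the API"
  } else {
    "Passed screening and suitable for the API"
  }
}

# Mock lists standing in for screen_csv() output (placeholder values)
unreadable <- list(
  passed = FALSE, overall_stage = "File read", api_suitable = FALSE
)
api_warned <- list(
  passed = TRUE, overall_stage = "placeholder final stage", api_suitable = FALSE
)

summary_unreadable <- summarise_screening(unreadable)
summary_api_warned <- summarise_screening(api_warned)
```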