# Contributing to eesyscreener

Source: `.github/CONTRIBUTING.md`
Ideas for eesyscreener should first be raised as a GitHub issue, after which anyone is free to write the code and create a pull request for review.
## Introduction
Explore education statistics (EES) is our bespoke statistics publishing platform. For it to work, we rely on standardising the structure of our data files.
This package contains the checks used to enforce our open data standards.
Before contributing to the package, you should read this document and the vignettes describing the package to understand the logic and decisions made. In particular, read the `assumptions_in_checks.Rmd` vignette, as that provides crucial information about interdependencies between functions within the package.
## Package structure
The `screen_*()` functions are the key user-facing exports of the package. `screen_csv()` is expected to be the primary function used; it takes a pair of CSV files and screens them.
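As a rough illustration, usage looks something like the sketch below. The file names and the shape of the call are assumptions for this sketch rather than the documented interface, so check the `screen_csv()` help page for the real signature.

```r
# Hypothetical usage sketch - the file names are made up and the argument
# structure is an assumption; see ?eesyscreener::screen_csv for the real API.
results <- eesyscreener::screen_csv(
  "absence_by_term.csv",      # data file
  "absence_by_term.meta.csv"  # accompanying metadata file
)
```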
- One script per exported function (except for data objects, or internal functions in `R/utils.R`)
- All individual checks should be named in accordance with the naming conventions in the `_pkgdown.yml` file
- Due to the extensive use of check / test in this package, internal functions handling argument validation should follow the `validate_arg_*()` convention
- All `check_*()` functions must return a consistent list structure (see the sketch after this list)
- `R/utils.R` contains all internal functions
- `data-raw/` contains the source code for example data and hardcoded variables
- Use RDS as the main format for permanent test data (beware it automatically does some cleaning!); make temporary CSV files or create a data.frame in code if needed
- Think about dependencies between functions - explain any in the `assumptions_in_checks.Rmd` vignette
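To make the consistent list structure point concrete, here is a minimal sketch of the shape a `check_*()` function could take. The function name and the returned element names (`result`, `message`) are assumptions made for illustration; mirror an existing `check_*()` function for the structure actually used in the package.

```r
# Illustrative sketch only - the list element names are assumptions, not the
# package's actual contract. Copy an existing check_*() function instead.
check_time_identifier <- function(data) {
  invalid <- setdiff(
    unique(as.character(data$time_identifier)),
    eesyscreener::acceptable_time_ids
  )

  if (length(invalid) == 0) {
    list(result = "PASS", message = "All time_identifier values are valid.")
  } else {
    list(
      result = "FAIL",
      message = paste0(
        "Invalid time_identifier values found: ",
        paste(invalid, collapse = ", ")
      )
    )
  }
}
```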
## Stylistic preferences
Efficiency is a big priority for this package; we need to keep it fast so that the Shiny app and API endpoint stay responsive even with larger files:

- performance profile and use the fastest available functions
- test on large files (5 million rows and above), and prioritise large file performance over small file performance
- avoid duplication between functions
- don't validate arguments in individual `precheck_*()` or `check_*()` functions
duckplyr seems to be the fastest approach. Take the following example code, aimed simply at checking whether the `time_identifier` values are valid:
```r
# Base R
setdiff(
  unique(as.character(data$time_identifier)),
  eesyscreener::acceptable_time_ids
)

# dplyr / duckplyr (methods_overwrite / methods_restore)
data |>
  dplyr::distinct(.data$time_identifier) |>
  dplyr::anti_join(
    data.frame("time_identifier" = eesyscreener::acceptable_time_ids),
    by = "time_identifier"
  ) |>
  dplyr::pull(time_identifier)
```
At first glance the base R looks simpler (less code), so presumably faster…
Run through microbenchmark on a ~6 million row data.frame, the results in milliseconds were:

- base R: ~3,231 ms
- dplyr: ~61.6 ms
- duckplyr: ~5.7 ms
For `eesyscreener::example_data`, a tiny file, the results were different, though as mentioned above we should prioritise the larger files as that is where the greatest impact is felt:

- base R: ~62.6 microseconds
- dplyr: ~1,865 microseconds
- duckplyr: ~4,774 microseconds
If Frederick hadn’t already told us enough times, duck is king. We can write nice readable dplyr code, but utilise the power of the duck. Best of both worlds.
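The comment in the example above mentions `methods_overwrite()` / `methods_restore()`. As a rough sketch of how that pairing can be used (how `screen_dfs()` actually enables duckplyr isn't shown here, so treat this as illustrative rather than the package's implementation):

```r
# Route the standard dplyr verbs through duckplyr for the duration of a
# pipeline, then put the original dplyr methods back afterwards.
duckplyr::methods_overwrite()

invalid_ids <- data |>
  dplyr::distinct(.data$time_identifier) |>
  dplyr::anti_join(
    data.frame("time_identifier" = eesyscreener::acceptable_time_ids),
    by = "time_identifier"
  ) |>
  dplyr::pull(time_identifier)

duckplyr::methods_restore()
```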
If you have issues with linting and dplyr variables triggering "no visible binding" notes, follow the guide to using dplyr in packages.
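The usual fix, sketched below assuming you take the `.data` pronoun approach from that guide (the helper function name here is made up):

```r
# Importing the .data pronoun makes column references explicit in package code,
# which silences the "no visible binding for global variable" notes.
#' @importFrom rlang .data
distinct_time_ids <- function(data) {
  data |>
    dplyr::distinct(.data$time_identifier) |>
    dplyr::pull("time_identifier")
}
```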
## duckplyr messages
duckplyr is particularly verbose and will tell you when it's falling back to dplyr. You may often get messages like the following:
```
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, data will be collected but no data will
be uploaded.
ℹ Automatic fallback uploading is not controlled and therefore disabled, see
  ?duckplyr::fallback().
✔ Number of reports ready for upload: 1.
```
These are safe to ignore, though they detail the places where duckplyr has tried and failed to run. Often, if you investigate through `duckplyr::fallback_review()`, you can pinpoint where in the code you tried to get duckplyr to do something it didn't want to do.

Uploading the reports using `duckplyr::fallback_upload()` will clear them from your machine and submit them to the maintainers.

If you don't care to look at them any more, you can set them to auto-upload (and stop them shouting at you) with `duckplyr::fallback_config(autoupload = TRUE)`.
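Putting those calls together (the same functions named above, collected here for convenience):

```r
# Review where duckplyr fell back to dplyr, upload the collected reports, and
# optionally switch on automatic uploading so the messages stop appearing.
duckplyr::fallback_review()
duckplyr::fallback_upload()
duckplyr::fallback_config(autoupload = TRUE)
```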
## Process for moving in functions from app
A tracking spreadsheet has been set up in SharePoint; this contains a table of all checks, thoughts on new groupings, and notes about the migration (including progress).
An example of adding a new check is available in PR XX.
Other commits showing checks being added:

- `precheck_col_req_data()`, also showing the addition of the `req_data_cols` data object - https://github.com/dfe-analytical-services/eesyscreener/commit/98a84e3e7733e99c0fefc90ddb7d876f793ae780
- `precheck_meta_col_name()`, a simple one that simplifies the original function - https://github.com/dfe-analytical-services/eesyscreener/commit/7e1908f76d5a14d29363fbba45452da8dba60775
- Tackle one screening check at a time (unless you can create utility functions that cover multiple repeated checks)
- Set up a new script and test script for the check, copying over the relevant code from https://github.com/dfe-analytical-services/dfe-published-data-qa
- Add unit tests for the function
- Adapt the function so that the input arguments and outputs match existing `check_*()` functions
- Integrate the new `check_*()` into the appropriate `screen_dfs()`
- Document the code, inheriting documentation from existing functions where possible:
  - There will be a lot of functions in this package, so we want to minimise the amount of duplicated code
  - Use `@inheritParams function_name` to copy in param definitions from another function
  - Use `#' @inherit check_filename_spaces return` to copy the return documentation for functions (see the sketch below)
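As a rough sketch of what that documentation reuse can look like (everything here other than the `@inheritParams` / `@inherit` tags and `check_filename_spaces` is made up for illustration):

```r
#' Check a hypothetical rule (illustrative example only)
#'
#' @inheritParams check_filename_spaces
#' @inherit check_filename_spaces return
#' @keywords internal
check_example_rule <- function(filename) {
  # The argument name is assumed to match one documented by
  # check_filename_spaces(); the check logic would go here.
}
```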
- Once set up, review and improve the code. Make sure the tests are passing and you've committed before this point, to ensure you don't inadvertently break anything!
- Have a look through examples of simplifying the code to see if anything could apply to the function you're working on, e.g. simplifying a check for missing values, where previously it used a pre_check object and sapply - https://github.com/dfe-analytical-services/eesyscreener/commit/290679711d82529a7d0c54b63cac61ae92a350d8
- Create and use helper functions (if you haven't already); place any helper functions in `R/utils.R` and mark them as `@keywords internal`
- Avoid bringing in too many new dependencies; try rewriting in base R where possible
- Use `microbenchmark` to experiment and find the most efficient approach (speed is king here), e.g.
```r
data <- eesyscreener::example_data

microbenchmark::microbenchmark(
  nrow(data) - nrow(janitor::remove_empty(data, which = "rows", quiet = TRUE)),
  sum(apply(data, 1, function(row) all(is.na(row) | row == ""))),
  sum(rowSums(is.na(data) | data == "") == ncol(data)),
  times = 100000
)
```
In this example, the first expression requires the janitor package. It is slightly faster, with a mean of 115 ms; the second expression's mean was 184 ms and the third's was 121 ms. As there's minimal difference between options 1 and 3, usually we'd go for option 3 to avoid pulling in the janitor package as a dependency, keeping this package more lightweight.

However, when testing with a bigger file (~6 million rows), the second and third options were so slow to run that microbenchmark was taking too long to produce a result. Clearly, using janitor in option 1 has a massive impact when there are more rows, so the decision was made to import the janitor package, as the speed benefit was worth the extra dependency.
You can use the `tests/utils/benchmarking.R` script as a starting point; it includes code to generate big tables and run benchmarking. Sometimes the fastest approach on small files will not be the fastest on bigger ones. Prioritise the bigger files, as that's where the biggest impact will be felt.
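If you want to roll your own, a rough sketch of the idea is below; the column names and values are made up purely to produce a large table, and `tests/utils/benchmarking.R` remains the real starting point.

```r
# Build a large dummy data.frame so timings reflect big-file behaviour
n <- 6e6
big_data <- data.frame(
  time_identifier = sample(c("Calendar year", "Not a real id"), n, replace = TRUE),
  value = sample(0:100, n, replace = TRUE)
)

microbenchmark::microbenchmark(
  setdiff(
    unique(as.character(big_data$time_identifier)),
    eesyscreener::acceptable_time_ids
  ),
  times = 10
)
```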
In particular, look at using dplyr verbs that can be handled by duckplyr for speedier alternatives, as we enable duckplyr within `screen_dfs()`. data.table probably isn't necessary and would require us to switch between a data.frame and a data.table, costing time.
TODO: At the end of the migration, we should then plan how to migrate the test data over, so that all of the existing edge cases tested in the R Shiny app can be covered here.
- Make use of the `tests/utils/copy_check_data.R` script to copy over CSVs from the screener repo and save them as RDS files in the tests folder
- There should be a collection of unit tests and at least one example data file with an expected failure for every function (see the sketch below)
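To illustrate that last point, a minimal testthat sketch might look like the following. The RDS file name, the check function, and the expected list fields are all assumptions for illustration; mirror an existing test file in `tests/testthat/` for the real conventions.

```r
# Illustrative sketch only - the RDS file name, the check function and the
# "result" element are assumptions, not the package's actual conventions.
test_that("check_time_identifier fails on an invalid time_identifier", {
  bad_data <- readRDS(testthat::test_path("data", "invalid_time_identifier.rds"))

  result <- check_time_identifier(bad_data)

  expect_equal(result$result, "FAIL")
})
```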