Check for duplicate rows in data

Anyone else picturing a cartoon Army general nicknamed 'Dupes'? Just me? Never mind, read on anyway...

Usage

check_general_dupes(
  data,
  meta,
  verbose = FALSE,
  stop_on_error = FALSE,
  return_dupes = FALSE
)

Arguments

data: A data frame of the data file
meta: A data frame of the metadata file
verbose: logical, if TRUE prints feedback messages to the console for every test, if FALSE runs silently
stop_on_error: logical, if TRUE will stop with an error if the result is "FAIL", and will throw a genuine warning if the result is "WARNING"
return_dupes: logical, if TRUE returns a data frame of the rows that are duplicated (across observational unit and filter columns) instead of the standard check result. All columns from the original data are included. Rows at excluded geographic levels are not included. When there are no duplicates, an empty data frame with the same columns is returned. Defaults to FALSE.

Value

A single-row data frame containing the check result or, if return_dupes = TRUE, a data frame of the duplicated rows

Details

Checks for duplicate rows across observational unit and filter columns. School, Provider, Institution, and Planning area rows are handled specially: when data contains exclusively School or Provider rows, only Institution and Planning area rows are excluded before checking. In all other cases, School, Provider, Institution, and Planning area rows are all excluded from the duplicate check.
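The exclusion rule above can be sketched in base R. This is a rough illustration only, not the package's internal code: exclude_levels() is a hypothetical helper, and it assumes that "exclusively School or Provider rows" covers data containing either level or a mix of the two.

```r
# Hypothetical helper, not part of the package: drops the geographic levels
# that are described as excluded before the duplicate check.
exclude_levels <- function(data) {
  levels_present <- unique(data$geographic_level)

  # Assumption: a mix of School and Provider rows also counts as "exclusively"
  only_school_or_provider <- length(levels_present) > 0 &&
    all(levels_present %in% c("School", "Provider"))

  excluded <- if (only_school_or_provider) {
    c("Institution", "Planning area")
  } else {
    c("School", "Provider", "Institution", "Planning area")
  }

  data[!data$geographic_level %in% excluded, , drop = FALSE]
}

# Toy data, made up for this sketch
mixed <- data.frame(
  geographic_level = c("National", "School", "School"),
  value = 1:3
)
schools_only <- data.frame(
  geographic_level = c("School", "School"),
  value = 1:2
)

nrow(exclude_levels(mixed))        # 1: only the National row remains
nrow(exclude_levels(schools_only)) # 2: School rows are kept
```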

This check is intended to catch cases where there are multiple observation values for any option that users would be able to select in the table on EES. If this were to happen, there would be no guarantee of what users would see, and the data would be compromised. The check also catches any fully duplicated rows at the same time.

To help with debugging, the check can optionally return the rows that are duplicated across the observational unit and filter columns, so that users can see exactly which rows are affected. Many an adamant analyst has challenged this code and said it was wrong. Every single time, the code has been proven right. General dupes never lies!
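As a rough illustration of what "duplicated across the observational unit and filter columns" means, here is a minimal base R sketch. It is not the package's implementation: the toy data and the choice of key_cols are assumptions made for this example only.

```r
# Toy data: the first and third rows share the same key values
# but report different enrolment_count values.
toy <- data.frame(
  time_period = c(2023, 2023, 2023),
  geographic_level = c("National", "National", "National"),
  sex = c("Female", "Male", "Female"),
  enrolment_count = c(100, 120, 105)
)

# Assumed key columns (observational units plus filters) for this toy data
key_cols <- c("time_period", "geographic_level", "sex")

# Flag every row whose key combination occurs more than once,
# scanning from both ends so the first occurrence is flagged too
keys <- toy[, key_cols, drop = FALSE]
is_dupe <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

dupes <- toy[is_dupe, , drop = FALSE]
nrow(dupes) # 2
```

The two flagged rows are not identical, since their enrolment_count values differ, yet a user selecting 2023, National and Female would match both of them. That is the situation the check is designed to catch.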

Examples

check_general_dupes(example_data, example_meta)
#>           check result
#> 1 general_dupes   PASS
#>                                                                                                                                          message
#> 1 There are no duplicate rows in the data file. Note that School, Provider, Institution, and Planning area rows were not included in this check.
#>   guidance_url         duration
#> 1           NA 0.004716635 secs
check_general_dupes(example_data, example_meta, return_dupes = TRUE)
#> [1] time_period      time_identifier  geographic_level country_code    
#> [5] country_name     sex              education_phase  enrolment_count 
#> <0 rows> (or 0-length row.names)
