Chapter 6 Tidy data manipulation
6.1 Packages
One of the great things about R (and lots of other open source programming languages) is that you can add in extra functionality really easy. These extra functions come in the form of packages.
6.2 Setup
To get a package, we first need to install it. This can be done in the console, as once installed, we don’t need to reinstall every time we start a new session of R.
We are going to install a package called dplyr, a package to manipulate data.
A lot of funny stuff will come up in your Console, ignore it, it’s all normal (or should be!).
Once we’ve installed it, we have to load the package. This we do have to do every time we start a new session of R.
6.2.1 What we’ll learn with dplyr
dplyr
is a fantastic R package developed to help us manipulate data easily. With dplyr
we can…
Select columns using
dplyr::select()
Rename columns using and
dplyr::rename()
Select distinct combinations of variables using
dplyr::distinct()
Arrange data from high to low using
dplyr::arrange()
anddplyr::top_n()
Filter data using
dplyr::filter()
Create new variable, recode variables and change column classes using
dplr::mutate()
Create summary statistics using a combination of
dplyr::group_by()
anddplyr::summarise()
6.3 Piping notation with dplyr
When we do a piece of data manipulation, we need to then put what we’ve done into a new object.
name_of_output_dataframe <- input_dataframe %>%
dplyr::function(variable_name_1, variable_name2, ...)
Example:
This will give us an object in our environment (place where our data lives in RStudio) called selected_data_example
which will contain the variables la_name
and la_number
from the swfc_headcount
data.
6.4 Selecting variables
- In the above example we’ve seen how we can select two variables such as
la_name
andla_number
.
- If we have an object and we want all the columns bar one, we can use a minus sign. e.g. to get all the columns in the
swfc_headcount
data bar thela_name
variable, we can use:
Exercise
- Now can you use the same logic to select the columns
region
andla_name
? Output this to an object calledexercise_1
. Below is a starting guide to help you get going.
- Use the same logic to select the columns
teacher_count
andschool_name
? Output this to an object calledexercise_2
.
- Can you select all the variables in the
swfc_headcount
data apart from theschool_name
variable? Output this to an object calledexercise_3
.
6.5 Renaming variables
- To rename a variable, we use
dplyr::rename()
and use the structure of saying the new name we’d like for our variable and then setting it equal to the variable as it’s named. E.g:
- If I wanted to rename the
la_number
variable to be calledla_num
, I’d run:
- Each time, this returns all of the variables in the data with the one we’ve renamed changed! I can rename multiple variables by using a comma such as;
renamed_data_2 <- swfc_headcount %>%
dplyr::rename(la_num = la_number,school_name_at_2018 = school_name)
- We can rename variables to have name’s we’d like to have presentationally with spacing in, but programming languages tend to not like this very much, so if you are to do this, please do this at the end of your analysis! Note: The use of backticks `.
Exercise
- Rename the
region
variable from theswfc_headcount
dataframe to be calledgovt_office_region
.
6.6 Distinct combinations
Sometimes we may just want to see which things are in our data, such as which schools or which regions. This can be a challenge when we have multiple rows for each observation. We can use the dplyr::distinct()
function to do just this.
- If I wanted to see the distinct
region
s in the data I could run:
- If I wanted to see a distinct combination of multiple varibles, for example to make a lookup table of all the local authority names and regions, I could run:
Exercise
- Create a lookup table of local authority names and numbers using
la_name
andla_number
. We’ve got you started below:
6.7 Making new variables / changing data types
- If I wanted to make a dummy variable to identify which schools have more than 50 teachers, I could do the following:
added_variable_example_1 <- swfc_headcount %>%
dplyr::mutate(more_than_100_teachers = ifelse(as.numeric(as.character(teacher_count)) > 50, 1, 0))
Tip:
Note: In the above example we made sure to treat the teacher_count variable as a numeric value, why is this?
Programming languages are very good at spotting trends in data and basically defining data in a way it sees fit. In this example, out swfc_headount
data has a column of teacher_count
. To me and you, we’d think of this as a numeric value anyway. But, in the data, we have many DNS
(Did not submit) values which R understands to be characters. So, as the column for teacher_count
contains both numeric and character values, it automatically will read them in a way that is logical. Numeric values can be read as character strings, e.g. 1,2, etc. But “School” cannot possibly be numeric, so the column is read as a character. So, if we look at the structure of the teacher_count variable, using
We can make a new variable, which reads teacher_count
as numeric, which will turn the DNS
(Did not sumbit) values into NA’s, so we can perform mathematical operations on the column.
numeric_column_example <- swfc_headcount %>%
dplyr::mutate(teacher_count_numeric = as.numeric(as.character(teacher_count)))
Now, if we wanted to, we can perform maths on that dataframe more easily e.g.
6.8 Arranging data
We may be interested to see our data and look at the school’s with the biggest workforce. To do that we need to arrange our data.
Exercise
- Can you now look at which school has the most teachers? Use the
teacher_count
variable.
6.9 Filtering data
We may want to just have the data for a particular region or local authority. Sometimes we may have multiple conditions needed when we’re looking for particular data. The dplyr::filter()
function allows us to get that data.
- If I wanted to just get the data for my own local authority, I can filter the data to find just that, e.g.
- If I wanted to get all the data for all local authorities other than Wigan, I could use
- If I wanted to get all the data for two local authorities I could use
- If I wanted to filter data on multiple conditions, e.g. on
la_name
andschool_type
, I could use
filtered_data_example_4 <- swfc_headcount %>%
dplyr::filter(la_name == "Wigan" & school_type == "LA maintained schools")
- If I wanted to filter the data to find schools with more than 20 teachers,
Exercise
Can you find the data for teacher headcount in Westminister?
Can you find how many teachers work at “Beckford Primary School”?
6.10 Making new variables
- If I wanted to make a dummy variable to identify which schools have more than 50 teachers, I could do the following:
6.11 Summary statistics
We can create summary statistics using dplyr
, which groups data by certain characteristics and then performing certain calculations - counts of each group or averages for each group - a really popular feature in Excel; can be replicated using dplyr.
- Counting the number of each type of school:
#Calculate the number of schools by school type
school_type_count <- swfc_headcount %>%
dplyr::group_by(school_type) %>%
dplyr::summarise(count_schools = n()) %>%
dplyr::ungroup()
Exercise
Now use the same logic to calculate the number of schools per la_name
!