Chapter 6 Tidy data manipulation

6.1 Packages

One of the great things about R (and lots of other open source programming languages) is that you can add in extra functionality really easy. These extra functions come in the form of packages.

6.2 Setup

To get a package, we first need to install it. This can be done in the console, as once installed, we don’t need to reinstall every time we start a new session of R.

We are going to install a package called dplyr, a package to manipulate data.

A lot of funny stuff will come up in your Console, ignore it, it’s all normal (or should be!).

Once we’ve installed it, we have to load the package. This we do have to do every time we start a new session of R.

6.2.1 What we’ll learn with dplyr

dplyr is a fantastic R package developed to help us manipulate data easily. With dplyr we can…

  • Select columns using dplyr::select()

  • Rename columns using and dplyr::rename()

  • Select distinct combinations of variables using dplyr::distinct()

  • Arrange data from high to low using dplyr::arrange() and dplyr::top_n()

  • Filter data using dplyr::filter()

  • Create new variable, recode variables and change column classes using dplr::mutate()

  • Create summary statistics using a combination of dplyr::group_by() and dplyr::summarise()

6.3 Piping notation with dplyr

When we do a piece of data manipulation, we need to then put what we’ve done into a new object.

Example:

This will give us an object in our environment (place where our data lives in RStudio) called selected_data_example which will contain the variables la_name and la_number from the swfc_headcount data.

6.4 Selecting variables

  1. In the above example we’ve seen how we can select two variables such as la_name and la_number.
  1. If we have an object and we want all the columns bar one, we can use a minus sign. e.g. to get all the columns in the swfc_headcount data bar the la_name variable, we can use:

Exercise

  1. Now can you use the same logic to select the columns region and la_name? Output this to an object called exercise_1. Below is a starting guide to help you get going.
  1. Use the same logic to select the columns teacher_count and school_name? Output this to an object called exercise_2.
  1. Can you select all the variables in the swfc_headcount data apart from the school_name variable? Output this to an object called exercise_3.

6.5 Renaming variables

  1. To rename a variable, we use dplyr::rename() and use the structure of saying the new name we’d like for our variable and then setting it equal to the variable as it’s named. E.g:
  1. If I wanted to rename the la_number variable to be called la_num, I’d run:
  1. Each time, this returns all of the variables in the data with the one we’ve renamed changed! I can rename multiple variables by using a comma such as;
  1. We can rename variables to have name’s we’d like to have presentationally with spacing in, but programming languages tend to not like this very much, so if you are to do this, please do this at the end of your analysis! Note: The use of backticks `.

Exercise

  1. Rename the region variable from the swfc_headcount dataframe to be called govt_office_region.

6.6 Distinct combinations

Sometimes we may just want to see which things are in our data, such as which schools or which regions. This can be a challenge when we have multiple rows for each observation. We can use the dplyr::distinct() function to do just this.

  1. If I wanted to see the distinct regions in the data I could run:
  1. If I wanted to see a distinct combination of multiple varibles, for example to make a lookup table of all the local authority names and regions, I could run:

Exercise

  1. Create a lookup table of local authority names and numbers using la_name and la_number. We’ve got you started below:

6.7 Making new variables / changing data types

  1. If I wanted to make a dummy variable to identify which schools have more than 50 teachers, I could do the following:

Tip:

Note: In the above example we made sure to treat the teacher_count variable as a numeric value, why is this?

Programming languages are very good at spotting trends in data and basically defining data in a way it sees fit. In this example, out swfc_headount data has a column of teacher_count. To me and you, we’d think of this as a numeric value anyway. But, in the data, we have many DNS (Did not submit) values which R understands to be characters. So, as the column for teacher_count contains both numeric and character values, it automatically will read them in a way that is logical. Numeric values can be read as character strings, e.g. 1,2, etc. But “School” cannot possibly be numeric, so the column is read as a character. So, if we look at the structure of the teacher_count variable, using

We can make a new variable, which reads teacher_count as numeric, which will turn the DNS (Did not sumbit) values into NA’s, so we can perform mathematical operations on the column.

Now, if we wanted to, we can perform maths on that dataframe more easily e.g.

6.8 Arranging data

We may be interested to see our data and look at the school’s with the biggest workforce. To do that we need to arrange our data.

Exercise

  1. Can you now look at which school has the most teachers? Use the teacher_count variable.

6.9 Filtering data

We may want to just have the data for a particular region or local authority. Sometimes we may have multiple conditions needed when we’re looking for particular data. The dplyr::filter() function allows us to get that data.

  1. If I wanted to just get the data for my own local authority, I can filter the data to find just that, e.g.
  1. If I wanted to get all the data for all local authorities other than Wigan, I could use
  1. If I wanted to get all the data for two local authorities I could use
  1. If I wanted to filter data on multiple conditions, e.g. on la_name and school_type, I could use
  1. If I wanted to filter the data to find schools with more than 20 teachers,

Exercise

  1. Can you find the data for teacher headcount in Westminister?

  2. Can you find how many teachers work at “Beckford Primary School”?

6.10 Making new variables

  1. If I wanted to make a dummy variable to identify which schools have more than 50 teachers, I could do the following:

6.11 Summary statistics

We can create summary statistics using dplyr, which groups data by certain characteristics and then performing certain calculations - counts of each group or averages for each group - a really popular feature in Excel; can be replicated using dplyr.

  • Counting the number of each type of school:

Exercise

Now use the same logic to calculate the number of schools per la_name!