Chapter 5 Base R
R comes with a number of functions that allow you to maniuplate, summarise and analyse data. Collectively these are known as Base R.
This chapter summarises some of the core functions for manipulating data in base R.
5.1 Loading in data
The most common process to load in data is to load in a CSV file.
#Load in data
swfc_16_init <- read.csv("data/SWFC_2016_Machine_Readable.csv")
swfc_16 <- swfc_16_init
Tip: Load in and create and initial, raw version of the file, and never do anything to that object. This means if you muck anything up, you’ll always have a clean dataset to start again from. This is particularly important when loading in big dataframes, such as those from SQL.
5.2 Basic dataframe functions
There are a number of functions which can be used to gain a summary of data:
Tip: This worksheet is built using an add on to R called R Markdown, which integrates code, text, and images. We’ll come on to how to create R Markdown documents later, but for now, the box below with a white background and grey border is an output from the above code as it renders in R Markdown. Remember this, it will appear a lot in the worksheet.
## LA_Number Establishment_Number LAEstab_2016
## Min. :201.0 Min. :1000 Min. :2013614
## 1st Qu.:371.0 1st Qu.:2115 1st Qu.:3712144
## Median :850.0 Median :3001 Median :8502761
## Mean :697.6 Mean :3133 Mean :6979196
## 3rd Qu.:891.0 3rd Qu.:3605 3rd Qu.:8912352
## Max. :938.0 Max. :7750 Max. :9387022
##
## LA_Name URN
## Lancashire : 632 Min. :100000
## Kent : 581 1st Qu.:110587
## Essex : 556 Median :120528
## Hertfordshire: 535 Mean :122340
## Hampshire : 531 3rd Qu.:137029
## Birmingham : 447 Max. :143768
## (Other) :18631
## School_Name
## St Joseph's Catholic Primary School : 57
## St Mary's Catholic Primary School : 41
## St Anne's Catholic Primary School : 20
## St Patrick's Catholic Primary School: 20
## Holy Family Catholic Primary School : 19
## St Peter's Catholic Primary School : 18
## (Other) :21738
## LA_Number Establishment_Number LAEstab_2016 LA_Name URN
## 1 331 4004 3314004 Coventry 141104
## 2 332 2010 3322010 Dudley 103774
## 3 330 4010 3304010 Birmingham 139788
## 4 332 2012 3322012 Dudley 103775
## 5 332 2043 3322043 Dudley 103781
## 6 210 1025 2101025 Southwark 100768
## School_Name
## 1 Seva School
## 2 Kates Hill Community Primary School
## 3 Waverley Studio College
## 4 Northfield Road Primary School
## 5 Dawley Brook Primary School
## 6 Ann Bernadt Nursery School
## LA_Number Establishment_Number LAEstab_2016 LA_Name URN
## 21908 855 2092 8552092 Leicestershire 143250
## 21909 831 2424 8312424 Derby 112728
## 21910 882 2105 8822105 Southend-on-Sea 143341
## 21911 936 2269 9362269 Surrey 142433
## 21912 931 2561 9312561 Oxfordshire 137992
## 21913 831 1009 8311009 Derby 112475
## School_Name
## 21908 Newcroft Primary Academy
## 21909 Pear Tree Infant School
## 21910 Thorpe Greenways Infant School
## 21911 Lightwater Village School
## 21912 Faringdon Infant School
## 21913 Walbrook Nursery School
## [1] "LA_Number"
## [2] "Establishment_Number"
## [3] "LAEstab_2016"
## [4] "LA_Name"
## [5] "URN"
## [6] "School_Name"
## [7] "School_Type_Description"
## [8] "School_Type"
## [9] "School_Phase"
## [10] "Religious_Character"
## [11] "Government_Office_Region_Name"
## [12] "Parliamentary_Constituency"
## [13] "LA_District"
## [14] "Ward"
## [15] "StatutoryLowAge"
## [16] "StatutoryHighAge"
## [17] "Tot_Workforce_HC"
## [18] "Tot_Classroom_Teachers_HC"
## [19] "Tot_Teachers_Leadership_HC"
## [20] "Tot_Teachers_HC"
## [21] "Tot_TAs_HC"
## [22] "Tot_NonClassroom_Support_Staff_Exc_Aux_Staff_HC"
## [23] "Tot_Auxiliary_Staff_HC"
## [24] "Perc_PT_Teaching_Staff_"
## [25] "Tot_Workforce_FTE"
## [26] "Tot_Classroom_Teachers_FTE"
## [27] "Tot_Teachers_Leadership_FTE"
## [28] "Tot_Teachers_FTE"
## [29] "Tot_TAs_FTE"
## [30] "Tot_NonClassroom_Support_Staff_Exc_Aux_Staff_FTE"
## [31] "Tot_Aux_Staff_FTE"
## [32] "TA_Teacher_Ratio"
## [33] "Pupil_Teacher_Ratio"
## [34] "Perc_Male_Teachers"
## [35] "Perc_Minority_Ethnic_Teachers"
## [36] "Perc_Over_Age_50_Teachers"
## [37] "Perc_QTS_Teachers"
## [38] "Perc_of_Unqual_Teachers_who_Unqual_on_QTS_Route"
## [39] "Perc_Male_TAs"
## [40] "Perc_Minority_Ethnic_TAs"
## [41] "Perc_HLTA_TAs"
## [42] "Perc_Male_Non_Classroom_Support_Staff"
## [43] "Perc_Minority_Ethnic_Non_Classroom_Support_Staff"
## [44] "Perc_Male_Aux_Staff"
## [45] "Perc_Minority_Ethnic_Aux_Staff"
## [46] "Regional_Pay_Spine"
## [47] "Mean_Gross_Salary_All_Teachers_Sterling"
## [48] "Perc_Main_Pay_Range_Classroom_Teachers"
## [49] "Perc_Upper_Pay_Range_Leading_Practioners_Pay_Range_Classroom_Teachers"
## [50] "Perc_Receive_Allowance_Qual_Classroom_Teachers"
## [51] "Perc_Leadership_Pay_Range_Teachers"
## [52] "Perc_At_Least_One_Sickness_Absence_Teachers"
## [53] "Tot_Days_Sickness_Absence"
## [54] "Mean_Days_Lost_Teacher_Sickness_Absence_Of_Those_Taking_Sickness_Absence"
## [55] "Mean_Days_Lost_Teacher_Sickness_Absence_All_Teachers"
## [56] "FT_Vacant_Posts"
## [57] "Perc_FT_Posts_Vacant"
## [58] "FT_Temp_Filled_Posts"
## [59] "Perc_FT_Temp_Filled_Posts"
Tip:
Click on the name of a dataframe to open it in a new window. Click on the arrow to the left of it to see the structure of it, or type str(DATAFRAME_NAME)
s
We can also change column names using the above function:
5.3 Selecting certain columns/removing columns
Selecting certain columns is really helpful for creating subset dataframes. Below we select the school’s LA Establishment Code, its unique identifier, and all the columns to do to do with teacher absences:
Let’s break this down:
- teacher_absences is the name of the new dataframe we’re going to create
- We’ve seen the get sign before
- swfc_16 is the dataframe we’re going to select columns from
- [] is for selecting a certain element from an object
- c() stipulates a character string, in this instance a load of numbers
- x:y means all numbers between and including x and y
- The comma and nothing before it signifies that all rows must be included - the format is dataframe[row conditions, column conditions]
Whilst we can use this method to remove columns, we can also remove individual columns using a simpler method. Say for instance we wanted to remove the 2016 LA Establishment Code, the third column:
5.4 Conditional Selections
We can select subsets of dataframes based on certain conditions. There are a number of ways to do it, but this method uses functions in the basic R set of functions, known as ‘base R’:
Let’s break this down:
- swfc_16_pri is the name of the new dataframe we’re going to create
- We’ve seen the get sign before
- swfc_16 is the dataframe we’re going to conditionally select rows from
- [] is for selecting a certain element from an object
- $ is for extracting an element by name, in this instance the School_Phase column
- ==“Primary” signifies that rows must equal Primary
- The comma and then nothing after it signifies that all columns must be included - the format is dataframe[row conditions, column conditions]
Activity A5.1: Use conditional selections to create a new dataframe which contains all schools whose school type is an academy. Open the dataframe to see what an academy is actually called in School_Type.
Tip:
After typing the dollar sign when looking for dataframe column names, pause, and a dropdown list will appear.
We can also use selections on numerical variables too:
However, we’re not limited to just one condition:
#Conditionally select schools where Pupil:Teacher Ratios are below 20 and above or equal to 10
swfc_16_ptr <- swfc_16[(swfc_16$Pupil_Teacher_Ratio < 20 & swfc_16$Pupil_Teacher_Ratio >=10),]
#Conditionally select schools where Pupil:Teacher Ratios are below 10 or their LA is in Camden
swfc_16_ptr_camden <- swfc_16[(swfc_16$Pupil_Teacher_Ratio < 10 | swfc_16$LA_Name == "Camden"),]
Activity A5.2: Select all schools whose StatutoryLowAge is higher than 5 and have no full time posts vacant (the fourth from last column).
5.5 Altering data in dataframes
Editing dataframes is a key skill. We can edit the data within columns, or create new ones.
Here we edit the Religious_Character to be TRUE or FALSE. The Religious_Character column is a column of factors - strings limited to a certain number of entries. We will first turn it into a column which can contain any string, called a character column.
Tip:
If a column is a factor, you can see what the different entries are through levels(dataframe$column).
#Turn Religious.Character to binary
swfc_16$Religious_Character <- as.character(swfc_16$Religious_Character)
swfc_16$Religious_Character[swfc_16$Religious_Character == 'Does not apply' |
swfc_16$Religious_Character == 'None' |
swfc_16$Religious_Character == ""] <- FALSE
swfc_16$Religious_Character[swfc_16$Religious_Character != FALSE] <- TRUE
This uses boolean logic (TRUE or FALSE). It also uses != which means does not equal.
We can calculate a new column too. In this instance we’ll work out the percentage of teaching staff that are vacancies.
#Calculate percentage of posts which are vacancies
swfc_16$perc_vacancies <- swfc_16$FT_Vacant_Posts/swfc_16$Tot_Teachers_HC
Activity A5.3: Turn all schools which arent an LA maintained school or a special school into ‘Not LA maintained’.
5.6 Writing data
Activity A5.4: The function for writing is write.csv(). Use the help function (?write.csv) to work out what the arguments are for this function