Chapter 5 Base R

R comes with a number of functions that allow you to maniuplate, summarise and analyse data. Collectively these are known as Base R.

This chapter summarises some of the core functions for manipulating data in base R.

5.1 Loading in data

The most common process to load in data is to load in a CSV file.

Tip: Load in and create and initial, raw version of the file, and never do anything to that object. This means if you muck anything up, you’ll always have a clean dataset to start again from. This is particularly important when loading in big dataframes, such as those from SQL.


5.2 Basic dataframe functions

There are a number of functions which can be used to gain a summary of data:

Tip: This worksheet is built using an add on to R called R Markdown, which integrates code, text, and images. We’ll come on to how to create R Markdown documents later, but for now, the box below with a white background and grey border is an output from the above code as it renders in R Markdown. Remember this, it will appear a lot in the worksheet.


##    LA_Number     Establishment_Number  LAEstab_2016    
##  Min.   :201.0   Min.   :1000         Min.   :2013614  
##  1st Qu.:371.0   1st Qu.:2115         1st Qu.:3712144  
##  Median :850.0   Median :3001         Median :8502761  
##  Mean   :697.6   Mean   :3133         Mean   :6979196  
##  3rd Qu.:891.0   3rd Qu.:3605         3rd Qu.:8912352  
##  Max.   :938.0   Max.   :7750         Max.   :9387022  
##                                                        
##           LA_Name           URN        
##  Lancashire   :  632   Min.   :100000  
##  Kent         :  581   1st Qu.:110587  
##  Essex        :  556   Median :120528  
##  Hertfordshire:  535   Mean   :122340  
##  Hampshire    :  531   3rd Qu.:137029  
##  Birmingham   :  447   Max.   :143768  
##  (Other)      :18631                   
##                                School_Name   
##  St Joseph's Catholic Primary School :   57  
##  St Mary's Catholic Primary School   :   41  
##  St Anne's Catholic Primary School   :   20  
##  St Patrick's Catholic Primary School:   20  
##  Holy Family Catholic Primary School :   19  
##  St Peter's Catholic Primary School  :   18  
##  (Other)                             :21738
##   LA_Number Establishment_Number LAEstab_2016    LA_Name    URN
## 1       331                 4004      3314004   Coventry 141104
## 2       332                 2010      3322010     Dudley 103774
## 3       330                 4010      3304010 Birmingham 139788
## 4       332                 2012      3322012     Dudley 103775
## 5       332                 2043      3322043     Dudley 103781
## 6       210                 1025      2101025  Southwark 100768
##                           School_Name
## 1                         Seva School
## 2 Kates Hill Community Primary School
## 3             Waverley Studio College
## 4      Northfield Road Primary School
## 5         Dawley Brook Primary School
## 6          Ann Bernadt Nursery School
##       LA_Number Establishment_Number LAEstab_2016         LA_Name    URN
## 21908       855                 2092      8552092  Leicestershire 143250
## 21909       831                 2424      8312424           Derby 112728
## 21910       882                 2105      8822105 Southend-on-Sea 143341
## 21911       936                 2269      9362269          Surrey 142433
## 21912       931                 2561      9312561     Oxfordshire 137992
## 21913       831                 1009      8311009           Derby 112475
##                          School_Name
## 21908       Newcroft Primary Academy
## 21909        Pear Tree Infant School
## 21910 Thorpe Greenways Infant School
## 21911      Lightwater Village School
## 21912        Faringdon Infant School
## 21913        Walbrook Nursery School
##  [1] "LA_Number"                                                               
##  [2] "Establishment_Number"                                                    
##  [3] "LAEstab_2016"                                                            
##  [4] "LA_Name"                                                                 
##  [5] "URN"                                                                     
##  [6] "School_Name"                                                             
##  [7] "School_Type_Description"                                                 
##  [8] "School_Type"                                                             
##  [9] "School_Phase"                                                            
## [10] "Religious_Character"                                                     
## [11] "Government_Office_Region_Name"                                           
## [12] "Parliamentary_Constituency"                                              
## [13] "LA_District"                                                             
## [14] "Ward"                                                                    
## [15] "StatutoryLowAge"                                                         
## [16] "StatutoryHighAge"                                                        
## [17] "Tot_Workforce_HC"                                                        
## [18] "Tot_Classroom_Teachers_HC"                                               
## [19] "Tot_Teachers_Leadership_HC"                                              
## [20] "Tot_Teachers_HC"                                                         
## [21] "Tot_TAs_HC"                                                              
## [22] "Tot_NonClassroom_Support_Staff_Exc_Aux_Staff_HC"                         
## [23] "Tot_Auxiliary_Staff_HC"                                                  
## [24] "Perc_PT_Teaching_Staff_"                                                 
## [25] "Tot_Workforce_FTE"                                                       
## [26] "Tot_Classroom_Teachers_FTE"                                              
## [27] "Tot_Teachers_Leadership_FTE"                                             
## [28] "Tot_Teachers_FTE"                                                        
## [29] "Tot_TAs_FTE"                                                             
## [30] "Tot_NonClassroom_Support_Staff_Exc_Aux_Staff_FTE"                        
## [31] "Tot_Aux_Staff_FTE"                                                       
## [32] "TA_Teacher_Ratio"                                                        
## [33] "Pupil_Teacher_Ratio"                                                     
## [34] "Perc_Male_Teachers"                                                      
## [35] "Perc_Minority_Ethnic_Teachers"                                           
## [36] "Perc_Over_Age_50_Teachers"                                               
## [37] "Perc_QTS_Teachers"                                                       
## [38] "Perc_of_Unqual_Teachers_who_Unqual_on_QTS_Route"                         
## [39] "Perc_Male_TAs"                                                           
## [40] "Perc_Minority_Ethnic_TAs"                                                
## [41] "Perc_HLTA_TAs"                                                           
## [42] "Perc_Male_Non_Classroom_Support_Staff"                                   
## [43] "Perc_Minority_Ethnic_Non_Classroom_Support_Staff"                        
## [44] "Perc_Male_Aux_Staff"                                                     
## [45] "Perc_Minority_Ethnic_Aux_Staff"                                          
## [46] "Regional_Pay_Spine"                                                      
## [47] "Mean_Gross_Salary_All_Teachers_Sterling"                                 
## [48] "Perc_Main_Pay_Range_Classroom_Teachers"                                  
## [49] "Perc_Upper_Pay_Range_Leading_Practioners_Pay_Range_Classroom_Teachers"   
## [50] "Perc_Receive_Allowance_Qual_Classroom_Teachers"                          
## [51] "Perc_Leadership_Pay_Range_Teachers"                                      
## [52] "Perc_At_Least_One_Sickness_Absence_Teachers"                             
## [53] "Tot_Days_Sickness_Absence"                                               
## [54] "Mean_Days_Lost_Teacher_Sickness_Absence_Of_Those_Taking_Sickness_Absence"
## [55] "Mean_Days_Lost_Teacher_Sickness_Absence_All_Teachers"                    
## [56] "FT_Vacant_Posts"                                                         
## [57] "Perc_FT_Posts_Vacant"                                                    
## [58] "FT_Temp_Filled_Posts"                                                    
## [59] "Perc_FT_Temp_Filled_Posts"

Tip:

Click on the name of a dataframe to open it in a new window. Click on the arrow to the left of it to see the structure of it, or type str(DATAFRAME_NAME)


s

We can also change column names using the above function:

5.3 Selecting certain columns/removing columns

Selecting certain columns is really helpful for creating subset dataframes. Below we select the school’s LA Establishment Code, its unique identifier, and all the columns to do to do with teacher absences:

Let’s break this down:

  1. teacher_absences is the name of the new dataframe we’re going to create
  2. We’ve seen the get sign before
  3. swfc_16 is the dataframe we’re going to select columns from
  4. [] is for selecting a certain element from an object
  5. c() stipulates a character string, in this instance a load of numbers
  6. x:y means all numbers between and including x and y
  7. The comma and nothing before it signifies that all rows must be included - the format is dataframe[row conditions, column conditions]

Whilst we can use this method to remove columns, we can also remove individual columns using a simpler method. Say for instance we wanted to remove the 2016 LA Establishment Code, the third column:

5.4 Conditional Selections

We can select subsets of dataframes based on certain conditions. There are a number of ways to do it, but this method uses functions in the basic R set of functions, known as ‘base R’:

Let’s break this down:

  1. swfc_16_pri is the name of the new dataframe we’re going to create
  2. We’ve seen the get sign before
  3. swfc_16 is the dataframe we’re going to conditionally select rows from
  4. [] is for selecting a certain element from an object
  5. $ is for extracting an element by name, in this instance the School_Phase column
  6. ==“Primary” signifies that rows must equal Primary
  7. The comma and then nothing after it signifies that all columns must be included - the format is dataframe[row conditions, column conditions]

Activity A5.1: Use conditional selections to create a new dataframe which contains all schools whose school type is an academy. Open the dataframe to see what an academy is actually called in School_Type.


Tip:

After typing the dollar sign when looking for dataframe column names, pause, and a dropdown list will appear.


We can also use selections on numerical variables too:

However, we’re not limited to just one condition:

Activity A5.2: Select all schools whose StatutoryLowAge is higher than 5 and have no full time posts vacant (the fourth from last column).


5.5 Altering data in dataframes

Editing dataframes is a key skill. We can edit the data within columns, or create new ones.

Here we edit the Religious_Character to be TRUE or FALSE. The Religious_Character column is a column of factors - strings limited to a certain number of entries. We will first turn it into a column which can contain any string, called a character column.

Tip:

If a column is a factor, you can see what the different entries are through levels(dataframe$column).


This uses boolean logic (TRUE or FALSE). It also uses != which means does not equal.

We can calculate a new column too. In this instance we’ll work out the percentage of teaching staff that are vacancies.

Activity A5.3: Turn all schools which arent an LA maintained school or a special school into ‘Not LA maintained’.


5.6 Writing data

Activity A5.4: The function for writing is write.csv(). Use the help function (?write.csv) to work out what the arguments are for this function