Chapter 8 Statistical Operations
Drilling down into data easily is one of R’s most powerful features. As we would in Excel, we can use a number of functions to gain a better understanding of the data.
8.1 Maximum, Minimum, and Range
One of the key checks to carry out when loading a dataset is to look at the extreme values in each variable:
- The minimum
- The maximum
- The range
Tip:
Put closing delimiters* on a new line - it’s easier to see which opening delimiter it corresponds to.
*(), {}, []
Now, we can calculate the minimum and maximum for any column we require:
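The dataset isn’t loaded in this snippet, so here is a minimal sketch on a made-up vector (the name ratios and its values are assumptions, not taken from swfc_16). It shows what min and max return when one of the values is missing:

```r
# Made-up pupil:teacher ratios; one value is missing (NA)
ratios <- c(18.2, 21.5, NA, 16.9, 24.1)

# With no extra arguments, a single NA makes the whole result NA
min_ratio <- min(ratios)
max_ratio <- max(ratios)

print(min_ratio)
print(max_ratio)
```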
Activity A8.1: It’s coming up with NA. What’s the argument in min or max that we need to add in to return a result?
We could also write this using the pipe:
swfc_16 %>%
select(Pupil_Teacher_Ratio
) %>%
min(ARGUMENT_IN_HERE
)
swfc_16 %>%
select(Pupil_Teacher_Ratio
) %>%
max(ARGUMENT_IN_HERE
)
These two functions are really useful for identifying extreme values and outliers - potentially values which are incorrect or shouldn’t be there.
We can use another function, similar to min and max, called range.
Activity A8.2: Pick a variable and calculate the range. Think about the arguments you need to use.
Tip:
Statistical functions nearly always need to have NA values removed from the object they’re operating on.
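As a rough illustration of that tip, the sketch below uses the na.rm argument on a made-up vector (ratios is a hypothetical stand-in for a column such as Pupil_Teacher_Ratio):

```r
# Made-up pupil:teacher ratios; one value is missing (NA)
ratios <- c(18.2, 21.5, NA, 16.9, 24.1)

# na.rm = TRUE drops the NA values before calculating
min_ratio   <- min(ratios, na.rm = TRUE)
max_ratio   <- max(ratios, na.rm = TRUE)

# range() returns both extremes at once, as c(minimum, maximum)
ratio_range <- range(ratios, na.rm = TRUE)

# diff() turns that pair into a single spread (maximum minus minimum)
spread <- diff(ratio_range)

print(ratio_range)
print(spread)
```

Note that range gives back the two extreme values rather than one number, so diff is needed if we want the gap between them.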
8.2 Averages
There are two averages we can calculate:
- Mean: This is the ‘average’ that we’re used to - add the values up and divide by the number of values
- Median: Line the values up in order and take the middle one (if there’s an even number of values, go for halfway between the two middle values)
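To make the two definitions concrete, here is a small worked sketch on made-up head counts (staff_counts is a hypothetical vector, not a column from the dataset). It has an even number of values and one very large value, so we can see how differently the two averages react:

```r
# Six made-up staff head counts, including one very large school
staff_counts <- c(12, 15, 20, 22, 90, 41)

# Mean: (12 + 15 + 20 + 22 + 90 + 41) / 6 = 200 / 6
mean_staff <- mean(staff_counts)

# Median: sorted values are 12, 15, 20, 22, 41, 90;
# the two middle values are 20 and 22, so the median is halfway between them
median_staff <- median(staff_counts)

print(mean_staff)
print(median_staff)
```

The single large value pulls the mean well above most of the schools, while the median stays in the middle of the pack.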
Let’s apply each of these to a subset of the main dataset.
Mean:
#The mean of the total school workforce for primary schools
swfc_16 %>%
filter(School_Phase == 'Primary') %>%
group_by(School_Phase) %>%
summarise(Ave = mean(Tot_Workforce_HC,na.rm=TRUE))
Median:
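The median version presumably mirrors the mean example, swapping mean for median. Here is a minimal sketch under that assumption, run on a tiny made-up data frame (toy_swfc is hypothetical and only borrows the column names from swfc_16):

```r
library(dplyr)

# Hypothetical stand-in for swfc_16 with only the two columns used here
toy_swfc <- data.frame(
  School_Phase     = c("Primary", "Primary", "Primary", "Secondary", "Primary"),
  Tot_Workforce_HC = c(40, 55, NA, 120, 46)
)

# The median of the total school workforce for primary schools
primary_median <- toy_swfc %>%
  filter(School_Phase == "Primary") %>%
  group_by(School_Phase) %>%
  summarise(Med = median(Tot_Workforce_HC, na.rm = TRUE))

print(primary_median)
```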
8.3 Correlations
We can calculate correlations between two or more variables.
Let’s just start with two variables:
##                  StatutoryHighAge Tot_Workforce_HC
## StatutoryHighAge        1.0000000        0.6099884
## Tot_Workforce_HC        0.6099884        1.0000000

##                  StatutoryLowAge StatutoryHighAge
## StatutoryLowAge        1.0000000        0.7454194
## StatutoryHighAge       0.7454194        1.0000000
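Output like this comes from the cor function. As a minimal sketch (the toy data frame and its values are made up; only the column names are borrowed from the dataset), we can pass cor a data frame of numeric columns and get a matrix back:

```r
# Hypothetical stand-in for a few numeric columns of swfc_16
toy <- data.frame(
  StatutoryLowAge  = c(3, 4, 11, 11, 4, NA),
  StatutoryHighAge = c(11, 11, 16, 18, 7, 16),
  Tot_Workforce_HC = c(45, 52, 130, 160, 30, 110)
)

# Two variables: a 2 x 2 matrix with 1s on the diagonal
cor_two <- cor(toy[, c("StatutoryHighAge", "Tot_Workforce_HC")])

# All three variables; use = "complete.obs" drops rows containing an NA,
# which is how cor handles missing values (it has no na.rm argument)
cor_all <- cor(toy, use = "complete.obs")

print(cor_two)
print(cor_all)
```

Each cell is the correlation between the row variable and the column variable, which is why the matrix is symmetrical and every variable correlates perfectly with itself.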
Activity A8.3: Create an object (a correlation matrix) which has the correlations for all the columns between StatutoryLowAge and Tot_TAs_HC. Assign it to an object name.
8.4 Significance Testing
This isn’t a stats course, but significance testing is a really handy technique for analysing data. It’s often the first statistical technique in a data analyst’s/scientist’s toolkit, and it’s relatively easy to carry out in R.
In practical terms, significance testing quantifies how confident we are that two groups are different to one another.
Suppose we wanted to test whether primary schools had significantly different total workforces to the school population overall.
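# dplyr version: keep primary schools, then test their total workforce
# against the mean of all schools
t.test(swfc_16 %>%
         filter(School_Phase == "Primary") %>%
         select(Tot_Workforce_HC),
       mu = mean(swfc_16$Tot_Workforce_HC, na.rm = TRUE),
       alternative = "less")
# Base R version of the same test, picking the column by position
# (this assumes column 17 is Tot_Workforce_HC)
t.test(swfc_16[swfc_16$School_Phase == "Primary", 17],
       mu = mean(swfc_16$Tot_Workforce_HC, na.rm = TRUE),
       alternative = "less")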
- t.test(): The technique to test for a significant difference is called a t-test. In this instance we’re carrying out a one-sample, ‘one-tail’ t-test: one-sample because we’re checking whether the average of a sample differs significantly from the average of the entire population, and one-tail because we’re only testing for a difference in one direction.
- Argument 1: The first argument is the sample that we want to compare against the population. In this instance it’s primary schools.
- Argument 2: mu is the average of the population.
- Argument 3: alternative sets the direction of the test. We want to test whether the sample average is ‘less’ than the population average. This could also be ‘greater’, or ‘two.sided’ (the default) to test for a difference in either direction.
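To see how the three arguments fit together without needing the dataset, here is a minimal sketch on simulated numbers (population and sample_hc are made-up vectors, and the means and standard deviations are assumptions chosen for illustration):

```r
set.seed(42)

# Made-up head counts: a 'population' of schools and a smaller sample
# drawn from a group with a lower average
population <- rnorm(1000, mean = 60, sd = 15)
sample_hc  <- rnorm(200, mean = 47, sd = 12)

# Is the sample average significantly less than the population average?
result_less <- t.test(sample_hc,
                      mu = mean(population),
                      alternative = "less")

# Same data, opposite direction
result_greater <- t.test(sample_hc,
                         mu = mean(population),
                         alternative = "greater")

print(result_less$p.value)
print(result_greater$p.value)
```

Because the sample average sits well below the population average, the ‘less’ test gives a tiny p-value and the ‘greater’ test gives one close to 1. For a one-sample test the two always add up to 1.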
##
## One Sample t-test
##
## data: swfc_16 %>% filter(School_Phase == "Primary") %>% select(Tot_Workforce_HC)
## t = -67.484, df = 16683, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 59.69092
## 95 percent confidence interval:
## -Inf 46.84931
## sample estimates:
## mean of x
## 46.52847
Now let’s break the output down (in reverse order from how it’s displayed):
- Mean of x: This is the average of the sample: 46.52847.
- Confidence interval: This is the range that we’re 95% confident contains the true average of the group the sample represents (here, primary schools). If we repeated the sampling many times, about 19 out of 20 of the intervals calculated this way would contain it. Here, because we’re only checking whether the sample’s average is less than the population’s average, the interval only has an upper bound: 46.84931. The population average (59.69092) is well above that bound.
- p-value: This is the probability of seeing a sample average at least this far below the population average if there were really no difference. To be confident that there is a significant difference, this number needs to be less than 0.05 (i.e. a result this extreme would turn up by chance less than 1 time in 20). Here it is less than 2.2e-16.
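We don’t have to read these numbers off the printed output. As a minimal sketch on simulated data (sample_hc and the value of mu are made up), the object that t.test returns can be stored and its pieces pulled out by name:

```r
set.seed(1)

# Made-up sample of head counts, tested against an assumed population mean of 60
sample_hc <- rnorm(50, mean = 47, sd = 12)
res <- t.test(sample_hc, mu = 60, alternative = "less")

sample_mean <- res$estimate   # the 'mean of x' line
ci          <- res$conf.int   # lower and upper bounds; lower is -Inf for "less"
p_val       <- res$p.value

# TRUE if the result is significant at the usual 0.05 threshold
is_significant <- p_val < 0.05

print(sample_mean)
print(ci)
print(is_significant)
```

This is handy when we want to run lots of tests and collect the p-values in a table rather than reading each printout by eye.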
Simple?!?
Activity A8.4: Test whether schools in Camden LA District have a significantly higher percentage of vacant posts (column name is FT_Vacant_Posts) than England as a whole, using a t-test.
Repeat for lower.
As well as comparing a sample to a population we can compare two samples. What we are doing is testing whether the difference of the averages of two samples is significantly different to zero, i.e. there is a difference.
In this example we’re going to test whether primary schools have a significantly different percentage of teachers who are male to schools that have a phase of ‘All Through’.
t.test(Perc_Male_Teachers ~ School_Phase,
       data = (swfc_16 %>%
                 filter(School_Phase == "Primary" |
                          School_Phase == "All Through") %>%
                 select(School_Phase,
                        Perc_Male_Teachers)))
- t-test: As above
- Argument 1: This contains the variable that we’re going to compare the groups on (percentage of teachers that are male), and the characteristic that defines the groups (phase). But there’s more than one phase of school I hear you say…
- Argument 2: Fear not. In the second argument we use some dplyr to do some filtering. We only select schools that are primary or all through, and we only select the school phase and percentage of teachers that are male columns.
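The same pattern can be tried on a small simulated data frame. This is only a sketch: toy is made up, and the group sizes, means and spreads are assumptions picked so that there is a clear difference to find.

```r
library(dplyr)

set.seed(7)

# Hypothetical data with three phases, so we still need to filter down to two
toy <- data.frame(
  School_Phase       = rep(c("Primary", "All Through", "Secondary"),
                           times = c(60, 40, 50)),
  Perc_Male_Teachers = c(rnorm(60, mean = 15, sd = 6),
                         rnorm(40, mean = 32, sd = 8),
                         rnorm(50, mean = 38, sd = 7))
)

# variable ~ group, with the data filtered to exactly two groups
res <- t.test(Perc_Male_Teachers ~ School_Phase,
              data = (toy %>%
                        filter(School_Phase %in% c("Primary", "All Through")) %>%
                        select(School_Phase,
                               Perc_Male_Teachers)))

print(res)
```

Using %in% is just a shorter way of writing the two conditions joined by |. The formula interface needs exactly two groups, so leaving the third phase in would make t.test stop with an error.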
##
## Welch Two Sample t-test
##
## data: Perc_Male_Teachers by School_Phase
## t = 22.812, df = 151.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 15.56238 18.51366
## sample estimates:
## mean in group All Through     mean in group Primary
##                  31.81972                  14.78170
Now let’s break the output down:
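- Welch: Welch’s t-test (named after the statistician B. L. Welch) is what R uses by default for two samples. Unlike the classic Student’s t-test, it doesn’t assume the two groups have equal variances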
- Group means (at the bottom): These are the means of the two samples that we’re comparing
- Confidence interval: We’re 95% confident that the true difference between the two group averages (all through minus primary) is between 15.56238 and 18.51366. Both of those numbers are above 0, so this is looking good…
- p-value: Again, to be confident that there is a significant difference, this number needs to be less than 0.05 (i.e. if there were really no difference between primary and all through schools, a gap this big would turn up by chance less than 1 time in 20)
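The Welch version is why the degrees of freedom in the output (df = 151.8) aren’t a whole number. As a rough sketch on simulated data (group_a and group_b are made up, with deliberately different sizes and spreads), we can compare it with the equal-variance test by setting var.equal = TRUE:

```r
set.seed(3)

# Two made-up groups with different sizes and different spreads
group_a <- rnorm(30, mean = 32, sd = 10)
group_b <- rnorm(200, mean = 15, sd = 4)

# Default: Welch's t-test (var.equal = FALSE)
welch <- t.test(group_a, group_b)

# Classic two sample t-test, which assumes equal variances
pooled <- t.test(group_a, group_b, var.equal = TRUE)

print(welch$method)
print(welch$parameter)    # adjusted degrees of freedom, usually not a whole number
print(pooled$parameter)   # 30 + 200 - 2 = 228
```

When the groups really do have similar variances the two tests give much the same answer, so sticking with the default is a reasonable choice.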
Activity A8.5: Test whether schools in Camden and Northumberland LA Districts have significantly different percentages of vacant posts (column name is FT_Vacant_Posts) from one another, using a two sample t-test.
Tip:
For more info on t-tests, go to this page on dummies.com.
Also, try out this page for an interactive two sample t-test calculator if you want a bit more practice.