Chapter 1 Why R?
1.1 Current Workflow
A typical analytical workflow in our department might involve SQL, Excel and Word. Typical steps might be:
- Query a database with SQL code using SQL Server Management Studio
- QA this code
- Copy and paste the output into Excel
- Process the data in Excel
- Produce outputs (tables, plots, etc) manually in Excel
- QA your Excel file(s)
- Copy and paste outputs into a Word document
- QA the Word document
- You notice an error
- Debug somehow (go back to step 1?)
There are three main reasons why this isn’t ideal. It’s:
- got a high chance of producing errors
- difficult to reproduce your work (what order were the steps in your workflow?)
- time consuming (many steps, lots of wasted time)
So, let’s discuss what we mean by ‘errors’. This is mostly a problem with spreadsheets and moving data in and out of them. You:
- might alter data without realising
- could copy-paste data or formulae incorrectly with little record of what steps you took (e.g. this high profile case)
- might not realise how frequent these errors are until they get embarrassing
In terms of reproducibility, you don’t have a record of the order of doing things and therefore it’s not easy to backtrack on mistakes. A lot of documentation and commenting is required within and across multiple files to ensure that the workflow can be replicated. Typically, this is not always the case. If you write reproducible code, it may also be easier to automate it. This in turn can help free-up time for other, perhaps less trivial, tasks. For example, the Reproducible Analytical Pipeline (RAP) approach helps reduce error and speed up the process of producing official statistics.
Obviously the process takes time because you have to copy-paste values from place to place and perform quality assurance across all the files in your workflow. But there’s also the time needed to remember how you did the analysis when you’re asked to make changes long after you remember how the process works.
1.2 The bottom line
Our analytical work has a direct impact on policy decisions and therefore it affects young people, parents, learners, schools, teachers and many others.
Above all humans cannot be trusted. Let’s minimise the chance of errors, speed things up and make it easy on our future selves by minimising the chance of doing it wrong in the first place. This means breaking away from spreadsheet addiction.
1.3 R is the answer
What might an optimal analytical workflow look like in R then?
- Run your code
This is simple. R is end-to-end: you can get data in at one end from files or a database and pump it out the other in a report or app, while also having automated testing built in. All from the same script. You also have the opportunity to more easily version your work using tools such as Git and GitHub.
1.4 But what is R?
R is a just another tool for data analysis, in the same way that Excel and SQL are tools for data analysis.
Put simply, R lets you read, wrangle and analyse data and create outputs such as graphics, documents and interactive apps. R is a coding language, which means you use it to write instructions for the computer to perform. This allows for fine control of what you want to do.
You can think of R as a place where data is abstracted away and the instructions are brought to the forefront, whereas spreadsheets are where data is at the forefront and the instructions are abstracted away (I heard this somewhere but can’t remember the source; let me know).
RStudio is simply a very useful interface for R that provides a whole bunch of useful bells and whistles.
What’s great about R? It’s:
- free
- available on our work laptops via Software Center
- open-source and cross-platform (you can download it for Windows, Mac and Linux machines)
- established and has many high-quality extensions available (‘packages’)
- has a big and active community, both in the department (e.g. Coffee & Coding) and online (e.g. the RStudio Community)
- got a lot of in-built help files
- got a wealth of articles and help online (e.g. the R bloggers feed and via StackOverflow)
- got excellent statistical and graphics capabilities in particular
- the suite of RStudio tools make documentation, teaching and dissemination much easier
I could go on.
1.5 Should I stop using all other tools?
R is not always the answer. We’re not telling you that we must do things in any particular way. For example, you have an urgent request for the minister due in five minutes and you don’t have the experience to do it in R. Excel may be good enough. That’s absolutely fine. The argument here is that we should move towards a more reproducible model, so that when the minister comes back wanting to tweak your calculation you can be confident that you can remember what you did and how you did it.