Chapter 4 Fundamentals: What good code looks like

There is no definitive style that determines if code is well written or not. There’s an excellent resource for styling R code here and it’s worth noting that Rstudio comes with numerous functionalities designed to help with good code structure (see drop down menu in Code). But, there are several features that are consistently found in code, regardless of language, that follow the principles of good practice. Some of these are outlined in the examples below.

4.1 A clear and informative Header

A header should contain, at a minimum, enough information about the code to give the reader an understanding of the task the code is attempting to carry out. It should contain the details of the original author, as well as any subsequent authors, and the dates of any key events – such as the creation date, or the date of a major modification. Headers can also contain information about data sources, libraries or packages (including version) used, or notes on any assumptions that are made in the analysis. If you’re using Version Control software, some of this information may be captured automatically though.

  # VIRTUAL COHORT Propensity Score Matching ANALYSIS – 2014 KS2    
  #
  # Written by:  Anne Analyst
  # On:        26/11/19
  # Last update: 31/12/19
  #
  # Aim:
  #  Using pupil characteristics to identify pupils in LA-maintained schools 
  #  that are similar to pupils in sponsored academies – here “similar” 
  #  means, “have a high propensity to be” – to do this analysis we will use
  #  Propensity Score Matching.
  #
  # Datasets: 2019 KS2 Data
  # This data has been constructed from various datasets on our SQL
  # server (school census, NPD, AcademiesAndSchoolOrganisation)
  #
  # Key Libraries:
  #  The PSM analysis uses the MatchIt library – Version 2.4-21
  #
  /////----------------------------------------------------------------\\\\\
  \\\\\----------------------------------------------------------------/////

4.2 Consistent use of indentation

Indentation is an important mechanism that adds structure and readability to code. Just like you would use grammar – sentences and paragraphs – to improve the readability of your written communication, you can use indentation to improve your code. A large block of code with no structure is as unreadable as a large block of prose in a novel. (Note that in some languages [e.g. Python] indentation also has a functional purpose – your code will not run correctly if you do not get it right.)

Indentation allows the programmer to delineate which part of code a statement belongs to – for example, any code that is encapsulated by an If statement can be clearly marked by indenting the code:

IF (condition is TRUE) THEN
        actions…
    …

    IF (condition2 is TRUE) THEN
          second set of actions…
      …
    END IF
END IF

4.3 Descriptive use of variable names

Variable names should be short but descriptive. E.g. x1 is not really helpful in interpreting the role of the variable, but AverageAgeMale is much more helpful if the task is about looking at population dynamics. Use names that link with the subject matter of the task.

Generally, variable names should be nouns and functions names should be verbs. This isn’t always an easy thing to do!

Try to avoid using names that are already used in standard functions such as the mean() function in R.

Choose a consistent naming convention and stick with it: never mix your conventions. There are several naming conventions in use, including:

  • Use lowercase letters and separate words with underscores (called “Snake case”). E.g. country_of_birth
  • Use lowercase letters and separate words with hyphens (called “Kebab case”). E.g. country-of-birth
  • Use uppercase letters at the start of each word (called “Pascal case”). E.g. CountryOfBirth
  • Use uppercase letters following the first word (called “Camel Case”). E.g. countryOfBirth
  • In certain languages, you can use spaces as separators and brackets as delineators, e.g. [Country of Birth] But, it’s important to note that spaces can lead to issues in some instances so be sure this convention is appropriate to the software you are using before you start. For example, you can do this in R, but you can’t reference the object in a function.

No one is better than any other – the point is to choose one and use it consistently in your code. If you are modifying someone else’s code that has already established a convention, stick to that convention, even if it isn’t the one you’d usually choose.

4.4 Descriptive and informative comments

Use comments to explain the “why”. The “what” and “how” are usually implied by the code itself. An example of well commented code is given below, the comments explain the reasoning behind each operation and allow the reader to understand the context and purpose of the code.

# We need to identify which pupils are eligible for free school meals
#  so we create an FSM_flag that we can filter on. 
filtered_pupils <- mutate(total_pupils,
                          FSM_flag = ifelse(earnings <= criteria,
                                            1,
                                            0)) %>%
    # We’re only interested in those pupils who are on working tax
    #  credits and have an FSM flag
                    filter(benefit == “WTC”, FSM_flag == 1) %>%

    # The previous operations re-ordered our table so we need to 
    #  rearrange
                    arrange(age, academic_year)
        
# We want to put our table in a wide format so that we can plot 
#  some nice outputs
                    pupils_wide <- spread(filtered_pupils, academic_year, benefit)

4.5 A clear flow and structure

Structure code in a logical order – if you can, break your code into chunks. This can make it much easier to follow. Try not to jump around when manipulating different variables/tables – do your data manipulation on one object and then move onto the next; merge and combine objects at a sensible time.

Strive to limit your code to a reasonable number of characters per line. Doing this means your code is easily readable on one page (remember that others may be working with a smaller view window than you), without the need for scrolling, and it can help to identify different input variables within nested functions.

It may be useful to use commented lines, with no text, to break up your file into easily readable chunks.

There’s even more detail on structure here, and a good example of well-structured code is given below. The script is split into 3 sections and clearly broken up.

# 1. Input variables ---------------------------------------------------
# =====================================================================-

# Input the folder name of outputs
folder <- “2019-08-12”

# 2. Load data --------------------------------------------------------
# ====================================================================-
    
# Load the rates from the folder
    rates <- read.csv(paste0(“Model_outputs/”, folder),
                  row.names = 1)

# Load regional codes
    regions <- read.csv(“Inputs/Regional_codes.csv”)

# 3. Manipulate -------------------------------------------------------
# ====================================================================-

# Bind regional codes to rates
    regional_rates <- merge(rates, regions)

4.6 Consistent spacing

Use white space around code lines and operators. White space around code lines helps to break up the code into readable chunks (see previous section on flow and structure).

Most operators (==, +, -, <-, etc.) should always be surrounded by spaces.

Place a space before and after () when used with if, for or while but never inside or outside parentheses of normal functions. Always put a white space after a comma, never before, just like in regular English.

An example of using spaces in the situations discussed above is given below:

# We want to slice a data table for an example to show the use of commas
fruit_table <- food_table[2, 2:6]

# This is a generic if statement
if (fruit == “Orange”) {
    oranges <- crates * crate_stock - wastage
    }

4.7 If required, the use of functions or subroutines

You shouldn’t be repeatedly writing the same code: if you find yourself doing this, stop! It’s much more efficient to create a function or subroutine instead. By doing this you only need to quality assure the function once, rather than your repetitive code many times. It also means that any future changes need to be implemented only in one place rather than many, and can be more efficient for the code to run.

# Create a function for calculating the weighted average
weight_avg <- function(year, year_grp) {
    
  # Arguments:
  #  year – academic year we’re interested in.
  #  year_grp – the NC year we’re interested in.
  
  # Get the total volumes from the stocks we’re using
  total_stocks <- sum(filter(stocks, AY == year)$vol)
  
  # Isolate the flows we are going to be taking the weighted average 
  #  of
  flows <- filter(prot_flows, NC_Year == year_grp, AY == year)
  
  # Matrix multiplication of stocks and flows, divided by total stocks
  #  to get weighted avg
  weight_avg <- (as.matrix(filter(stocks, AY == year)$vol)) %*%
                  as.matrix(flows)) / 
                  total_stocks

}