Pipelines in R
This article presents a different way of programming in R: replacing scripts with pipelines.
Old approach: Scripting
Most data scientists program in R using a scripting approach, where the code is stored in a standalone file and each line is executed one after the other. This is perfectly okay, but experience quickly shows scripting has at least two major drawbacks:
changing the flow of the code is difficult (e.g. skipping a step, adding a step somewhere in the middle, etc.)
the code is difficult to understand when you revisit it later or share it with someone else
New approach: Pipelines
We propose a new approach that works well in practice (we have many projects to prove it!).
The main components of the pipeline methodology are:
A function file that includes:
Library imports
Global variables (if any), such as a list of predictor names
User-defined functions with very descriptive names (for data loading and preparation, modeling, etc.)
A pipeline file, which executes functions one after the other
A configuration file (optional), which defines options (such as the outcome variable)
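As a rough illustration of the optional configuration file, the sketch below keeps the options in a plain R list and passes them to a function. The names config, outcome_variable, missing_code and describe_outcome are hypothetical and chosen only for this example. In a real project the list would typically live in its own file (say, config.R) and be loaded with source().

```r
# config.R (hypothetical): options collected in one named list
config <- list(
  outcome_variable = "y",   # assumed name of the outcome column
  missing_code     = 999    # assumed code for impossible values
)

# functions.R (hypothetical helper): summarise the outcome named in the config
describe_outcome <- function(df, config)
{
  outcome <- df[[config$outcome_variable]]
  outcome[outcome == config$missing_code] <- NA
  return(summary(outcome))
}

# pipeline.R: toy data stands in for the real dataset
df <- data.frame(x = c(1, 2, 3), y = c(10, 999, 30))
outcome_summary <- describe_outcome(df, config)
print(outcome_summary)
```

Changing the outcome variable then means editing one line of the configuration, not the functions or the pipeline.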
Example
Here is a side-by-side comparison of the scripting approach (script.R) and the pipeline approach (functions.R and pipeline.R) for loading a dataset, replacing impossible values with missing values, and obtaining a summary of the dataset.
script.R
# set working directory
setwd("d:/project/")
# load data
library(readxl)
df <- read_excel("data.xlsx")
df <- as.data.frame(df)
# replace impossible values (coded as 999) with NA
df[df==999] <- NA
# print summary
summary(df)
functions.R
# imports
library(readxl)
# set working directory
set_working_directory <- function()
{
  setwd("d:/project/")
  print(paste("Current WD:", getwd()))
}
# load data
load_data <- function()
{
  df <- read_excel("data.xlsx")
  df <- as.data.frame(df)
  return(df)
}
# replace impossible values (coded as missing_code) with NA
replace_missing_with_NA <- function(df, missing_code=999)
{
  df[df == missing_code] <- NA
  print(paste("Replaced", missing_code, "with NA"))
  return(df)
}
pipeline.R
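# load the function definitions (assumes functions.R is in the current working directory)
source("functions.R")
set_working_directory()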
df <- load_data()
df <- replace_missing_with_NA(df)
summary(df)
Reusing Functions
The pipeline approach takes more lines than the script. However, the code is much easier to edit: you can change a specific function without touching the rest, or change the steps taken in the pipeline. The workflow (the steps taken sequentially in pipeline.R) is also much clearer. These advantages become more pronounced as the workflow grows more complex.
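To make the editing claim concrete, here is a small sketch of changing a pipeline without touching any function body. It uses an in-memory toy data frame instead of data.xlsx so that it runs on its own. load_toy_data and drop_incomplete_rows are hypothetical names introduced for this example, and replace_missing_with_NA mirrors the function from functions.R without the print call.

```r
# hypothetical stand-in for load_data(), so the example needs no Excel file
load_toy_data <- function()
{
  return(data.frame(age = c(25, 999, 40), score = c(3, 7, 999)))
}

# same logic as replace_missing_with_NA in functions.R
replace_missing_with_NA <- function(df, missing_code = 999)
{
  df[df == missing_code] <- NA
  return(df)
}

# hypothetical extra step: keep only rows without missing values
drop_incomplete_rows <- function(df)
{
  return(df[complete.cases(df), , drop = FALSE])
}

# original flow
df <- load_toy_data()
df <- replace_missing_with_NA(df)

# edited flow: one extra step added as a single line;
# skipping a step would be a matter of commenting its line out
df_edited <- load_toy_data()
df_edited <- replace_missing_with_NA(df_edited)
df_edited <- drop_incomplete_rows(df_edited)
print(df_edited)
```

Because each step takes a data frame and returns a data frame, steps can be inserted, removed or reordered in the pipeline file alone.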
Another advantage of the pipeline approach is that you can reuse functions across projects. Many tasks are repeated from one project to the next (defining a working directory, loading data, preparing data, etc.). Once you've written functions for these common tasks, you can simply copy-paste them into your new project. One good way to store such functions is in GitHub gists.
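As an alternative to copy-pasting, reusable functions can also be kept in a single file and loaded with source(). The sketch below writes a tiny function file to a temporary location purely so that the example is self-contained. The file name shared_functions.R and the function count_missing are hypothetical.

```r
# create a stand-in for a shared function file (hypothetical content)
shared_file <- file.path(tempdir(), "shared_functions.R")
writeLines(c(
  "count_missing <- function(df)",
  "{",
  "  return(colSums(is.na(df)))",
  "}"
), shared_file)

# in any project's pipeline: load the shared functions, then call them
source(shared_file)
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
missing_counts <- count_missing(df)
print(missing_counts)
```

In practice the shared file would sit in a fixed location (or a gist) instead of a temporary directory, and each project's pipeline would source it at the top.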
Some ready-to-go functions are available here.