Preventing Target Leakage
This article presents a simple way to prevent accidental target leakage during feature engineering and throughout a machine learning pipeline.
Target Leakage
In machine learning, target leakage refers to training a model on information about the target that will not be available at prediction time. In other words, the model has access to target-related information during training that it will not have when making predictions in production.
Target leakage can be very explicit, for example when using a transformation of the target as a predictor of the target. Alternatively, target leakage can be insidious and show up in feature engineering, for example when using the target along with other predictors to generate a nearest-neighbor variable.
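To make the nearest-neighbor case concrete, here is a minimal sketch in plain Python. The toy data and the function names `leaky_neighbor_feature` and `safer_neighbor_feature` are hypothetical, chosen only for illustration. In the leaky version the target enters the distance and each row can match itself, so the feature simply reproduces the target.

```python
# Hypothetical toy data: one predictor `x` and the target `y`.
rows = [
    {"x": 1.0, "y": 10.0},
    {"x": 2.0, "y": 20.0},
    {"x": 3.0, "y": 30.0},
    {"x": 10.0, "y": 100.0},
]


def leaky_neighbor_feature(row, pool):
    """Target of the nearest row when the distance itself uses the target."""
    def distance(other):
        return abs(row["x"] - other["x"]) + abs(row["y"] - other["y"])
    return min(pool, key=distance)["y"]


def safer_neighbor_feature(row, pool):
    """Target of the nearest *other* row, with distance based on `x` only."""
    others = [other for other in pool if other is not row]
    return min(others, key=lambda other: abs(row["x"] - other["x"]))["y"]


targets = [row["y"] for row in rows]
leaky = [leaky_neighbor_feature(row, rows) for row in rows]
safer = [safer_neighbor_feature(row, rows) for row in rows]

# The leaky feature reproduces the target exactly, so a model trained on it
# looks perfect, yet the feature cannot be computed once `y` is unknown.
print(leaky == targets)  # True
print(safer == targets)  # False
```

The second function is only a safer variant under these assumptions. Whether a neighbor-based feature is leak-free in a real pipeline also depends on how the data are split.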
Why avoid target leakage?
Target leakage should be avoided, because once in production the model will not face the same conditions it faced during training. There are at least two drawbacks of this:
The estimated performance of the model is inflated, often substantially, relative to the performance that will be observed in production.
The model does not get a chance to learn the correct relationships between variables, having most likely focused on the target-related information contained in the training set. The resulting model may or may not be satisfactory.
A simple way to prevent target leakage
A simple way to prevent target leakage is to remove the target from the training and test sets until it is needed for training or for evaluating predictions. This ensures that no target leakage can occur, not even the accidental kind (the most insidious), including during feature engineering.
The method we propose to achieve this follows these steps:
1. Create a reference table with two columns: a unique ID and the corresponding target value
2. Replace the target variable with missing values in the training and test sets
3. When needed, add the target values back to the data frames by joining with the reference table (created in step 1)
   - For the training set, this should be done prior to training the model (or temporarily for feature engineering)
   - For the test set, this should be done after generating model predictions to evaluate model performance
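The steps above can be sketched as follows. This is a minimal illustration in plain Python, using lists of dictionaries in place of data frames. It is not the R code from the gist, and the helper names `build_reference`, `mask_target`, and `restore_target`, the column names, and the toy data are all assumptions made for the example.

```python
def build_reference(rows, id_col="id", target_col="target"):
    """Step 1: map each unique ID to its target value."""
    reference = {}
    for row in rows:
        if row[id_col] in reference:
            raise ValueError(f"duplicate ID: {row[id_col]!r}")
        reference[row[id_col]] = row[target_col]
    return reference


def mask_target(rows, target_col="target"):
    """Step 2: return copies of the rows with the target set to missing."""
    return [{**row, target_col: None} for row in rows]


def restore_target(rows, reference, id_col="id", target_col="target"):
    """Step 3: join the target values back onto the rows by ID."""
    return [{**row, target_col: reference[row[id_col]]} for row in rows]


# Hypothetical toy data.
train = [
    {"id": 1, "x": 0.5, "target": 1},
    {"id": 2, "x": 1.5, "target": 0},
]
test = [{"id": 3, "x": 2.5, "target": 1}]

reference = build_reference(train + test)
train_masked = mask_target(train)
test_masked = mask_target(test)

# Feature engineering happens here, on data that carries no target values.
for row in train_masked + test_masked:
    row["x_squared"] = row["x"] ** 2

# Training set: restore the target just before fitting the model.
train_ready = restore_target(train_masked, reference)

# Test set: restore the target only after predictions exist, for evaluation.
test_scored = restore_target(test_masked, reference)
```

Because the masked rows hold only missing values in the target column, any feature engineering step that accidentally reads the target gets nothing useful from it. With a data-frame library, the same idea would typically be expressed as a join on the ID column.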
Some functions in R that perform these steps on the test set are already available in this gist and can be added directly to your pipeline. (See our article on refactoring your code from traditional scripting to pipelines.)