Nearest Neighbor with Gower in Feature Engineering

This article presents a different method for deriving nearest neighbor(s), and shows how to use it in predictive models through feature engineering

Nearest Neighbor and Feature Engineering

A nearest neighbor is an observation (a row in a dataset) that is most similar to a given observation, given a set of selected features (columns). There are several methods available to find the nearest neighbor of a given observation, like k-nearest neighbors (kNN) and support vector machines (SVM).

Methods also vary in how they calculate the distance between observations (the so-called distance metric), meaning that different methods can disagree on which observation is "most similar" to a given observation. Most distance metrics rely on numerical features only, and cannot accommodate categorical features (or even missing data). This is not the case of the Gower metric presented next.

Gower as the Distance Metric

The Gower distance metric was originally developed to compare strings (i.e., to compute text distances) by comparing each character one by one.

The Gower method is extended to tabular data by comparing the features of two rows one by one. If two rows have the same value on a feature, the distance for that feature is 0. Conversely, if they do not have the same value, then:

the distance is 1 if the feature is categorical (i.e., the two rows do not belong to the same category)
the distance is (value1 - value2) divided by the range of the feature if the feature is numerical.

If one of the two values is missing, then no distance is calculated for that feature (as if the feature did not exist). The overall distance between the two observations is the average distance across features (which can be weighted if some features weigh more than others in the distance calculation).

Two major advantages of the Gower metric relative to other methods like SVM and kNN are:

it uses both numerical and categorical features to compare observations
it accommodates missing data (much in the same way as FIML regression does)

Example Calculation of the Gower Distance

To illustrate how the Gower metric is calculated on real data, consider data on these three rows (which represent organizations):

Let's assume that across all organizations in our data, the number of employees ranges from 5 to 90,000 and the turnover ranges from 0% to 42%. The difference (distance) between Organization #1 and #2 is calculated as:

Similarly, the difference (distance) between Organization #1 and #3 is calculated as:

As can be seen above, categorical features lead to a difference of 0 when the two observations belong to the same category, 1 otherwise. Meanwhile, numerical features lead to a difference that is relative to the full range observed in our data for that feature. Finally, missing data (N/A) are simply ignored and do not pose any problem.

In the example above, all variables are weighted equally (as can be seen in the "Average Difference" column), but this does not need to be the case.

For this particular example (if we had only these three rows), we would conclude that Organization #2 is the organization that is most similar to Organization #1, and is therefore its nearest neighbor.

Gower in Feature Engineering

Once you've determined a nearest neighbor (or several nearest neighbors) for each observation, you can use this information to perform feature engineering and add new variables to your dataset.

In the context of a predictive model, you could, for each row in your dataset:

Find the nearest neighbor in the training set (to avoid target leakage)
Add that neighbor's target value as a new variable

You could do this for several neighbors as well (for example, the 5 closest neighbors). In this case, you would find the 5 nearest neighbors for each row, retrieve these neighbors' target values, and take the modal value (if the target is categorical) or mean or median value (if the target is numeric), and add it to each row as a new variable.

In our experience, adding Gower neighbors to the features of a predictive model often leads to great performance boosts. You can add several of these Gower features (for example, one feature that corresponds to the nearest neighbor's target value, and another feature that corresponds to the 5 nearest neighbors' modal or mean value).

How to do it in R

There are a few different librairies in R that implement the Gower distance metric. We have had great success using the library gower. This library uses two functions (that use C as a backend, making it fast and efficient): one function, gower_dist(), calculates a matrix of distances between pairs of rows in your dataset; and gower_topn(), which returns the row index corresponding to the top n closest neighbors (n can be 1, in which case you obtain each row's nearest neighbor, or a larger integer like 5, in which case you obtain each row's 5 nearest neighbors).

There is a relatively new library in Python, gower, but we haven't had much luck in using it.

Practical Tips for Using Gower in Feature Engineering

In the context of a predictive model, where performance on a test set is key, there are several points to keep in mind.

First, you can play around with the weights of the features that are used in calculating Gower distances. In theory, this is an optimization problem, where the values to be optimized are the weights and the criterion is the model performance on a validation set. For this reason, you could use optim() in R, and specify a custom function to be maximized, which calculates the model performance on the validation set.

In practice however, a much more convenient alternative to obtain near-optimal weights is to:

select weight values at random
compute the model performance on a validation set
repeat tens or hundreds of times and retain a well-performing set of weights.

Second, if categorical variables with k categories are one-hot encoded, these variables will effectively count for as much as k numeric variables in determining the nearest neighbor(s). To avoid this problem, one possibility is to set the weights of categorical variables to 1/k. However, if you follow the steps above to optimize the weights, this should not be a concern.

Third, a related difficulty stems from the way Gower calculates the distance for categorical vs. numerical variables. Since the distance for categorical variables is considered to be 1 (the maximum) whenever the categories are not the same, categorical variables can become more important than numeric variables in determining the nearest neighbor(s), particularly when the ranges of the numeric variables are large. Again, however, if you follow the steps above to optimize the weights, this should not be a concern.

Finally, and most importantly, great care needs to be taken to avoid target leakage. In particular, nearest neighbors need to be picked from the training set only (not the test set), because the target values will not be available at testing time. The target should also be excluded from variables to be considered in the distance calculation. For an easy trick to avoid target leakage entirely, take a look at our article on preventing target leakage in machine learning.