Regression with Missing Data
This article presents a simple way of accommodating missing data in linear regression without excluding part of the sample, while still obtaining unbiased estimates of predictor effects on the outcome.
Regression with missing data
Missing data are a frequent problem when performing linear regression. The easiest option is to ignore them, which leads to listwise deletion: incomplete cases are simply dropped from the analysis, as if they had no usable data.
There are several drawbacks to listwise deletion:
loss of statistical power to detect effects (we are less likely to detect effects that are actually present in the population, because of the reduced sample size)
the estimates of the effect of each predictor on the outcome can be biased, meaning that on average, the estimates obtained do not correspond to the actual effect in the population
If different people have missing data on different predictors, then adding predictors further compounds these problems (more people are dropped from our sample, and estimates are more likely to be biased).
Types of Missing Data
There are three types of missing data (a small simulation sketch illustrating them follows the list):
Missing completely at random (MCAR): The values are missing randomly and cannot be predicted by any variable, measured or unmeasured
Missing at random (MAR): The values are missing randomly conditional on variables we have included in our model
For example, we have access to the sex of each person, and in our experiment men tended to respond less frequently than women to a specific question
Missing not at random (MNAR): The missing values depend on the value that would have been observed had it not been missing
For example, if people with lower income are less likely to report their income than people who have higher income
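To make the three mechanisms more concrete, here is a small illustrative simulation in R (a sketch of our own; the variable names and numbers are arbitrary and not taken from any real study):
set.seed(1)
n <- 1000
sex <- rbinom(n, 1, 0.5)                          # 1 = man, 0 = woman
income <- rnorm(n, mean = 50 + 5 * sex, sd = 10)  # the variable we may fail to observe
# MCAR: every value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)
# MAR: men skip the question more often, and sex is a variable we measured
income_mar <- ifelse(runif(n) < 0.1 + 0.2 * sex, NA, income)
# MNAR: lower incomes are more likely to be missing, regardless of what else we measured
income_mnar <- ifelse(runif(n) < plogis((50 - income) / 10), NA, income)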
Listwise deletion will lead to a loss of statistical power in all three situations. Also, listwise deletion will lead to biased estimates when the data are either MAR or MNAR; the only situation in which listwise deletion leads only to a loss of statistical power is when the data are MCAR, arguably a rare situation.
The method that we describe below often yields greater statistical power than listwise deletion in all three situations. Also, the method yields unbiased estimates when the data are either MCAR or MAR. Moreover, simulation studies have shown that the method below is often preferable to listwise deletion even when the data are MNAR. For this reason, we recommend the method below for linear regression whenever there are missing data in the variables of interest.
The solution: Full-Information Maximum Likelihood (FIML)
Parameter estimates in linear regression can be obtained from an estimation method called maximum likelihood, which finds the parameter values that are most likely to have generated the observed sample data. In linear regression (assuming normally distributed errors), the maximum likelihood estimates of the coefficients correspond to the OLS estimates, which are readily obtained through formulas (with the minor exception that OLS divides variance estimates by n-1, whereas maximum likelihood divides by n).
Maximum likelihood estimation has been extended to accommodate missing data. This version is called full-information maximum likelihood (FIML). FIML finds the parameters that are most likely to have generated the observed data, using whatever data each case provides, whatever its pattern of missingness. Each person in the sample contributes a log-likelihood for each candidate set of parameter values, computed from that person's observed data only (variables with missing values are simply left out of that person's contribution); the overall log-likelihood for a candidate set of parameter values is the sum of these individual contributions. The parameter values with the highest overall likelihood are the final estimates.
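In symbols (our notation, not the article's), FIML maximizes the observed-data log-likelihood

\ell(\mu, \Sigma) = \sum_{i=1}^{n} \log f\left( \mathbf{y}_{i,\mathrm{obs}} \mid \mu_{i,\mathrm{obs}}, \Sigma_{i,\mathrm{obs}} \right)

where each case i contributes a normal density evaluated only at its observed variables, with the model-implied mean vector and covariance matrix reduced to those variables. With no missing data at all, this is ordinary maximum likelihood, which for the regression coefficients gives back the OLS estimates.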
How to do it in R
In this article, we show how to perform FIML estimation in R using the lavaan package. FIML can also be obtained in other structural equation modeling (SEM) software, like Mplus, Amos, OpenMx (also in R), LISREL, and others. FIML is probably not as easily achieved in Python, unless you program your own FIML estimator.
The code below shows how to predict variable y from predictors x1 and x2 using OLS regression with lm() and FIML regression with sem():
OLS Regression
fit <- lm(y ~ x1 + x2, data=df)  # incomplete cases are dropped by default (listwise deletion)
summary(fit)
FIML Regression
install.packages("lavaan")
library(lavaan)
fit <- sem("y ~ x1 + x2", data=df, missing="ml.x")
summary(fit) # with options, like standardized=TRUE
Note that missing="ml.x" is the option that requests FIML estimation, allowing for missing data on the predictors as well as the outcome. By default, the sem() function performs listwise deletion, just like lm(). The package documentation explains these options clearly.
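Once the model has been fit, a few follow-up calls can be handy for extracting results (a brief sketch, using the fit object created above):
parameterEstimates(fit)      # estimates, standard errors, p-values, confidence intervals
standardizedSolution(fit)    # standardized coefficients
coef(fit)                    # point estimates only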
Example: OLS vs FIML regression
To illustrate the two analyses in action, we use real data from one of our clients. These data come from a survey administered to about 250 families who used medical services from a clinic. The survey asks about the families' satisfaction with the clinic. Here, we focus on the subscale looking at satisfaction with the providers' empathy.
Our client was interested in the effect of clinic characteristics (e.g., wait time to get an appointment) and family characteristics (e.g., father education level) on satisfaction with the clinic (from 1 to 5). There were missing data scattered across all 3 variables (empathy: 6%; wait time: 19%; father education: 11%), with only partial overlap (for example, only 3% had values missing on both predictors).
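Percentages like these are easy to compute before fitting anything. A minimal sketch, assuming the three variables are stored in df under the hypothetical names empathy, wait_time, and father_edu:
vars <- c("empathy", "wait_time", "father_edu")
colMeans(is.na(df[vars]))          # proportion of missing values per variable
mean(!complete.cases(df[vars]))    # proportion of families lost to listwise deletion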
The results from both OLS regression and FIML regression are shown here:
OLS Regression
Intercept: 4.127 (p < .001)
Wait Time: -0.017 (p = .226)
Father edu.: 0.020 (p = .783)
FIML Regression
Intercept: 4.196 (p < .001)
Wait Time: -0.024 (p = .032)
Father edu.: 0.009 (p = .890)
The results for the two analyses are very similar (all effects are in the same direction), with one notable exception: we were able to detect the negative effect of a longer wait time on satisfaction with the clinic only when accommodating the missing data through FIML regression. This is partly because 73 families (about 28% of the total sample) were dropped from the traditional OLS regression because of missing data.
Bonus: OLS vs FIML with Complete Data
As a bonus, to convince ourselves that OLS regression estimates are also maximum likelihood estimates, here are the results of both regressions when the maximum likelihood regression is also restricted to complete cases (listwise deletion); they are practically identical. A sketch of the corresponding lavaan call follows the results.
OLS Regression
Intercept: 4.127 (p < .001)
Wait Time: -0.017 (p = .226)
Father edu.: 0.020 (p = .783)
FIML Reg. (listwise deletion)
Intercept: 4.127 (p < .001)
Wait Time: -0.017 (p = .221)
Father edu.: 0.020 (p = .781)
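For reference, the listwise-deletion maximum likelihood fit can be obtained in lavaan by setting the missing argument explicitly (listwise deletion is also the default when the argument is omitted); a minimal sketch:
fit_listwise <- sem("y ~ x1 + x2", data=df, missing="listwise", meanstructure=TRUE)
summary(fit_listwise)  # meanstructure=TRUE makes the intercept appear in the output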
Alternative solution: Multiple imputation
An alternative solution to the missing data problem in regression is multiple imputation. Multiple imputation generates plausible guesses for each missing value and imputes them several times (hence the name), producing several completed datasets. The analysis (in our case, the regression) is run on each imputed dataset (which can be time-consuming for some analyses), and the results are pooled. However, it is less convenient than FIML for at least one reason: the results change slightly from one run to the next, because of the randomness embedded in the generation of plausible values for each missing value.
To perform multiple imputation, a great tool is the package mice in R.
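A minimal sketch of that workflow for the same regression (the number of imputations and the seed are our own arbitrary choices):
install.packages("mice")
library(mice)
imp <- mice(df, m = 20, seed = 123)    # generate 20 imputed data sets
fits <- with(imp, lm(y ~ x1 + x2))     # run the regression on each imputed data set
summary(pool(fits))                    # pool the 20 sets of results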
Both FIML and multiple imputation require us to make a decision about which variables are related to the missing data, whether or not such variables are included in the regression. In FIML estimation, variables related to missingness should be included through auxiliary variables; in multiple imputation, these variables should be included in the imputation model. A good reference for further reading is provided below.
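On the FIML side, the semTools package offers helpers for adding auxiliary variables to a lavaan model through the saturated-correlates approach; something along the following lines should work, but check the package documentation for the exact interface (z1 and z2 are hypothetical variables assumed to be related to the missingness):
library(semTools)
fit_aux <- sem.auxiliary(model = "y ~ x1 + x2", data = df, aux = c("z1", "z2"))
summary(fit_aux)
In the multiple imputation workflow above, the same goal is reached simply by keeping such variables in the data frame passed to mice(), so that they inform the imputations even though they do not appear in the regression itself.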
Further reading
A complete reference for FIML (and multiple imputation), with clear explanations and simple numerical examples, is this book:
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.