Missing Value Treatment — Advanced

7 min read · Oct 8, 2023

In data preprocessing, addressing missing values is a crucial step to ensure the integrity and accuracy of any analysis.

While basic imputation methods like mean or median filling have their place, they ignore the relationships between variables. Advanced missing value treatment methods use those relationships to produce more plausible fills.

These techniques, including MICE, Bayesian imputation, and more, let data scientists account for complex relationships and uncertainty within their datasets.

Ten advanced missing value treatment methods:

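1. Multiple Imputation using Iterative Imputer

Multiple imputation generates several imputed datasets, analyzes each, and pools the results for robust estimates.
The IterativeImputer from scikit-learn models each feature with missing values as a function of the other features, using a sequence of regression models. A single call returns one imputed dataset; to obtain several, rerun it with sample_posterior=True and different random_state values.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer

# df is assumed to be an existing all-numeric DataFrame with missing values
imputer = IterativeImputer(max_iter=10, random_state=0)
df_filled = imputer.fit_transform(df)
df = pd.DataFrame(df_filled, columns=df.columns, index=df.index)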
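To make the "multiple" part concrete, here is a minimal sketch that runs IterativeImputer several times with sample_posterior=True and averages one statistic across the completed datasets. The toy DataFrame df_demo, the choice of five imputations, and the pooled statistic (the mean of column 'b') are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: 'b' is roughly 2 * 'a', with two gaps
df_demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'b': [2.1, 3.9, np.nan, 8.2, 9.8, np.nan, 14.1, 16.0],
})

n_imputations = 5  # assumed number of imputed datasets
estimates = []
for seed in range(n_imputations):
    # sample_posterior=True draws each fill from the predictive distribution,
    # so different seeds give different completed datasets
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df_demo), columns=df_demo.columns)
    estimates.append(completed['b'].mean())

# Pool the point estimates by averaging; the spread across imputations
# reflects the extra uncertainty caused by the missing values
pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))
print(pooled_mean, between_var)
```

Averaging the per-dataset estimates is the point-estimate part of Rubin's rules; a full analysis would also combine the within-imputation variances.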

2. Time Series Imputation

Utilize time series methods (e.g., interpolation with time consideration) for missing values in time series data.

Linear Interpolation:

Linear interpolation assumes that values change linearly between two adjacent time points. This method is simple and can be used when the time intervals between data points are relatively constant.

import pandas as pd

# Sample time series data with missing values
data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': [10, 15, None, 25, None, 35, 40, None, 50, 55]
})

# Perform linear interpolation to fill missing values
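# (assign the result back; calling inplace=True on a selected column is unreliable in recent pandas)
data['Value'] = data['Value'].interpolate(method='linear')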
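When the time intervals are not constant, pandas can weight the interpolation by elapsed time instead of by row position. The sketch below compares the two on a small, hypothetical series with an irregular DatetimeIndex; the series ts and its dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with uneven gaps between timestamps
ts = pd.Series(
    [10.0, np.nan, 30.0, np.nan, 80.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
                          '2023-01-07', '2023-01-08']),
)

# 'linear' ignores the index and treats rows as equally spaced
by_position = ts.interpolate(method='linear')

# 'time' uses the DatetimeIndex, so a point 4 days into a 5-day gap
# is placed 4/5 of the way between its neighbours
by_time = ts.interpolate(method='time')

print(pd.DataFrame({'linear': by_position, 'time': by_time}))
```

For the value on 2023-01-07, position-based interpolation gives the midpoint of 30 and 80, while time-based interpolation lands much closer to 80 because that date is only one day before the next observation.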

Time-Based Rolling Mean:

Fill missing values by calculating a rolling mean (moving average) based on nearby time points. This method can help smooth out irregularities in the data.

import pandas as pd

# Sample time series data with missing values
data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': [10, 15, None, 25, None, 35, 40, None, 50, 55]
})

# Calculate a trailing 3-day rolling mean (current row and the two before it) to fill missing values
data['Value'] = data['Value'].fillna(data['Value'].rolling(window=3, min_periods=1).mean())
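A trailing window only looks backwards, so on trending data its fills lag behind the trend. A small sketch comparing a trailing and a centered window on the same values follows; the variable names are illustrative, and center=True is simply one option to consider when future observations are available at imputation time.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 15, np.nan, 25, np.nan, 35, 40, np.nan, 50, 55], dtype=float)

# Trailing window: mean of the current row and the two rows before it
trailing = values.fillna(values.rolling(window=3, min_periods=1).mean())

# Centered window: mean of the previous, current, and next rows
centered = values.fillna(values.rolling(window=3, min_periods=1, center=True).mean())

print(pd.DataFrame({'original': values, 'trailing': trailing, 'centered': centered}))
```

On this steadily increasing series the centered fills sit between the neighbouring observations, while the trailing fills are pulled towards earlier, smaller values.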

Time-Based Forward Fill (LOCF — Last Observation Carried Forward):

Fill missing values by carrying forward the last observed value to the missing time points.

import pandas as pd

# Sample time series data with missing values
data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': [10, 15, None, 25, None, 35, 40, None, 50, 55]
})

# Perform forward fill to carry forward the last observation
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the direct equivalent)
data['Value'] = data['Value'].ffill()
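Forward fill has two edge cases worth checking: it cannot fill a gap at the very start of a series, and it will happily stretch one observation across a long gap. The sketch below, on a hypothetical series s, shows the limit parameter and a backward fill for the leading gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10.0, np.nan, np.nan, np.nan, 50.0])

# Carry the last observation forward across at most 2 consecutive gaps;
# the third gap and the leading gap stay missing
capped = s.ffill(limit=2)

# Forward fill everything, then backward fill the leading gap
complete = s.ffill().bfill()

print(pd.DataFrame({'original': s, 'capped': capped, 'complete': complete}))
```

Capping the fill length keeps stale values from being propagated too far; what counts as "too far" depends on the data and is a judgment call.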

3. Data-Driven Imputation

Use clustering or dimensionality reduction to impute missing values based on similar data points.

import pandas as pd
from sklearn.cluster import KMeans

# Sample dataset with missing values
data = pd.DataFrame({
    'Feature1': [1, 2, 3, None, 5, 6, 7, None, 9, 10],
    'Feature2': [11, 12, 13, None, None, 16, 17, 18, None, 20],
})
features = ['Feature1', 'Feature2']

# Number of clusters for k-means
n_clusters = 3

# KMeans cannot handle NaN, so cluster on a temporary copy filled with column means
temp_filled = data[features].fillna(data[features].mean())

# Perform k-means clustering to identify similar data points
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(temp_filled)

# Mean of the observed (non-missing) values in each row's cluster
cluster_means = data.groupby('Cluster')[features].transform('mean')

# Fill only the missing entries; fall back to the column mean if a cluster has no observed value
imputed_data = data[features].fillna(cluster_means).fillna(data[features].mean())

print("Original Dataset with Missing Values:")
print(data)
print("\nImputed Dataset:")
print(imputed_data)

Observed values are left unchanged, and each missing entry is replaced by the mean of the observed values in its cluster. The exact imputed numbers depend on the cluster assignments, which can vary with the random seed and the scikit-learn version.
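Another way to impute "based on similar data points" without forming explicit clusters is nearest-neighbour imputation. The sketch below uses scikit-learn's KNNImputer on the same sample values; n_neighbors=2 is an arbitrary choice for this tiny dataset, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data_knn = pd.DataFrame({
    'Feature1': [1, 2, 3, np.nan, 5, 6, 7, np.nan, 9, 10],
    'Feature2': [11, 12, 13, np.nan, np.nan, 16, 17, 18, np.nan, 20],
})

# Each missing entry is filled with the average of that feature over the
# nearest rows, where distance is computed on the features both rows have observed
knn_imputer = KNNImputer(n_neighbors=2)
imputed_knn = pd.DataFrame(knn_imputer.fit_transform(data_knn), columns=data_knn.columns)

print(imputed_knn)
```

A row with every feature missing has no usable distances, so it falls back to the column averages; in practice such rows carry no information and are often dropped instead.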

4. Regression Imputation:

Predict missing values using regression models trained on available data.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample dataset with missing values
data = pd.DataFrame({
    'X': [1, 2, 3, 4, 5, 6, None, 8, 9, 10],
    'Y': [2, 4, 6, 8, 10, 12, 14, 16, None, 20]
})

# Rows with both values observed are used to train the models
data_complete = data.dropna()

# Impute missing Y values from X (a row can only be predicted if its X is observed)
y_missing = data['Y'].isna() & data['X'].notna()
model_y = LinearRegression().fit(data_complete[['X']], data_complete['Y'])
data.loc[y_missing, 'Y'] = model_y.predict(data.loc[y_missing, ['X']])

# Impute missing X values from Y in the same way
x_missing = data['X'].isna() & data['Y'].notna()
model_x = LinearRegression().fit(data_complete[['Y']], data_complete['X'])
data.loc[x_missing, 'X'] = model_x.predict(data.loc[x_missing, ['Y']])

print("Imputed Dataset:")
print(data)

Imputed Dataset:

      X     Y
0   1.0   2.0
1   2.0   4.0
2   3.0   6.0
3   4.0   8.0
4   5.0  10.0
5   6.0  12.0
6   7.0  14.0
7   8.0  16.0
8   9.0  18.0
9  10.0  20.0

5. MissForest Algorithm

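Utilize the missingpy library to apply the MissForest algorithm, a random forest-based imputation method that iteratively predicts each variable's missing values from the other variables.

import pandas as pd
from missingpy import MissForest  # note: missingpy may fail to import with recent scikit-learn releases

# df is assumed to be an existing numeric DataFrame with missing values
imputer = MissForest()
df_filled = imputer.fit_transform(df)
df = pd.DataFrame(df_filled, columns=df.columns, index=df.index)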
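If missingpy is not available, a similar idea can be approximated with scikit-learn alone by plugging a random forest into IterativeImputer. This is a stand-in for illustration, not the MissForest implementation itself: it handles numeric columns only, and the synthetic DataFrame df_rf and the hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical non-linear data: y depends on x squared, with 10 values removed
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
df_rf = pd.DataFrame({'x': x, 'y': x ** 2 + rng.normal(0, 1, size=60)})
df_rf.loc[rng.choice(60, size=10, replace=False), 'y'] = np.nan

# Each column with gaps is modelled by a random forest on the other columns
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
df_rf_filled = pd.DataFrame(rf_imputer.fit_transform(df_rf), columns=df_rf.columns)

print(df_rf_filled[df_rf['y'].isna()])
```

Tree-based estimators can capture non-linear relationships such as the squared term here, which a linear imputation model would miss.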

6. Rule-Based Imputation with Custom Functions

Apply specific business rules to impute missing values based on context.
Define custom imputation functions based on domain-specific rules and apply them to fill missing values.

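# Define a custom imputation function based on business rules
def custom_imputation(column):
    # Implement your rules and logic here. The function should return a scalar
    # or a Series aligned with `column` that holds the fill values.
    # Placeholder rule (an assumption for illustration): use the column median.
    imputed_values = column.median()
    return imputed_values

# fillna expects values, not a function, so call the function first
df['Column_Name'] = df['Column_Name'].fillna(custom_imputation(df['Column_Name']))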
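As a concrete illustration, the sketch below applies two made-up business rules to a hypothetical orders table: pickup orders never have a shipping fee, and any other missing fee takes the median fee of delivery orders in the same region. The table, column names, and rules are all invented for the example.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'order_type': ['pickup', 'delivery', 'delivery', 'pickup', 'delivery'],
    'region': ['north', 'north', 'south', 'south', 'south'],
    'shipping_fee': [np.nan, 5.0, np.nan, 0.0, 7.0],
})

def impute_shipping_fee(frame):
    fee = frame['shipping_fee'].copy()
    # Rule 1: pickup orders never incur a shipping fee
    fee = fee.mask(fee.isna() & (frame['order_type'] == 'pickup'), 0.0)
    # Rule 2: remaining gaps take the median delivery fee of the same region
    regional_median = (
        frame.loc[frame['order_type'] == 'delivery']
        .groupby('region')['shipping_fee']
        .median()
    )
    return fee.fillna(frame['region'].map(regional_median))

orders['shipping_fee'] = impute_shipping_fee(orders)
print(orders)
```

Ordering matters: the more specific rule runs first so the broader fallback only touches values that no earlier rule could explain.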

7. MICE (Multivariate Imputation by Chained Equations)

An iterative imputation method that models each variable with missing data as a function of the others.

import pandas as pd
from fancyimpute import IterativeImputer

# Sample dataset with missing values
data = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, None, 7, 8, 9, 10],
    'X2': [2, None, 6, None, 10, 12, None, 16, 18, 20],
})

# Initialize the MICE imputer
imputer = IterativeImputer()

# Perform MICE imputation on the dataset
imputed_data = imputer.fit_transform(data)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=data.columns)

print("Original Dataset with Missing Values:")
print(data)

print("\nImputed Dataset:")
print(imputed_df)

Original Dataset with Missing Values:
     X1    X2
0   1.0   2.0
1   2.0   NaN
2   3.0   6.0
3   4.0   NaN
4   5.0  10.0
5   NaN  12.0
6   7.0   NaN
7   8.0  16.0
8   9.0  18.0
9  10.0  20.0

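Imputed Dataset (approximate):

On the fully observed rows X2 is exactly 2 × X1, so the imputer should fill X1 in row 5 with a value close to 6 and X2 in rows 1, 3, and 6 with values close to 4, 8, and 14. The exact decimals depend on the library version and the number of iterations.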

In this example:

We start with a sample dataset (data) containing missing values in columns 'X1' and 'X2'.
We initialize the MICE imputer using fancyimpute's IterativeImputer (in recent fancyimpute releases this is scikit-learn's IterativeImputer, re-exported).
The fit_transform method of the imputer is used to perform MICE imputation on the dataset.
The imputed data is converted back to a DataFrame (imputed_df).

MICE is a powerful method for imputing missing values, especially when variables are interrelated, as it considers the relationships between variables during imputation.
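To show what "chained equations" means mechanically, here is a hand-rolled sketch of the loop on the same sample data: start from a mean fill, then repeatedly re-predict each column's missing entries from the other column. It uses plain LinearRegression and a fixed 20 passes, both assumptions for illustration; library implementations add refinements such as convergence checks and posterior sampling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

data_ce = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, np.nan, 7, 8, 9, 10],
    'X2': [2, np.nan, 6, np.nan, 10, 12, np.nan, 16, 18, 20],
})

missing_mask = data_ce.isna()
filled = data_ce.fillna(data_ce.mean())  # initial guess: column means

for _ in range(20):  # fixed number of passes, chosen arbitrarily
    for target in data_ce.columns:
        predictors = [c for c in data_ce.columns if c != target]
        observed_rows = ~missing_mask[target]
        to_fill = missing_mask[target]
        # Fit on rows where the target was actually observed, using the
        # current (partly imputed) values of the other columns as inputs
        model = LinearRegression().fit(
            filled.loc[observed_rows, predictors],
            filled.loc[observed_rows, target],
        )
        # Overwrite only the originally missing entries with fresh predictions
        filled.loc[to_fill, target] = model.predict(filled.loc[to_fill, predictors])

print(filled)
```

Because the observed rows satisfy X2 = 2 × X1 exactly, the fills should settle near that line after a few passes.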

8. Bayesian Imputation

Use Bayesian models, such as Bayesian regression or Bayesian networks, to impute missing values while carrying the model's uncertainty into the imputations.
We’ll use the pymc3 library, a Python library for Bayesian modeling (since succeeded by PyMC, whose API differs slightly).

import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

# Sample dataset with missing values
data = pd.DataFrame({
    'X': [1, 2, 3, 4, 5, None, 7, 8, 9, 10],
    'Y': [2, None, 6, None, 10, 12, None, 16, 18, 20],
})

# Fit on rows where both X and Y are observed; impute Y only where X is available
observed = data.dropna()
to_impute = data['Y'].isna() & data['X'].notna()

# Define a Bayesian Linear Regression model
with pm.Model() as bayesian_model:
    # Priors for the model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Likelihood of the observed data (mu and the observed values must have the same length)
    mu = alpha + beta * observed['X'].values
    y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=observed['Y'].values)

    # Sample from the posterior distribution
    trace = pm.sample(2000, tune=1000, cores=1, return_inferencedata=False)

# Draw posterior predictive samples of Y at the X values whose Y is missing
rng = np.random.default_rng(0)
x_new = data.loc[to_impute, 'X'].values
mu_new = trace['alpha'][:, None] + trace['beta'][:, None] * x_new
y_pred = rng.normal(mu_new, trace['sigma'][:, None])

# Impute missing values by taking the mean of posterior predictions
data.loc[to_impute, 'Y'] = y_pred.mean(axis=0)

# Plot the posterior distributions of the model parameters
pm.traceplot(trace)
plt.show()

print("Imputed Dataset (row 5 keeps its missing X):")
print(data)

In this example:

We define a Bayesian Linear Regression model with priors for model parameters (alpha, beta, sigma).
We use MCMC sampling (using the sample method) to estimate the posterior distribution of model parameters.
We perform posterior predictive sampling to generate predictions for missing values.
Missing values in ‘Y’ are imputed using the mean of the posterior predictive samples.
We plot the posterior distributions for model parameters using traceplot.

Note that Bayesian modeling can be computationally intensive, and this example is for educational purposes. Depending on your dataset and the complexity of your Bayesian model, you may need to adjust sampling parameters and model complexity accordingly.
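For a dependency-light view of the same idea, the sketch below swaps MCMC for a closed-form posterior: with a Gaussian prior on the coefficients and a noise level treated as known, Bayesian linear regression has an exact Gaussian posterior that NumPy can sample directly. The dataset data_b, the noise level sigma, and the prior width tau are assumptions for illustration; this is a simplification, not the pymc3 model above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Y is roughly 2 * X, with three Y values missing
data_b = pd.DataFrame({
    'X': [1, 2, 3, 4, 5, 7, 8, 9, 10],
    'Y': [2.2, np.nan, 5.9, np.nan, 10.1, np.nan, 16.2, 17.8, 20.1],
})

obs = data_b.dropna()
A = np.column_stack([np.ones(len(obs)), obs['X'].to_numpy()])  # design matrix [1, x]
y = obs['Y'].to_numpy()

sigma = 0.5  # assumed known noise standard deviation
tau = 10.0   # assumed prior standard deviation for intercept and slope

# Conjugate Gaussian posterior over (intercept, slope)
post_cov = np.linalg.inv(A.T @ A / sigma**2 + np.eye(2) / tau**2)
post_mean = post_cov @ (A.T @ y) / sigma**2

# Posterior predictive draws for the rows with missing Y
rng = np.random.default_rng(0)
coef_draws = rng.multivariate_normal(post_mean, post_cov, size=2000)
missing = data_b['Y'].isna()
A_new = np.column_stack([np.ones(int(missing.sum())), data_b.loc[missing, 'X'].to_numpy()])
y_draws = rng.normal(coef_draws @ A_new.T, sigma)  # shape (2000, n_missing)

imputed_mean = y_draws.mean(axis=0)
lower, upper = np.percentile(y_draws, [2.5, 97.5], axis=0)
print(pd.DataFrame({'mean': imputed_mean, 'lower': lower, 'upper': upper}))
```

Keeping the interval (or several individual draws) alongside the mean is what distinguishes this from plain regression imputation: downstream analyses can see how uncertain each fill is.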

9. Deep Learning-Based Imputation

Employ deep learning models (e.g., autoencoders) for complex imputation tasks. For time series, LSTM (Long Short-Term Memory) networks can be used for imputation.
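As a small stand-in for a full deep learning framework, the sketch below trains a denoising-autoencoder-style network with scikit-learn's MLPRegressor: inputs are randomly corrupted, the network learns to reconstruct the uncorrupted rows, and its reconstructions replace only the originally missing entries. The synthetic data, the 3-16-2-16-3 architecture, and the corruption rate are all assumptions; a production model would more likely be built in PyTorch or Keras.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical correlated data: three features driven by one latent factor
latent = rng.normal(size=(500, 1))
X_full = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(500, 3))

# Remove about 15% of the entries at random
X_missing = X_full.copy()
holes = rng.random(X_full.shape) < 0.15
X_missing[holes] = np.nan

# Step 1: start from a column-mean fill
col_means = np.nanmean(X_missing, axis=0)
X_init = np.where(np.isnan(X_missing), col_means, X_missing)

# Step 2: corrupt a random 30% of the inputs and train the network to
# reconstruct the uncorrupted (mean-filled) rows through a narrow bottleneck
corrupt = rng.random(X_init.shape) < 0.3
X_corrupted = np.where(corrupt, col_means, X_init)
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16), max_iter=2000, random_state=0)
autoencoder.fit(X_corrupted, X_init)

# Step 3: keep observed values, use reconstructions only for the gaps
X_recon = autoencoder.predict(X_init)
X_imputed = np.where(np.isnan(X_missing), X_recon, X_missing)

print(X_imputed[:5])
```

The bottleneck forces the network to rely on correlations between features, which is what lets it guess a missing entry from the ones that are present.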

10. Synthetic Data Generation using GANs (Generative Adversarial Networks)

Generate synthetic data to replace missing values while preserving data distribution.

Additional Blogs by Author (ishanjain-ai.medium.com)

1. Python Function: Type of Arguments in a Function
Python function arguments are like giving cooking instructions to a helper: use ‘Positional’ for step-by-step…

2. Understanding Python’s Try-Except Statements: A Safety Net for Your Code
How Try-Except Helps Handle Errors, Why You Need It, and the Pitfalls to Avoid

3. Exploring Python Classes and Object-Oriented Programming
Understanding Classes, Inheritance, Encapsulation, and Static Methods in Python

4. Lambda Functions in Python
20 creative examples of Lambda Functions for Expressive Coding

5. Python Pandas: Creative Data Manipulation and Analysis
Python Pandas offers two primary data structures: DataFrame and Series, which are powerful and flexible for data…

6. Decision Tree — Entropy and Information Gain for 3 Outcomes
Calculate entropy and information gain using the logarithm base 3 (log of 3)

7. Random Forests Algorithm — A simple guide
Random Forests is an ensemble learning algorithm that combines multiple decision trees to improve predictive accuracy…

Written by Ishan | Virginia Tech & IIT Delhi
