Missing Value Treatment

Oct 8, 2023

Missing value treatment is a critical step in data preprocessing to ensure that missing data does not adversely affect the performance of machine learning models.

Handling missing values properly helps prevent biased or inaccurate models and improves their performance. Failing to address them can lead to incorrect predictions, longer training times, and a higher risk of overfitting.

Basic Missing Value Treatment Methods:

1. Deletion of Rows with Missing Values (Listwise Deletion):

Remove rows (or, less commonly, columns) that contain missing values. This is the simplest option, but it discards data.
Python Code (Removing Rows):

df.dropna(inplace=True)
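
Deletion does not have to be all or nothing. As a rough sketch on a small made-up frame (the column names `age`, `city`, and `score` are assumptions for illustration only), `dropna` can be limited to certain columns with `subset`, or applied to columns with `axis=1` and `thresh`:

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "score": [np.nan, np.nan, np.nan, 7.5],
})

# Drop a row only when 'age' is missing; gaps in other columns are tolerated.
rows_kept = df.dropna(subset=["age"])

# Drop any column that has fewer than 2 non-missing values.
cols_kept = df.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

Here only the row with a missing `age` is removed, and only the mostly empty `score` column is dropped, so less data is lost than with a blanket `dropna()`.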

2. Imputation with a Constant Value:

Replace missing values with a constant.

# 'value' is the constant to use, e.g. 0 or 'Unknown'
df['Column_Name'] = df['Column_Name'].fillna(value)
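
When several columns each need their own constant, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values. A minimal sketch, assuming a toy frame with hypothetical `quantity` and `channel` columns:

```python
import numpy as np
import pandas as pd

# Toy data; the column names and fill values are hypothetical.
df = pd.DataFrame({
    "quantity": [3, np.nan, 5],
    "channel": ["web", None, "store"],
})

# One constant per column; the result is assigned back rather than
# modified in place.
df = df.fillna({"quantity": 0, "channel": "unknown"})

print(df)
```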

3. Imputation with Mean, Median, or Mode:

Replace missing values with the mean, median, or mode of the column.

mean_value = df['Column_Name'].mean()  # or .median(), or .mode().iloc[0]
df['Column_Name'] = df['Column_Name'].fillna(mean_value)
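
The choice between mean and median matters most when a column has outliers. The sketch below uses a made-up `income` series (an assumption for illustration) in which one extreme value pulls the mean far away from the typical observation, while the median stays close to it:

```python
import numpy as np
import pandas as pd

# Hypothetical values; 500.0 is a deliberate outlier.
income = pd.Series([30.0, 32.0, 35.0, np.nan, 500.0])

# Mean of the observed values is (30 + 32 + 35 + 500) / 4 = 149.25.
mean_filled = income.fillna(income.mean())

# Median of the observed values is (32 + 35) / 2 = 33.5.
median_filled = income.fillna(income.median())

print(mean_filled[3], median_filled[3])
```

For skewed columns like this one, the median is usually the safer default.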

4. Forward Fill (Last Observation Carried Forward — LOCF):

Replace missing values with the most recent non-missing value.

df['Column_Name'] = df['Column_Name'].ffill()
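
Forward fill assumes the rows are already in time order. When the data holds several independent series, it is usually safer to fill within each group so that values do not leak from one series into the next. A small sketch, assuming hypothetical `sensor` and `reading` columns:

```python
import numpy as np
import pandas as pd

# Toy data, already sorted by time within each sensor; names are hypothetical.
df = pd.DataFrame({
    "sensor": ["a", "a", "a", "b", "b"],
    "reading": [1.0, np.nan, np.nan, np.nan, 4.0],
})

# Forward fill within each sensor, carrying a value forward at most one step.
df["reading_ffill"] = df.groupby("sensor")["reading"].ffill(limit=1)

print(df)
```

The first gap for sensor `a` is filled with 1.0, the second stays missing because of `limit=1`, and the leading gap for sensor `b` stays missing because there is no earlier reading in that group.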

5. Backward Fill (Next Observation Carried Backward — NOCB):

Replace missing values with the next available non-missing value.

df['Column_Name'] = df['Column_Name'].bfill()
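
Backward fill cannot fill a gap at the end of a series, because there is no later value to copy. One common pattern, sketched here on a made-up series, is to chain a forward fill after the backward fill to cover trailing gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading, a middle, and a trailing gap.
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Backward fill alone leaves the trailing gap untouched.
bfilled = s.bfill()

# Backward fill first, then forward fill whatever is still missing.
both = s.bfill().ffill()

print(bfilled.tolist())
print(both.tolist())
```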

6. Interpolation:

Replace missing values by interpolating between adjacent values.

df['Column_Name'] = df['Column_Name'].interpolate(method='linear')
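
Note that `method='linear'` treats the observations as equally spaced and ignores the index. If the series has a `DatetimeIndex` with irregular gaps, `method='time'` weights the estimate by the actual time elapsed. A small sketch with made-up dates and values:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the gap before the last point is two days, not one.
s = pd.Series(
    [10.0, np.nan, 40.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]),
)

# Ignores the index: the missing point lands halfway between 10 and 40.
linear = s.interpolate(method="linear")

# Uses the timestamps: Jan 2 is one third of the way from Jan 1 to Jan 4.
timed = s.interpolate(method="time")

print(linear.iloc[1], timed.iloc[1])
```

With these values the linear estimate is 25, while the time-weighted estimate is about 20.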

7. Imputation with K-Nearest Neighbors (KNN):

Replace missing values with values from the K-nearest neighbors in the feature space.

import pandas as pd
from sklearn.impute import KNNImputer

# KNNImputer works on numeric columns only
imputer = KNNImputer(n_neighbors=5)
df_filled = imputer.fit_transform(df)
df = pd.DataFrame(df_filled, columns=df.columns, index=df.index)
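
To see what the imputer is doing, here is a minimal sketch on a tiny made-up frame (the `height` and `weight` columns are assumptions for illustration). With `n_neighbors=1`, the missing weight is copied from the row whose height is closest:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data; the column names are hypothetical.
df = pd.DataFrame({
    "height": [150.0, 152.0, 180.0, 182.0],
    "weight": [50.0, np.nan, 80.0, 82.0],
})

# Use the single nearest neighbour, measured on the features that are present.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(filled)
```

The row with height 152 is closest to the row with height 150, so it receives a weight of 50. Because the method is distance based, features on very different scales are usually standardized first.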

8. Imputation with Predictive Models:

Train a predictive model (e.g., regression) to predict missing values based on other features.

from sklearn.linear_model import LinearRegression

# Flag the rows where the target column is missing
missing = df['Column_Name'].isna()

# Train a regression model on the rows where the target is known
# (the other feature columns must be numeric and free of missing values)
X_train = df.loc[~missing].drop(columns=['Column_Name'])
y_train = df.loc[~missing, 'Column_Name']
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the missing values from the other features
X_test = df.loc[missing].drop(columns=['Column_Name'])
predicted_values = model.predict(X_test)

# Write the predictions back in place, preserving the original row order
df.loc[missing, 'Column_Name'] = predicted_values
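
The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: `impute_with_regression` is a hypothetical function name, the feature columns are assumed to be numeric with no missing values of their own, and the toy data is built so that `y` is exactly `2 * x`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(df, target, features):
    """Fill missing values in `target` using a linear model on `features`.

    Hypothetical helper: assumes `features` are numeric and complete.
    """
    out = df.copy()
    missing = out[target].isna()
    if missing.any() and (~missing).any():
        model = LinearRegression()
        model.fit(out.loc[~missing, features], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


# Toy data in which y = 2 * x wherever y is observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})

filled = impute_with_regression(df, target="y", features=["x"])
print(filled)
```

On this data the two gaps are filled with values very close to 6 and 10, and the rows keep their original order.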

9. Imputation with Domain-Specific Values:

Replace missing values with domain-specific values based on expert knowledge.

df['Column_Name'] = df['Column_Name'].fillna(domain_value)
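
Domain rules often depend on another column rather than on a single global value. As an illustration only, suppose an expert says that a bungalow always has one floor; the rule, the column names, and the data below are all made up for this sketch:

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "property_type": ["bungalow", "apartment", "bungalow"],
    "floors": [np.nan, 12.0, 1.0],
})

# Hypothetical expert rule: property type -> default number of floors.
domain_defaults = {"bungalow": 1.0}

# Build a per-row default from the rule, then use it only where 'floors'
# is missing. Types without a rule map to NaN and are left untouched.
df["floors"] = df["floors"].fillna(df["property_type"].map(domain_defaults))

print(df)
```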

10. Imputation with Mode for Categorical Data:

Replace missing values in categorical variables with the mode (most frequent category).
Use this carefully, because it can distort the distribution of the categories. An alternative is to treat missing values as a separate category.

mode_value = df['Category_Column'].mode().iloc[0]
df['Category_Column'] = df['Category_Column'].fillna(mode_value)
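
To make the distribution concern concrete, the sketch below compares the two options on a made-up `color` column (the column name and the `"Missing"` label are assumptions for illustration):

```python
import pandas as pd

# Toy categorical data with two missing entries.
df = pd.DataFrame({"color": ["red", "blue", None, "red", None]})

# Option 1: fill with the most frequent category.
mode_filled = df["color"].fillna(df["color"].mode().iloc[0])

# Option 2: keep missingness visible as its own category.
flagged = df["color"].fillna("Missing")

print(mode_filled.value_counts())
print(flagged.value_counts())
```

Mode imputation raises the count of `red` from 2 to 4 out of 5 rows, while the separate category leaves the observed counts unchanged.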

Written by Ishan | Virginia Tech & IIT Delhi
