Missing Value Treatment
Missing value treatment is a critical step in data preprocessing to ensure that missing data does not adversely affect the performance of machine learning models.
Properly handling missing values helps prevent biased or inaccurate models and improves their performance. Failing to address missing values can lead to incorrect predictions and, depending on the model, longer training times and a higher risk of overfitting.
Basic Missing Value Treatment Methods:
1. Deletion of Rows with Missing Values (Listwise Deletion):
- Remove rows that contain missing values. When a column is mostly empty, dropping the column instead can preserve more rows.
- Python Code (Removing Rows):
df.dropna(inplace=True)
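The one-liner above removes every row with at least one gap. As a minimal sketch of the column-wise variant, the example below uses a hypothetical toy frame (the column names `a`, `b`, `c` and the threshold of 3 are assumptions for illustration) to drop a mostly empty column first and only then drop the remaining incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" is mostly empty, row 1 has one gap in "a".
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

# Keep only columns with at least 3 non-missing values (this removes "c").
df_cols = df.dropna(axis=1, thresh=3)

# Then drop any remaining rows that still contain a missing value.
df_clean = df_cols.dropna(axis=0, how="any")

print(df_clean)
```

Dropping the sparse column first keeps three of the four rows here; calling `dropna()` directly on the original frame would keep only the last one.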
2. Imputation with a Constant Value:
- Replace missing values with a constant.
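# Assign the result back: inplace=True on a selected column may not update df in recent pandas versions
df['Column_Name'] = df['Column_Name'].fillna(value)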
3. Imputation with Mean, Median, or Mode:
- Replace missing values with the mean, median, or mode of the column.
mean_value = df['Column_Name'].mean()  # or .median(), or .mode()[0]
df['Column_Name'] = df['Column_Name'].fillna(mean_value)
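The choice between mean and median matters most when the column has outliers. The sketch below uses made-up values (the series contents are an assumption, not data from this article) to show how a single extreme value pulls the mean fill far away from the typical observations, while the median fill stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one outlier (500) and one missing value.
s = pd.Series([10.0, 12.0, 11.0, np.nan, 500.0], name="Column_Name")

# Mean of the observed values: (10 + 12 + 11 + 500) / 4 = 133.25
mean_filled = s.fillna(s.mean())

# Median of the observed values: midpoint of 11 and 12 = 11.5
median_filled = s.fillna(s.median())

print("mean fill:  ", mean_filled.iloc[3])
print("median fill:", median_filled.iloc[3])
```

For skewed columns the median is usually the safer default; the mean is reasonable when the distribution is roughly symmetric.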
4. Forward Fill (Last Observation Carried Forward — LOCF):
- Replace missing values with the most recent non-missing value.
# fillna(method='ffill') is deprecated since pandas 2.1; use ffill() instead
df['Column_Name'] = df['Column_Name'].ffill()
5. Backward Fill (Next Observation Carried Backward — NOCB):
- Replace missing values with the next available non-missing value.
# fillna(method='bfill') is deprecated since pandas 2.1; use bfill() instead
df['Column_Name'] = df['Column_Name'].bfill()
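To make the difference between the two fill directions concrete, here is a small sketch on an invented series (the values are assumptions for illustration). It also shows the `limit` parameter, which caps how many consecutive gaps a single observation is allowed to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered series with a leading gap, a two-step gap, and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: the leading gap stays missing because nothing precedes it.
ffilled = s.ffill()          # NaN, 1, 1, 1, 4, 4

# Backward fill: the trailing gap stays missing because nothing follows it.
bfilled = s.bfill()          # 1, 1, 4, 4, 4, NaN

# Carry each observation forward by at most one step.
limited = s.ffill(limit=1)   # NaN, 1, 1, NaN, 4, 4

print(pd.DataFrame({"original": s, "ffill": ffilled, "bfill": bfilled, "ffill_limit_1": limited}))
```

Both methods assume the rows are in a meaningful order, such as a time series sorted by timestamp. Chaining `s.ffill().bfill()` is a common way to cover the gaps at both ends.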
6. Interpolation:
- Replace missing values by interpolating between adjacent values.
df['Column_Name'] = df['Column_Name'].interpolate(method='linear')
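One detail worth knowing is that `method='linear'` treats the rows as equally spaced and ignores the index. The sketch below, on a made-up series with an unevenly spaced numeric index (an assumption for illustration), compares it with `method='index'`, which weights the estimate by the actual index positions.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at positions 0, 1 and 4; the middle one is missing.
s = pd.Series([0.0, np.nan, 8.0], index=[0, 1, 4])

# 'linear' ignores the index: the gap is placed halfway between 0 and 8.
linear = s.interpolate(method="linear")

# 'index' uses the index values: position 1 is a quarter of the way from 0 to 4.
by_index = s.interpolate(method="index")

print("linear:", linear.iloc[1])    # 4.0
print("index: ", by_index.iloc[1])  # 2.0
```

For data indexed by timestamps, `method='time'` plays the same role as `method='index'` does here.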
7. Imputation with K-Nearest Neighbors (KNN):
- Replace missing values with values from the K-nearest neighbors in the feature space.
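import pandas as pd
from sklearn.impute import KNNImputer
# KNNImputer works on numeric features only
imputer = KNNImputer(n_neighbors=5)
df_filled = imputer.fit_transform(df)
# Keep the original index so the rows still line up with the rest of the data
df = pd.DataFrame(df_filled, columns=df.columns, index=df.index)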
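As a worked illustration of what the imputer computes, the sketch below runs `KNNImputer` on a tiny invented table (the `height` and `weight` columns and their values are assumptions). With `n_neighbors=2` and the default uniform weighting, the missing weight is the average of the weights of the two rows whose heights are closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data: the weight for the second row is missing.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, np.nan, 70.0, 80.0],
})

# Distances are computed on the features both rows have in common (here: height).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

# The two nearest rows by height are 150 and 170, so the fill is (50 + 70) / 2.
print(filled.loc[1, "weight"])
```

Because the method is distance-based, features on very different scales should generally be standardized before imputation; otherwise the largest-scale feature dominates the neighbor search.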
8. Imputation with Predictive Models:
- Train a predictive model (e.g., regression) to predict missing values based on other features.
from sklearn.linear_model import LinearRegression
# Boolean mask marking the rows where the target column is missing
missing = df['Column_Name'].isna()
# Features are all other columns; they must be numeric and contain no missing values
features = df.drop(columns=['Column_Name'])
# Train a regression model on the rows where the target column is observed
model = LinearRegression()
model.fit(features[~missing], df.loc[~missing, 'Column_Name'])
# Predict the missing values and write them back in place, preserving row order
if missing.any():
    df.loc[missing, 'Column_Name'] = model.predict(features[missing])
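For a self-contained illustration of the same idea, the sketch below builds a hypothetical frame in which `score` is exactly twice `hours` (both names and the data are assumptions), hides one `score`, and recovers it with a regression fitted on the complete rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where score = 2 * hours; one score is missing.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score": [2.0, 4.0, np.nan, 8.0, 10.0],
})

missing = df["score"].isna()      # rows that need a prediction
features = df[["hours"]]          # predictor columns (must be complete)

# Fit on the rows where the target is observed.
model = LinearRegression()
model.fit(features[~missing], df.loc[~missing, "score"])

# Write predictions back into the original frame without reordering rows.
df.loc[missing, "score"] = model.predict(features[missing])

print(df)
```

Filling with a point prediction understates the natural variability of the column, so imputed values will look "too clean" compared with observed ones. That is worth keeping in mind when the filled column later feeds another model.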
9. Imputation with Domain-Specific Values:
- Replace missing values with domain-specific values based on expert knowledge.
df['Column_Name'] = df['Column_Name'].fillna(domain_value)
10. Imputation with Mode for Categorical Data:
- Replace missing values in categorical variables with the mode (most frequent category).
- Use this carefully, because it inflates the share of the most frequent category and can distort the distribution. An alternative is to treat missing values as a separate category of their own.
mode_value = df['Category_Column'].mode().iloc[0]
df['Category_Column'] = df['Category_Column'].fillna(mode_value)
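The alternative of treating missing values as their own category can be sketched as follows. The toy data and the label `"Missing"` are assumptions chosen for illustration; any label that does not collide with a real category would work.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"Category_Column": ["red", "blue", None, "red", None]})

# Option A: mode imputation. "red" grows from 2 of 5 rows to 4 of 5.
mode_value = df["Category_Column"].mode().iloc[0]
mode_filled = df["Category_Column"].fillna(mode_value)

# Option B: keep missingness visible as a separate category.
flagged = df["Category_Column"].fillna("Missing")

print(mode_filled.value_counts())
print(flagged.value_counts())
```

Option B leaves the observed category counts untouched and lets a downstream model learn whether missingness itself carries information.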