Types of Missing Values in Data
Oct 8, 2023
Data can be missing from a dataset for many reasons, and the mechanism behind the missingness determines which strategies for handling and imputing the missing values are appropriate.
Here are some common types of missing data and how to identify them using Python:
1. Missing Completely at Random (MCAR):
- Definition: Data is missing completely at random if the probability of missingness is the same for all observations, and there is no systematic relationship between the missing data and any other variables.
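- Identification: To check for MCAR, create a missingness indicator variable for each column with gaps and test whether that indicator is independent of the other variables in the dataset. A significant association is evidence against MCAR; a non-significant result does not prove MCAR, it only means this particular test found no evidence against it.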
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create a DataFrame with missing data (toy example)
data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5]
})

# Create missingness indicator variables (1 = missing, 0 = observed)
data['A_missing'] = data['A'].isnull().astype(int)
data['B_missing'] = data['B'].isnull().astype(int)

# Chi-squared test: is missingness in 'A' independent of missingness in 'B'?
# Five rows are far too few for a reliable test; this only shows the mechanics.
chi2, p, _, _ = chi2_contingency(pd.crosstab(data['A_missing'], data['B_missing']))
if p < 0.05:
    print("Missingness in 'A' and 'B' is related: evidence against MCAR")
else:
    print("No evidence against MCAR from this test")
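The indicator can also be compared against the observed values of another variable, not only against another indicator. The sketch below is one way to do that, assuming a numeric, fully observed comparison column. The helper name `compare_by_missingness` and the simulated `age` and `income` columns are hypothetical, and Welch's t-test is just one reasonable choice of test.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_by_missingness(df, target, observed):
    """Welch t-test: does `observed` differ between rows where
    `target` is missing and rows where it is present?"""
    is_missing = df[target].isnull()
    stat, p_value = ttest_ind(
        df.loc[is_missing, observed],
        df.loc[~is_missing, observed],
        equal_var=False,  # Welch's version: no equal-variance assumption
    )
    return stat, p_value


# Hypothetical simulated data: about 20% of 'income' is removed
# completely at random, independently of 'age'
rng = np.random.default_rng(0)
n = 500
sim = pd.DataFrame({
    'age': rng.normal(40, 10, n),
    'income': rng.normal(50_000, 8_000, n),
})
sim.loc[rng.random(n) < 0.2, 'income'] = np.nan

stat, p_value = compare_by_missingness(sim, 'income', 'age')
print(f"t = {stat:.3f}, p = {p_value:.3f}")
```

A small p-value would suggest that rows with missing `income` differ in `age` from rows without, which is evidence against MCAR. A large p-value only means this comparison found no such difference.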
2. Missing at Random (MAR):
- Definition: Data is missing at random if the probability of missingness depends on observed data but not on unobserved data.
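- Identification: To look for MAR, examine whether missingness in one variable can be predicted from the observed values of other variables. A clear relationship is evidence against MCAR and is consistent with MAR. Observed data alone cannot confirm MAR, because the same pattern can also arise when missingness depends on the unobserved values themselves.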
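import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Create a DataFrame with missing data (toy example)
data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, 3, 4, np.nan]
})

# Create a binary variable indicating missingness in 'B'
data['B_missing'] = data['B'].isnull().astype(int)

# LogisticRegression cannot handle NaN, so use the fully observed 'A'
# and a missingness indicator for 'C' as predictors
predictors = pd.DataFrame({
    'A': data['A'],
    'C_missing': data['C'].isnull().astype(int)
})
model = LogisticRegression()
model.fit(predictors, data['B_missing'])

# Coefficients are never exactly zero in practice, so compare against a
# threshold (0.1 is an arbitrary value chosen for illustration only)
coefficients = model.coef_[0]
if np.any(np.abs(coefficients) > 0.1):
    print("Missingness in 'B' is related to observed data: consistent with MAR, evidence against MCAR")
else:
    print("No clear relationship between missingness in 'B' and observed data")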
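A five-row table is too small to show what the logistic regression check picks up. The sketch below simulates a larger dataset with a MAR mechanism and compares the fitted coefficient with a shuffled baseline. The variable names, the sample size, and the 1.5 slope in the missingness model are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000

# Hypothetical simulated data: 'A' is fully observed, 'B' is correlated with 'A'
A = rng.normal(0, 1, n)
B = 2 * A + rng.normal(0, 1, n)

# MAR mechanism: the chance that 'B' is missing rises with the observed 'A'
prob_missing = 1 / (1 + np.exp(-1.5 * A))
B_missing = (rng.random(n) < prob_missing).astype(int)

sim = pd.DataFrame({'A': A, 'B': np.where(B_missing == 1, np.nan, B)})

# Regress the missingness indicator on the observed predictor
model = LogisticRegression()
model.fit(sim[['A']], sim['B'].isnull().astype(int))
coef_mar = model.coef_[0][0]

# Baseline: shuffle the indicator so it carries no information about 'A'
shuffled = rng.permutation(B_missing)
baseline = LogisticRegression().fit(sim[['A']], shuffled)
coef_baseline = baseline.coef_[0][0]

print(f"coefficient on A (MAR mechanism):     {coef_mar:.2f}")
print(f"coefficient on A (shuffled baseline): {coef_baseline:.2f}")
```

The first coefficient should come out clearly positive and the shuffled one close to zero. Comparing against a baseline like this is more informative than checking whether a coefficient is exactly zero.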
3. Missing Not at Random (MNAR):
- Definition: Data is missing not at random if the probability of missingness depends on unobserved data or the missing values themselves.
- Identification: Identifying MNAR can be challenging, as it often involves making assumptions about the underlying data-generating process. You may need domain knowledge or sensitivity analysis to address MNAR.
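One simple form of sensitivity analysis is a delta adjustment: impute the missing values under a range of assumptions about how far they sit from the observed ones, then see how much the estimate moves. The sketch below is a minimal illustration under stated assumptions, not a full method. The helper `delta_adjusted_mean`, the simulated `income` series, and the chosen delta values are hypothetical.

```python
import numpy as np
import pandas as pd


def delta_adjusted_mean(series, delta):
    """Fill missing values with (observed mean + delta) and
    return the mean of the completed series."""
    observed_mean = series.mean()  # pandas skips NaN by default
    return series.fillna(observed_mean + delta).mean()


# Hypothetical simulated MNAR data: larger values are more likely to be missing
rng = np.random.default_rng(1)
values = rng.normal(50, 10, 1000)
prob_missing = 1 / (1 + np.exp(-(values - 55) / 5))
income = pd.Series(np.where(rng.random(1000) < prob_missing, np.nan, values))

# delta = 0 is plain mean imputation; larger deltas assume the
# missing values are higher than the observed ones
for delta in [0, 5, 10, 15]:
    estimate = delta_adjusted_mean(income, delta)
    print(f"delta = {delta:>2}: estimated mean = {estimate:.2f}")
```

If the conclusion of an analysis changes across plausible values of delta, the result is sensitive to the MNAR assumption, and domain knowledge is needed to decide which range of deltas is realistic.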