Weight of Evidence (WoE) Encoding — with Python Code
Weight of Evidence (WOE) quantifies the strength of the relationship between a categorical independent variable (predictor) and a binary target variable (response). For each category, it takes the logarithm of the ratio between the share of positive (1) outcomes and the share of negative (0) outcomes falling into that category, so it measures how well the category separates the two classes.
The formula for calculating the Weight of Evidence (WoE) for a category or group within a categorical variable is as follows:
WOE(category) = ln( (positives in category / total positives) / (negatives in category / total negatives) )
where "total positives" and "total negatives" are the overall counts of rows with target 1 and target 0, respectively.
The WOE value can be positive or negative:
- If WOE>0, it indicates that the category is associated with a higher likelihood of the positive event (good outcome).
- If WOE<0, it indicates that the category is associated with a higher likelihood of the negative event (bad outcome).
- If WOE=0, it suggests that the category has no discriminatory power between the positive and negative events.
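For instance, suppose a category contains 30 of a dataset's 100 positive cases but only 10 of its 100 negative cases (hypothetical counts, used only to illustrate the formula). Then WOE = ln((30/100) / (10/100)) = ln(3) ≈ 1.10, a positive value; a category holding 10 of the positives and 30 of the negatives would instead get WOE ≈ -1.10.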
When to Use WOE:
- Binary Classification Problems: WOE is most commonly used in binary classification problems where you have a binary target variable (0 or 1) and you want to assess the predictive power of categorical independent variables (features) on this binary target.
- Categorical Variables: WOE is beneficial when dealing with categorical variables with multiple levels or categories. It helps transform these variables into a numeric form that can be directly used in machine learning models like logistic regression.
- Feature Selection: WOE can be used as a feature engineering technique to select the most informative categories within a categorical variable. This helps reduce dimensionality and improve model performance (see the Information Value sketch after this list).
- Handling Missing Values: WOE can be used to handle missing values within categorical variables. You can create a separate category or bin for missing values and calculate its WOE.
- Addressing Class Imbalance: When dealing with imbalanced datasets, especially in credit scoring or fraud detection, WOE can help capture the characteristics of the minority class effectively.
- Collinearity: WOE can be a useful technique to address collinearity issues within categorical variables by grouping similar categories together based on their impact on the target.
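To make the feature-selection point above concrete: in practice each set of WOE values is usually accompanied by the Information Value (IV), which sums the per-category WOE weighted by the gap between the category's share of positives and its share of negatives. The sketch below is illustrative rather than part of the worked examples that follow; the calculate_iv helper, the adjustment term, and the sample data are assumptions made for the demonstration.
import numpy as np
import pandas as pd

def calculate_iv(df, col, target_col, adjustment=0.5):
    # Information Value of one categorical column against a binary target:
    # IV = sum over categories of (pos_share - neg_share) * WOE
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    iv = 0.0
    for _, group in df.groupby(col, observed=True):
        # A small adjustment keeps the log finite when a category
        # contains no positives or no negatives
        pos_share = ((group[target_col] == 1).sum() + adjustment) / total_pos
        neg_share = ((group[target_col] == 0).sum() + adjustment) / total_neg
        iv += (pos_share - neg_share) * np.log(pos_share / neg_share)
    return iv

sample = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
                       'Target': [1, 0, 1, 1, 0, 1]})
print(f"IV of 'Category': {calculate_iv(sample, 'Category', 'Target'):.3f}")
A higher IV means the feature separates the classes better; common rules of thumb treat very small IVs as "not predictive", although the exact cut-offs vary by source.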
Example 1: Simple WOE Calculation
Here, we calculate the WOE for two categories ‘A’ and ‘B’ directly from the data.
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
                     'Target': [1, 0, 1, 1, 0, 1]})

# Count positives and negatives per category, plus the overall totals
category_counts_pos = data[data['Target'] == 1]['Category'].value_counts()
category_counts_neg = data[data['Target'] == 0]['Category'].value_counts()
total_pos = (data['Target'] == 1).sum()
total_neg = (data['Target'] == 0).sum()

# WOE = ln( (positives in category / total positives) / (negatives in category / total negatives) )
woe_A = np.log((category_counts_pos['A'] / total_pos) / (category_counts_neg['A'] / total_neg))
woe_B = np.log((category_counts_pos['B'] / total_pos) / (category_counts_neg['B'] / total_neg))

print(f'WOE for Category A: {woe_A:.2f}')
print(f'WOE for Category B: {woe_B:.2f}')
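On this tiny sample, categories 'A' and 'B' each contain two positive rows and one negative row, so both WOE values come out to 0.00: neither category shifts the odds away from the overall ratio of four positives to two negatives.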
Example 2: WOE Calculation with Binning
Here, we first bin the continuous variable 'Age' into categories and then calculate WOE for each age bin. Because some bins may contain no positives or no negatives, a small adjustment term is added to the counts so the WOE stays finite.
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60],
                     'Target': [1, 0, 1, 0, 1, 0, 0, 1]})

# Create age bins
bins = [0, 35, 45, 55, np.inf]
labels = ['<35', '35-45', '45-55', '55+']
data['Age_Bin'] = pd.cut(data['Age'], bins=bins, labels=labels)

# Calculate WOE for each bin (or category) of a column
def calculate_woe(df, col, target_col, adjustment=0.5):
    category_counts_pos = df[df[target_col] == 1][col].value_counts()
    category_counts_neg = df[df[target_col] == 0][col].value_counts()
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    woe_values = {}
    for category in df[col].value_counts().index:
        # The small adjustment keeps the ratio finite when a bin
        # contains no positives or no negatives
        pos_rate = (category_counts_pos.get(category, 0) + adjustment) / total_pos
        neg_rate = (category_counts_neg.get(category, 0) + adjustment) / total_neg
        woe_values[category] = np.log(pos_rate / neg_rate)
    return woe_values

woe_age = calculate_woe(data, 'Age_Bin', 'Target')
print("WOE values for Age Bins:")
for category, woe in woe_age.items():
    print(f'{category}: {woe:.2f}')
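The dictionary returned by calculate_woe is what the encoding itself uses: each bin is replaced by its WOE value, giving a numeric feature that models such as logistic regression can consume directly. A minimal sketch, continuing from the Example 2 code above (the column name 'Age_Bin_WOE' is just an illustrative choice):
# Replace each age bin with its WOE value to obtain a numeric feature
data['Age_Bin_WOE'] = data['Age_Bin'].map(woe_age).astype(float)
print(data[['Age', 'Age_Bin', 'Age_Bin_WOE']])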
Example 3: WOE Calculation with Missing Values
Here, missing values in the categorical column are given their own 'Missing' category before the WOE is calculated, so the encoding does not silently drop them.
import pandas as pd
import numpy as np

# Sample data with missing values
data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'B', 'A', 'C', np.nan],
    'Target': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
})

# Replace missing values with a placeholder category (e.g., 'Missing')
data['Category'] = data['Category'].fillna('Missing')

# Calculate WOE for each category, including 'Missing'
def calculate_woe(df, col, target_col, adjustment=0.5):
    category_counts_pos = df[df[target_col] == 1][col].value_counts()
    category_counts_neg = df[df[target_col] == 0][col].value_counts()
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    woe_values = {}
    for category in df[col].value_counts().index:
        # The small adjustment keeps the ratio finite when a category
        # contains no positives or no negatives
        pos_rate = (category_counts_pos.get(category, 0) + adjustment) / total_pos
        neg_rate = (category_counts_neg.get(category, 0) + adjustment) / total_neg
        woe_values[category] = np.log(pos_rate / neg_rate)
    return woe_values

woe_category = calculate_woe(data, 'Category', 'Target')
print("WOE values for Categories:")
for category, woe in woe_category.items():
    print(f'{category}: {woe:.2f}')
In this code:
- We start with a DataFrame containing a categorical variable 'Category' and a binary target variable 'Target'. One value in the 'Category' column is missing and is represented as np.nan.
- We replace the missing values with a placeholder category 'Missing' using fillna().
- We then calculate the WOE for each category, including 'Missing', using the calculate_woe function.
- The calculate_woe function computes the WOE of each category from the number of positive and negative target instances that fall into it, relative to the overall totals of positives and negatives.
- Finally, we print out the WOE values for each category, including the one representing missing values.
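One detail worth noting: the 'Missing' category here contains a single row, and that row is a positive, so without the small adjustment term in calculate_woe its WOE would be infinite. WOE estimates from very small groups like this are unstable, and in practice such sparse categories are often merged into larger bins before the encoding is used.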