Understanding Pearson Correlation

Pearson correlation measures the strength and direction of the linear relationship between two variables.

Oct 3, 2023

Pearson correlation, also known as Pearson’s correlation coefficient, measures the linear relationship between two continuous variables.

It quantifies the degree to which two variables change together, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
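
As a quick illustration of these three reference points, here is a minimal sketch using NumPy's np.corrcoef. The arrays are made-up illustrative data, chosen so that each case comes out exactly (up to floating-point rounding):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises exactly linearly with x, so r should be 1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls exactly linearly with x, so r should be -1
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# a symmetric pattern with no linear trend, so r should be 0
r_zero = np.corrcoef(x, np.array([1.0, 0.0, -2.0, 0.0, 1.0]))[0, 1]

print(f"perfect positive: {r_pos:.2f}")
print(f"perfect negative: {r_neg:.2f}")
print(f"no linear trend:  {abs(r_zero):.2f}")
```

np.corrcoef returns a 2x2 correlation matrix, so the coefficient between the two inputs is the off-diagonal entry at [0, 1].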

Formula:
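
Written out in the form that the Python code in this section computes, with x̄ and ȳ denoting the sample means and n the number of paired observations, the coefficient is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```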

To compute the Pearson correlation coefficient in Python, you can use libraries like NumPy or SciPy.

Here’s the formula implemented step by step in Python code:

import numpy as np

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])

# Calculate Pearson correlation using the formula
mean_x = np.mean(x)
mean_y = np.mean(y)

numerator = np.sum((x - mean_x) * (y - mean_y))
denominator_x = np.sqrt(np.sum((x - mean_x) ** 2))
denominator_y = np.sqrt(np.sum((y - mean_y) ** 2))

pearson_corr = numerator / (denominator_x * denominator_y)

# Print the Pearson correlation coefficient
print(f"Pearson Correlation Coefficient: {pearson_corr:.2f}")

In this code:

We start with sample data in the arrays x and y.
We calculate the means of x and y using np.mean.
We compute the numerator, which is the sum of the products of the paired deviations of x and y from their means.
We calculate the denominators, which are the square roots of the sums of the squared deviations of x and y from their means.
Finally, we calculate the Pearson correlation coefficient using the formula and print the result.

You can replace the x and y arrays with your own dataset to compute the Pearson correlation coefficient for your specific data.
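
When you swap in your own data, it can be reassuring to check the hand-written formula against the library routines. The sketch below compares the manual calculation with np.corrcoef and scipy.stats.pearsonr on the same sample data; the variable names are only illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])

# Manual formula: sum of paired deviations over the product of their norms
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

# pearsonr returns the coefficient together with a p-value
r_scipy, p_value = pearsonr(x, y)

print(f"manual: {r_manual:.2f}, numpy: {r_numpy:.2f}, scipy: {r_scipy:.2f}")
```

For this sample data all three approaches agree on a coefficient of 0.50.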

Interpretation:

Interpreting Pearson correlation involves understanding the strength and direction of the linear relationship between two variables.

Here’s a Python code example that calculates the Pearson correlation coefficient and interprets its direction from the sign:

import numpy as np
from scipy.stats import pearsonr

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])

# Calculate Pearson correlation
pearson_corr, _ = pearsonr(x, y)

# Interpretation
if pearson_corr > 0:
    interpretation = "There is a positive linear relationship between x and y."
elif pearson_corr < 0:
    interpretation = "There is a negative linear relationship between x and y."
else:
    interpretation = "There is no linear relationship between x and y."

# Print the correlation coefficient and interpretation
print(f"Pearson Correlation Coefficient: {pearson_corr:.2f}")
print(f"Interpretation: {interpretation}")

In this code:

We calculate the Pearson correlation coefficient between the sample data x and y.

We then interpret the correlation result based on its sign:

If pearson_corr is positive, it indicates a positive linear relationship.
If pearson_corr is negative, it indicates a negative linear relationship.
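If pearson_corr is exactly zero, it indicates no linear relationship. With real floating-point data an exact zero is rare, so a value close to zero is better read as a very weak linear relationship.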

Finally, we print the correlation coefficient and interpretation.

You can replace the x and y arrays with your own dataset to interpret the Pearson correlation coefficient for your specific data.
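
The sign only tells you the direction. To also describe strength, one option is to bucket the absolute value of the coefficient. The sketch below does that with a hypothetical helper, describe_correlation; the cutoffs 0.3 and 0.7 are illustrative assumptions rather than a universal standard, and different fields use different thresholds. It also keeps the p-value that pearsonr returns instead of discarding it:

```python
import numpy as np
from scipy.stats import pearsonr


def describe_correlation(r, weak=0.3, strong=0.7):
    """Return a rough verbal label for a correlation coefficient.

    The cutoffs `weak` and `strong` are illustrative assumptions,
    not universal standards.
    """
    if np.isclose(r, 0.0):
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < weak:
        strength = "weak"
    elif size < strong:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction} linear relationship"


x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])

# Keep the p-value this time instead of throwing it away
r, p_value = pearsonr(x, y)
label = describe_correlation(r)
print(f"r = {r:.2f}, p-value = {p_value:.3f}: {label}")
```

For the sample data the label is "moderate positive linear relationship". With only five points the p-value is well above the common 0.05 threshold, so this estimate should be treated with caution.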

Advantages of Pearson Correlation:

Easily interpretable: The Pearson correlation coefficient is straightforward to interpret, as it measures the strength and direction of a linear relationship between variables. It is the most commonly used correlation coefficient and is well-understood in statistics.
Sensitive to linear relationships: Its magnitude reflects the strength of a linear relationship between variables, and its sign reflects the direction.

Disadvantages of Pearson Correlation:

Limited to linear relationships: Pearson correlation assumes a linear relationship, so it may not capture nonlinear associations between variables effectively.
Sensitive to outliers: Outliers can have a significant impact on Pearson correlation, potentially leading to misleading results.
Requires continuous data: It is suitable for continuous variables and may not work well with categorical or ordinal data.
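
The first two limitations are easy to see on small synthetic examples. The sketch below uses made-up data (the arrays and the outlier value of -50 are assumptions for illustration): a perfect but nonlinear relationship that Pearson correlation misses entirely, and an exact linear relationship that a single outlier distorts:

```python
import numpy as np

# 1) A perfect but nonlinear relationship: y = x^2 on a symmetric range
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]

# 2) An exact linear relationship, before and after one outlier
x_lin = np.arange(1.0, 11.0)   # 1, 2, ..., 10
y_lin = 2 * x_lin              # exactly linear in x_lin
r_clean = np.corrcoef(x_lin, y_lin)[0, 1]

y_out = y_lin.copy()
y_out[-1] = -50.0              # hypothetical corrupted measurement
r_outlier = np.corrcoef(x_lin, y_out)[0, 1]

print(f"y = x^2:          r = {abs(r_nonlinear):.2f}")
print(f"linear, clean:    r = {r_clean:.2f}")
print(f"linear + outlier: r = {r_outlier:.2f}")
```

In the first case y is completely determined by x, yet the coefficient is zero because the relationship is not linear. In the second case, changing one value out of ten is enough to pull the coefficient from 1 down to a negative value.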