Outlier Detection Techniques

5 Top Techniques for Outlier Detection

1. Z-Score:

The Z-Score, also known as the standard score, is a statistical measure that quantifies how far a data point is from the mean (average) of a dataset in terms of standard deviations: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.

It is a way to standardize data, making it easier to compare data points from different datasets or to identify outliers.

Points with a large absolute Z-score (commonly |z| > 3) are considered outliers.

from scipy import stats

# Standardize the data: z = (x - mean) / std
z_scores = stats.zscore(data)

# Flag points more than 3 standard deviations from the mean
outliers = (z_scores > 3) | (z_scores < -3)
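
For example, a minimal self-contained sketch; the synthetic data (100 points around 50 plus one injected value of 500) is made up for illustration:

import numpy as np
from scipy import stats

# Made-up sample: 100 points around 50, plus one extreme value
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=100), 500)

z_scores = stats.zscore(data)
outliers = (z_scores > 3) | (z_scores < -3)

print(data[outliers])  # the injected value 500 should be flagged

One caveat: an extreme point inflates the standard deviation itself, so in very small samples even a glaring outlier may never reach |z| > 3.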

2. IQR (Interquartile Range):

The IQR method flags points that fall more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3), where the IQR is the range between Q1 and Q3.

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)

IQR = Q3 - Q1

# Flag points beyond the 1.5 * IQR fences on either side of the quartiles
outliers = (data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)
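
As a quick illustration, a minimal sketch with a made-up pandas Series (the values are illustrative only):

import pandas as pd

data = pd.Series([12, 14, 13, 15, 14, 13, 16, 12, 90])

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

outliers = (data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)
print(data[outliers])  # the value 90 should be flagged

Because quartiles are robust to extreme values, this method holds up even in small samples where a single outlier would distort the mean and standard deviation.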

3. Isolation Forest:

Isolation Forest is an anomaly detection algorithm used in machine learning and data science to identify outliers or anomalies in datasets.

It is a tree-based algorithm that isolates anomalous data points by recursively partitioning the data with random splits.

Developed by Liu, Ting, and Zhou in 2008, the Isolation Forest algorithm is based on the idea that anomalies are few and different, which makes them easier to isolate than normal points.

Here’s how the Isolation Forest algorithm works:

  1. Random Partitioning: The algorithm randomly selects a feature and a random value within the range of that feature to create a partition or split in the data.
  2. Recursive Partitioning: It recursively repeats the random partitioning process, creating a tree-like structure. Each split creates two new branches in the tree.
  3. Isolation of Anomalies: Anomalies are data points that are isolated and require fewer splits to be separated from the rest of the data. This means they are typically located closer to the root of the tree and have shorter path lengths.
  4. Scoring: The algorithm assigns a score to each data point based on its average path length across the trees, i.e. the number of splits required to isolate it. Anomalies have shorter path lengths, indicating that they are easier to separate from the majority of the data.
  5. Thresholding: A threshold on this score determines which data points are considered anomalies. Points whose path lengths fall below the threshold are classified as anomalies, while the rest are considered normal. (The original paper converts path lengths into an anomaly score between 0 and 1, where values close to 1 indicate anomalies; the sketch after the code below shows scikit-learn's version of this score.)

from sklearn.ensemble import IsolationForest

# contamination is the assumed fraction of outliers in the data
clf = IsolationForest(contamination=0.05)
outliers = clf.fit_predict(data)  # -1 = outlier, 1 = inlier
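
To tie the scoring and thresholding steps to code, here is a minimal sketch on made-up 2-D data; note that scikit-learn's score_samples returns values where lower means more anomalous:

import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up data: a dense blob plus three far-away points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               [[8, 8], [-9, 7], [10, -10]]])

clf = IsolationForest(contamination=0.05, random_state=0)
labels = clf.fit_predict(X)    # -1 = outlier, 1 = inlier
scores = clf.score_samples(X)  # lower = more anomalous

print(X[labels == -1])  # the three far-away points should be among these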

4. Local Outlier Factor (LOF):

Local Outlier Factor (LOF) is an unsupervised anomaly detection algorithm used in machine learning and data science. It is designed to identify local anomalies or outliers within a dataset by considering the density of data points in their local neighborhoods.

LOF is particularly useful for detecting anomalies in datasets where the density of data points varies across different regions.

Here’s how the Local Outlier Factor (LOF) algorithm works:

  1. Local Density Estimation: LOF calculates a local density measure for each data point in the dataset. This density measure is based on the distances between a data point and its k-nearest neighbors. The local density is a way of estimating how crowded or sparse the neighborhood of each data point is.
  2. Density Ratio: For each data point, LOF computes the local outlier factor as the ratio of the average local density of its k-nearest neighbors to its own local density. A data point is considered an outlier if this ratio is significantly greater than 1. In other words, an outlier is a point that lies in a less dense region than its neighbors.
  3. Scoring: LOF assigns an anomaly score to each data point, with higher scores indicating potential outliers. Data points with LOF scores significantly above 1 are considered outliers, while scores close to 1 indicate points whose density matches that of their neighborhood.

Key features of the Local Outlier Factor (LOF) algorithm include:

  • It is a density-based algorithm, which makes it suitable for datasets with varying density.
  • LOF is sensitive to the local context of data points, allowing it to identify anomalies in regions with different densities.
  • It can handle high-dimensional data but may require tuning of the hyperparameter k (the number of nearest neighbors).

LOF measures the local deviation of a data point with respect to its neighbors. Points with a high LOF are considered outliers.

from sklearn.neighbors import LocalOutlierFactor

# Compare each point's density to that of its 20 nearest neighbors
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
outliers = clf.fit_predict(data)  # -1 = outlier, 1 = inlier
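
As a concrete illustration of the scoring above, a minimal sketch on made-up data; scikit-learn exposes the negated LOF through the negative_outlier_factor_ attribute, so values far below -1 correspond to LOF scores far above 1:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up data: a tight cluster plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[5, 5]]])

clf = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = clf.fit_predict(X)  # -1 = outlier, 1 = inlier

print(clf.negative_outlier_factor_[labels == -1])  # far below -1 for outliers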

5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a density-based clustering algorithm used in machine learning and data science. Its primary purpose is to group similar data points together while also identifying noise or outliers in a dataset. DBSCAN is particularly useful when clusters have irregular shapes, though a single global eps value can struggle when densities vary widely across clusters.

DBSCAN groups together points that are close to each other and marks points in low-density regions as outliers.

from sklearn.cluster import DBSCAN

clf = DBSCAN(eps=0.5, min_samples=5)
labels = clf.fit_predict(data)  # cluster labels; -1 marks noise points
outliers = labels == -1
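
A minimal end-to-end sketch with made-up data (two dense blobs plus two isolated points):

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense blobs and two isolated noise points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
               rng.normal(3, 0.2, size=(50, 2)),
               [[10, 10], [-8, 6]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(X[labels == -1])  # the isolated points should be labeled as noise

Unlike the other methods above, DBSCAN has no contamination parameter: how many points end up as noise follows entirely from eps and min_samples.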
