Code Craft #3 - Anomaly Detection with Isolation Forests
Welcome to another installment of Code Craft! In this post, we'll explore anomaly detection using Isolation Forests. Anomaly detection matters in domains such as fraud detection, network security, and quality control. Today, we'll cover the theory behind Isolation Forests, perform exploratory data analysis, build and train a model, and visualize the results using Python.
Anomaly detection involves identifying data points that deviate significantly from the norm. These outliers can indicate rare events, errors, or fraudulent activities. Anomaly detection helps maintain data quality and detect unusual patterns in various applications.
The Isolation Forest algorithm, introduced by Liu, Ting, and Zhou in 2008, is a popular method for anomaly detection. It operates on the principle that anomalies are few and different, which makes them easier to isolate. It builds an ensemble of isolation trees (iTrees), each of which partitions the data with random splits. Anomalies, being distinct, need fewer splits to be isolated, so they end up with shorter path lengths in the trees. Averaging the path length of a point across all trees gives its anomaly score: the shorter the average path, the more anomalous the point.
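To make the path-length idea concrete, here is a minimal sketch in plain Python. It is not the scikit-learn implementation: it assumes 1-D data and simply draws a random split between the current minimum and maximum, keeping whichever side contains the target point. The helper names `isolation_depth` and `mean_depth` and the toy data are made up for illustration, and the real algorithm also subsamples the data and caps the tree height.

```python
import random

def isolation_depth(values, target, rng):
    """Count the random splits needed to isolate `target` in a 1-D list."""
    depth = 0
    current = list(values)
    while len(current) > 1 and min(current) < max(current):
        split = rng.uniform(min(current), max(current))
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, trials=200, seed=0):
    """Average the isolation depth over many independent random trials."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials

# Toy data: a tight cluster around 50 plus one far-away point.
data_rng = random.Random(1)
cluster = [data_rng.gauss(50, 5) for _ in range(200)]
outlier = 120.0
data = cluster + [outlier]

inlier = sorted(cluster)[len(cluster) // 2]  # a point near the middle of the cluster
outlier_depth = mean_depth(data, outlier)
inlier_depth = mean_depth(data, inlier)
print(f'outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits')
```

The far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the signal an Isolation Forest turns into an anomaly score.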
To demonstrate anomaly detection, we'll generate a synthetic dataset of salaries with some anomalies.
import numpy as np
import pandas as pd
# Set the random seed for reproducibility
np.random.seed(42)
# Generate normal salary data
normal_salaries = np.random.normal(loc=50000, scale=15000, size=1000)
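# Introduce anomalies: 10 very high salaries and 10 low ones.
# Note: 10,000-20,000 is only about 2 to 2.7 standard deviations below the mean,
# so the low group overlaps the tail of the normal data.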
anomalies = np.random.uniform(low=200000, high=300000, size=10)
anomalies = np.append(anomalies, np.random.uniform(low=10000, high=20000, size=10))
# Combine normal salaries and anomalies
salaries = np.concatenate([normal_salaries, anomalies])
# Create a DataFrame
df = pd.DataFrame({'salary': salaries})
# Save to a CSV file
df.to_csv('salary.csv', index=False)
# Display the first few rows of the dataset
df.head(10)
Before building our model, let's perform some exploratory data analysis to understand the distribution of our salary data.
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('salary.csv')
# Violin plot to visualize salary distribution
sns.violinplot(x=df['salary'])
plt.title('Violin Plot of Salary Data')
plt.show()
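A numeric summary can be read alongside the violin plot. The sketch below rebuilds the same synthetic salaries so it runs on its own (the variable names `high`, `low`, and `normal_below_20k` are introduced here for illustration) and counts how many of the regular draws fall inside the range used for the low anomalies.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic salaries so this block runs on its own.
np.random.seed(42)
normal_salaries = np.random.normal(loc=50000, scale=15000, size=1000)
high = np.random.uniform(low=200000, high=300000, size=10)
low = np.random.uniform(low=10000, high=20000, size=10)
df = pd.DataFrame({'salary': np.concatenate([normal_salaries, high, low])})

# Numeric summary with a few extra percentiles for the tails.
summary = df['salary'].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
print(summary)

# How many regular draws fall at or below the top of the injected low range?
normal_below_20k = int((normal_salaries <= 20000).sum())
print('regular salaries <= 20000:', normal_below_20k)
```

If that count is well above zero, the low "anomalies" are hard to tell apart from ordinary low salaries, and any detector will struggle to separate them from the tail of the normal data.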
Now, let's build and train our Isolation Forest model using the IsolationForest class from scikit-learn.
from sklearn.ensemble import IsolationForest
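# Define and fit the model.
# contamination=0.1 asks the model to flag roughly 10% of rows, although only
# 20 of the 1,020 rows (about 2%) were injected as anomalies.
model = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0, random_state=42)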
model.fit(df[['salary']])
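The contamination parameter sets the share of rows the model will label as anomalies, so it is worth checking against what we know about the data. The sketch below is a rough comparison, not a tuning recipe: it refits the model with the value used above (0.1) and with 0.02, an assumed value chosen here because 20 of 1,020 rows were injected, and counts the flagged rows in each case.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Rebuild the synthetic salaries so this block runs on its own.
np.random.seed(42)
normal_salaries = np.random.normal(loc=50000, scale=15000, size=1000)
high = np.random.uniform(low=200000, high=300000, size=10)
low = np.random.uniform(low=10000, high=20000, size=10)
df = pd.DataFrame({'salary': np.concatenate([normal_salaries, high, low])})

injected_share = 20 / len(df)  # 20 injected rows out of 1,020

# Count flagged rows for each contamination setting.
flagged = {}
for contamination in (0.1, 0.02):
    model = IsolationForest(n_estimators=100, contamination=contamination, random_state=42)
    labels = model.fit_predict(df[['salary']])  # -1 = anomaly, 1 = normal
    flagged[contamination] = int((labels == -1).sum())

print(f'injected share: {injected_share:.3f}')
print(flagged)
```

With contamination=0.1 the model should flag roughly a tenth of the rows, several times more than the 20 injected points. On real data the true share is usually unknown, so this value is an estimate you have to justify rather than a default to copy.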
Once our model is trained, we can use it to predict anomalies. The predict method returns a label for each data point: -1 for an anomaly and 1 for a normal point.
# Predict anomalies
df['anomaly'] = model.predict(df[['salary']])
df['anomaly'] = df['anomaly'].map({-1: 'Anomaly', 1: 'Normal'})
df.head(10)
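If you want a continuous score rather than a hard label, scikit-learn's IsolationForest also provides decision_function, where negative values correspond to the rows that predict labels as -1. The sketch below rebuilds the data, uses the library defaults for the parameters not shown, and adds two illustrative columns, `label` and `score`, to list the most anomalous rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Rebuild the synthetic salaries so this block runs on its own.
np.random.seed(42)
normal_salaries = np.random.normal(loc=50000, scale=15000, size=1000)
high = np.random.uniform(low=200000, high=300000, size=10)
low = np.random.uniform(low=10000, high=20000, size=10)
df = pd.DataFrame({'salary': np.concatenate([normal_salaries, high, low])})

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
model.fit(df[['salary']])

labels = model.predict(df[['salary']])            # hard labels: -1 or 1
scores = model.decision_function(df[['salary']])  # continuous: negative means flagged

df['label'] = labels
df['score'] = scores

# The lowest scores are the rows the model finds easiest to isolate.
print(df.nsmallest(5, 'score'))
```

Sorting by score lets you review the most extreme rows first instead of treating every flagged point as equally suspicious.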
Finally, let's visualize our results to see where the anomalies lie in our data.
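# Visualize anomalies. Map each label to a color explicitly: a plain list of
# colors is assigned in order of first appearance, which can paint anomalies blue.
sns.scatterplot(x=df.index, y='salary', hue='anomaly', data=df, palette={'Anomaly': 'red', 'Normal': 'blue'})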
plt.title('Anomalies in Salary Data')
plt.show()
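Because the dataset is synthetic, we know which rows were injected and can check the flags against them. The sketch below is one possible sanity check, assuming the injected rows are the last 20 in the DataFrame (as in the generation code above); the names `is_injected`, `precision`, and `recall` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Rebuild the synthetic salaries so this block runs on its own.
np.random.seed(42)
normal_salaries = np.random.normal(loc=50000, scale=15000, size=1000)
high = np.random.uniform(low=200000, high=300000, size=10)
low = np.random.uniform(low=10000, high=20000, size=10)
df = pd.DataFrame({'salary': np.concatenate([normal_salaries, high, low])})

# Ground truth: the injected anomalies are the last 20 rows.
is_injected = np.zeros(len(df), dtype=bool)
is_injected[len(normal_salaries):] = True

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
is_flagged = model.fit_predict(df[['salary']]) == -1

true_positives = int((is_flagged & is_injected).sum())
flagged_count = int(is_flagged.sum())

# Precision: share of flagged rows that were injected.
# Recall: share of injected rows that were flagged.
precision = true_positives / flagged_count if flagged_count else 0.0
recall = true_positives / int(is_injected.sum())

print(f'flagged: {flagged_count}, injected and flagged: {true_positives}')
print(f'precision: {precision:.2f}, recall: {recall:.2f}')
```

Expect precision to be low with contamination=0.1, since the model is asked to flag far more rows than were injected. On real data there is usually no such ground truth, so this kind of check is mostly useful for understanding how the model behaves before applying it.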
In this post, we've covered the theory behind Isolation Forests, generated a synthetic dataset, performed exploratory data analysis, built and trained an Isolation Forest model, and visualized the anomalies. Isolation Forests are a useful tool for detecting anomalies, though the results depend on choosing a contamination value that reflects the expected share of anomalies. We hope this tutorial helps you understand and apply the technique in your own projects.