May 26, 2024

Code Craft #2 - Building Recommender Systems from Scratch with Surprise Library!

Welcome to another installment of Code Craft, where we dive deep into the fascinating world of AI and Data Science tools. Today, we're focusing on building recommender systems using the powerful Surprise library. Whether you're new to recommender systems or looking to refine your skills, this blog post will provide a comprehensive guide to understanding, implementing, and evaluating different types of recommender systems.

Introduction to Recommender Systems

Recommender systems are algorithms designed to suggest relevant items to users. These items can range from movies, music, books, and articles to products in an e-commerce store. The primary goal is to provide personalized recommendations that enhance user experience and satisfaction.

Types of Recommender Systems

Content-Based Filtering: Recommends items similar to those the user has shown interest in, based on item features.
Collaborative Filtering: Relies on user behavior and preferences. It can be user-based or item-based.
Hybrid Methods: Combine multiple recommendation techniques to leverage the strengths of each method.
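
To make the user-based flavor of collaborative filtering concrete, here is a minimal sketch in plain Python. The ratings dictionary and the helper names (cosine_similarity, most_similar_user) are made up for illustration and are not part of Surprise; the idea is simply that users who rate the same items alike are treated as neighbors.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"Matrix": 5, "Inception": 4, "Titanic": 1},
    "bob":   {"Matrix": 4, "Inception": 5, "Titanic": 2},
    "carol": {"Matrix": 1, "Inception": 2, "Titanic": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def most_similar_user(target):
    """Return the other user whose ratings are closest to the target's."""
    others = (u for u in ratings if u != target)
    return max(others, key=lambda u: cosine_similarity(ratings[target], ratings[u]))

print(most_similar_user("alice"))  # bob
```

Item-based filtering follows the same recipe with the roles swapped: similarities are computed between items, using the users who rated both.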

The Surprise Library

Surprise is a Python scikit for building and analyzing recommender systems dealing with explicit rating data. It's simple yet powerful, making it perfect for our needs.

Installing Surprise

pip install scikit-surprise

Importing Surprise

from surprise import Dataset, Reader
from surprise.model_selection import train_test_split

Loading and Preparing Data

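Surprise provides a few built-in datasets, including MovieLens 100k ('ml-100k'), which we'll use for our demonstration. The first call to Dataset.load_builtin prompts you to download the dataset if it isn't already on disk. Once it's loaded, we split the ratings into a training set (75%) and a test set (25%).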

# Loading the MovieLens dataset
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)
print("Data loaded and split into training and test sets.")
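
If you're curious what a random split like this does conceptually, the following sketch reproduces the idea in plain Python. It is not Surprise's implementation; split_ratings and the toy tuples are hypothetical names used only for illustration.

```python
import random

def split_ratings(ratings, test_size=0.25, seed=42):
    """Shuffle (user, item, rating) tuples and split them into train and test lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: 20 (user, item, rating) tuples
toy_ratings = [(f"u{u}", f"i{i}", (u + i) % 5 + 1) for u in range(4) for i in range(5)]

train, test = split_ratings(toy_ratings)
print(len(train), len(test))  # 15 5
```

Fixing the seed makes the split reproducible. Surprise's train_test_split accepts a random_state argument for the same purpose.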

Building a Collaborative Filtering Model

We'll begin with building a collaborative filtering model using Singular Value Decomposition (SVD). SVD is a matrix factorization technique that's highly effective for collaborative filtering tasks.

Explanation of SVD

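SVD, or Singular Value Decomposition, classically decomposes a matrix into three other matrices. In recommender systems, the name usually refers to a related matrix factorization approach: because most entries of the user-item rating matrix are missing, the model learns a low-dimensional vector of latent factors for every user and every item, fitted only on the observed ratings. This is how Surprise's SVD algorithm works; it trains the factors (plus user and item bias terms) with stochastic gradient descent. A predicted rating is then built from the dot product of a user's and an item's factor vectors, which lets the model capture underlying structure in the data, such as user preferences and item characteristics.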
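
To illustrate the latent-factor idea, here is a simplified sketch of matrix factorization trained with stochastic gradient descent. It is not Surprise's code: it leaves out the user and item bias terms, and the function names, toy ratings, and hyperparameters are assumptions chosen for a tiny example.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Learn user and item factor vectors by SGD over the observed ratings only."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(a * b for a, b in zip(p[u], q[i])))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step with L2 regularization on both factor vectors
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, p, q

def predict(mu, p, q, user, item):
    """Predicted rating: global mean plus the dot product of the two factor vectors."""
    return mu + sum(a * b for a, b in zip(p[user], q[item]))

def rmse(ratings, mu, p, q):
    squared = [(r - predict(mu, p, q, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical ratings as (user, item, rating) tuples
toy = [
    ("alice", "Matrix", 5), ("alice", "Inception", 4), ("alice", "Titanic", 1),
    ("bob", "Matrix", 4), ("bob", "Inception", 5), ("bob", "Titanic", 2),
    ("carol", "Matrix", 1), ("carol", "Inception", 2), ("carol", "Titanic", 5),
    ("dave", "Matrix", 2), ("dave", "Titanic", 4),
]

untrained = rmse(toy, *train_mf(toy, n_epochs=0))
trained = rmse(toy, *train_mf(toy))
print(f"Training RMSE before: {untrained:.3f}, after: {trained:.3f}")
```

The error here is measured on the training ratings, so it only shows that the factors fit the data; held-out evaluation is what tells you whether the model generalizes.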

Building and Evaluating the SVD Model

from surprise import SVD
from surprise import accuracy
from surprise.model_selection import cross_validate

# Creating the SVD model and running 5-fold cross-validation on the full dataset
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Training the model on the trainset
algo.fit(trainset)

# Making predictions on the testset
predictions = algo.test(testset)

# Computing and printing RMSE
accuracy.rmse(predictions)
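
For reference, the two error measures used above are easy to compute by hand. The sketch below works on hypothetical (true rating, estimated rating) pairs; the helper names are ours, not Surprise's.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical (true rating, estimated rating) pairs
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]

print(round(rmse(pairs), 3), mae(pairs))  # 0.612 0.5
```

RMSE penalizes large errors more heavily than MAE, which is why it comes out higher here: the single one-star miss dominates the squared sum.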

Content-Based Recommendation

Content-based recommendation relies on item features to make suggestions. For instance, if a user likes a particular movie, we recommend other movies with similar attributes.

Implementation

We'll create a small dataset of movies with their genres, use TF-IDF vectorization to convert the genre text data into numerical format, and calculate cosine similarity between movies.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Sample movie data
movies = pd.DataFrame([
    {'title': 'The Matrix', 'genre': 'Sci-Fi'},
    {'title': 'Inception', 'genre': 'Sci-Fi'},
    {'title': 'Titanic', 'genre': 'Romance'},
    {'title': 'The Godfather', 'genre': 'Crime'},
    {'title': 'Toy Story', 'genre': 'Animation'}
])

# Vectorizing the genre data
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['genre'])

# Calculating cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get movie recommendations
def get_recommendations(title, cosine_sim=cosine_sim, top_n=3):
    idx = movies.index[movies['title'] == title].tolist()[0]
    # Exclude the query movie explicitly: with tied scores it is not always ranked first
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    movie_indices = [i for i, _ in sim_scores]
    return movies['title'].iloc[movie_indices]

# Getting recommendations for 'The Matrix'
recommended_movies = get_recommendations('The Matrix')
print("Movies recommended for 'The Matrix':", recommended_movies.tolist())
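
You may wonder why the code above calls linear_kernel (a plain dot product) when we want cosine similarity. TfidfVectorizer L2-normalizes each row by default, and for unit-length vectors the dot product and the cosine similarity are the same number. A quick check in plain Python, with made-up vectors and helper names:

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

# Hypothetical feature vectors
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(abs(dot_of_normalized - cosine(a, b)) < 1e-12)  # True
```

If you ever disable normalization in the vectorizer, switch to a true cosine similarity function instead.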

Hybrid Recommendation

Hybrid recommendation systems combine multiple approaches to leverage their strengths. We'll combine collaborative filtering and content-based filtering to improve the quality of recommendations.

Implementation

We'll build a collaborative filtering model using KNN (K-Nearest Neighbors) and combine its results with our content-based recommendations.
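
Before handing the work to Surprise, it helps to see what an item-based KNN prediction boils down to: a similarity-weighted average of the ratings the user gave to the most similar items. The sketch below uses hypothetical data and helper names, and it simplifies details that a library handles for you, such as minimum neighbor counts.

```python
from math import sqrt

# Hypothetical ratings: item -> {user: rating}
item_ratings = {
    "Matrix":    {"alice": 5, "bob": 4, "carol": 1},
    "Inception": {"alice": 4, "bob": 5, "carol": 2, "dave": 5},
    "Titanic":   {"alice": 1, "bob": 2, "carol": 5, "dave": 2},
}

def item_similarity(i, j):
    """Cosine similarity between two items over the users who rated both."""
    common = set(item_ratings[i]) & set(item_ratings[j])
    if not common:
        return 0.0
    dot = sum(item_ratings[i][u] * item_ratings[j][u] for u in common)
    norm_i = sqrt(sum(item_ratings[i][u] ** 2 for u in common))
    norm_j = sqrt(sum(item_ratings[j][u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict(user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k most similar items."""
    neighbors = [
        (item_similarity(item, other), user_ratings[user])
        for other, user_ratings in item_ratings.items()
        if other != item and user in user_ratings
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in neighbors) / total_sim

estimate = predict("dave", "Matrix")
print(round(estimate, 2))
```

Dave hasn't rated "Matrix", but he loved "Inception", which other users rate much like "Matrix", so the estimate lands closer to his 5 than to the 2 he gave "Titanic".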

from surprise import KNNBasic

# Building a collaborative filtering model using KNNBasic
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNBasic(sim_options=sim_options)

# Training the model
algo.fit(trainset)

# Function to get hybrid recommendations
def hybrid_recommendations(user_id, title, cosine_sim=cosine_sim, algo=algo):
    # Content-based part
    content_recommendations = get_recommendations(title, cosine_sim)

    # Collaborative filtering part
    # Caveat: str(idx) is the row index of our toy DataFrame, not a real MovieLens
    # item ID, so these estimates are placeholders until the titles are mapped to
    # the dataset's own IDs. Raw IDs in ml-100k are strings, hence str(user_id).
    query_idx = movies.index[movies['title'] == title].tolist()[0]
    cf_recommendations = []
    for idx in range(len(movies)):
        if idx != query_idx:
            est = algo.predict(str(user_id), str(idx)).est
            cf_recommendations.append((movies['title'][idx], est))
    cf_recommendations = sorted(cf_recommendations, key=lambda x: x[1], reverse=True)[:3]
    cf_recommendations = [rec[0] for rec in cf_recommendations]

    # Combining recommendations: drop duplicates while keeping the ranking order
    combined_recommendations = list(dict.fromkeys(list(content_recommendations) + cf_recommendations))
    return combined_recommendations

# Getting hybrid recommendations for user 196 and 'The Matrix'
hybrid_recommendations_result = hybrid_recommendations(196, 'The Matrix')
print("Hybrid recommendations for user 196 and 'The Matrix':", hybrid_recommendations_result)
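
Merging two lists, as above, is the simplest way to combine recommenders. A common alternative is to blend the two scores into one ranking. The sketch below is one possible way to do that, with made-up scores, a hypothetical blend_scores helper, and an arbitrary weight alpha that you would tune for your own data.

```python
def blend_scores(content_scores, cf_scores, alpha=0.5, max_rating=5.0):
    """Rank titles by alpha * content similarity + (1 - alpha) * normalized CF rating."""
    titles = set(content_scores) | set(cf_scores)
    blended = {
        t: alpha * content_scores.get(t, 0.0)
           + (1 - alpha) * cf_scores.get(t, 0.0) / max_rating
        for t in titles
    }
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical inputs: cosine similarities in [0, 1] and predicted ratings in [1, 5]
content_scores = {"Inception": 1.0, "Titanic": 0.0, "Toy Story": 0.0}
cf_scores = {"Inception": 3.5, "Titanic": 4.5, "Toy Story": 4.0}

ranking = blend_scores(content_scores, cf_scores)
print(ranking)  # ['Inception', 'Titanic', 'Toy Story']
```

Dividing the predicted rating by the top of the rating scale puts both signals on a comparable 0 to 1 range before they are mixed.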

Making Predictions

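Finally, let's make some predictions. We'll predict the rating a specific user might give to a specific item. Note that algo now refers to the KNNBasic model trained in the previous section, and that raw user and item IDs in this dataset are strings.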

# Predicting the rating for user 196 on item 302
user_id = str(196)
item_id = str(302)
pred = algo.predict(user_id, item_id)
print(f"Predicted rating for user {user_id} on item {item_id}: {pred.est}")
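
Single predictions are rarely the end goal; usually we want the top few items per user. Assuming predictions shaped like the tuples Surprise returns from algo.test (uid, iid, true rating, estimate, details), a small helper can group and rank them. The sample data below is invented for illustration.

```python
from collections import defaultdict

def get_top_n(predictions, n=3):
    """Group predictions by user and keep each user's n highest estimated ratings."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return dict(top_n)

# Hypothetical predictions: (uid, iid, true rating, estimate, details)
sample = [
    ("196", "242", 3.0, 3.9, {}),
    ("196", "302", 4.0, 4.2, {}),
    ("196", "377", 1.0, 2.1, {}),
    ("186", "302", 3.0, 3.4, {}),
]

top = get_top_n(sample, n=2)
print(top["196"])  # [('302', 4.2), ('242', 3.9)]
```

To recommend items a user hasn't seen yet, you would feed this helper predictions for user-item pairs that are absent from the training data rather than predictions on the test set.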

Conclusion

In this blog post, we've built, trained, and evaluated collaborative filtering models with the Surprise library, added a content-based recommender with scikit-learn, and combined the two into a simple hybrid. Recommender systems are a powerful tool to personalize user experience, and with libraries like Surprise, building these systems becomes accessible and efficient.

For more details and to see the practical implementation in action, check out the accompanying video linked below. Don't forget to like, subscribe, and leave your comments or questions. Happy coding!

Alister George Luiz

Data Scientist

Dubai, UAE