Building a Collaborative Filtering Recommendation System Using PySpark on Databricks
Collaborative filtering is a widely used technique in recommendation systems that delivers personalized content to users based on their previous interactions and the preferences of similar users. In this article, we will explore how to build a collaborative filtering recommendation system using PySpark on the Databricks platform. We will go through the steps of setting up the environment, loading data, training the model using the Alternating Least Squares (ALS) algorithm, and generating recommendations.
Introduction to Collaborative Filtering
Collaborative filtering is a method of making automatic predictions about a user's interests by collecting preferences from many users. The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue than the opinion of a randomly chosen person.
There are two main types of collaborative filtering:
- User-based collaborative filtering: This approach recommends items based on the similarity between users. It analyzes user behavior and suggests items liked by similar users.
- Item-based collaborative filtering: This approach recommends items based on the similarity between items. It suggests items similar to those the user has liked or interacted with before.
In this article, we will use the ALS algorithm, a model-based matrix factorization technique that learns latent factors for both users and items from the rating matrix.
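To build intuition for what ALS learns, here is a minimal NumPy sketch (purely illustrative, with made-up numbers, and not part of the Databricks workflow). It approximates a small rating matrix as the product of a user-factor matrix and an item-factor matrix and performs one ALS-style update for the user factors; Spark's ALS alternates such updates between users and items at scale, with regularization.
import numpy as np
# Toy rating matrix: 3 users x 4 movies, 0 means "not rated" (numbers are made up)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
k = 2  # number of latent factors (the "rank" in ALS)
rng = np.random.default_rng(0)
U = rng.random((3, k))   # user factors
V = rng.random((4, k))   # item factors
# One ALS-style half step: hold V fixed and solve a least-squares problem
# for each user's factors using only the movies that user actually rated
for u in range(R.shape[0]):
    rated = R[u] > 0
    U[u] = np.linalg.lstsq(V[rated], R[u, rated], rcond=None)[0]
# Predicted ratings are dot products of user and item factors
print(np.round(U @ V.T, 2))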
Setting Up the Environment
Initializing a Spark Session
Before we can start building the recommendation system, we need to set up a PySpark environment. We will begin by initializing a Spark session, which is the entry point for any Spark application.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Collaborative filtering").getOrCreate()
The Spark session allows us to create DataFrames, read data, and perform various data transformations necessary for building our model.
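If you want to confirm the session is working before moving on, a quick check like the one below (with throwaway column names chosen just for this example) is enough:
# Quick sanity check: the session is active and can build a DataFrame
print(spark.version)
spark.createDataFrame([(1, "test")], ["id", "value"]).show()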
Loading and Preparing Data
Loading Datasets
In this example, we will work with two datasets: movies.csv and ratings.csv. The movies.csv file contains information about the movies, while the ratings.csv file contains user ratings for different movies. We will load these datasets into PySpark DataFrames.
# Load movies and ratings data
moviesDF = spark.read.options(header="True", inferSchema="True").csv("/FileStore/tables/movies.csv")
ratingsDF = spark.read.options(header="True", inferSchema="True").csv("/FileStore/tables/ratings.csv")
Displaying the DataFrames
To verify that the data has been loaded correctly, we can display the content of the DataFrames.
# Display movies dataframe
display(moviesDF)
# Display ratings dataframe
display(ratingsDF)
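Since the schemas were inferred from the CSV files, it is also worth verifying that the column types came out as expected (integer IDs and a numeric rating column):
# Inspect the inferred schemas
moviesDF.printSchema()
ratingsDF.printSchema()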
Joining Datasets
To prepare the data for the recommendation model, we need to join the ratingsDF with the moviesDF on the movieId column. This gives us access to movie details along with the ratings.
# Join the datasets on 'movieId'
ratings = ratingsDF.join(moviesDF, 'movieId', 'left')
Splitting the Data into Training and Test Sets
Splitting the dataset into training and test sets is crucial for evaluating the performance of our recommendation model. We’ll use an 80/20 split for training and testing.
# Split the data into training and test sets
(train, test) = ratings.randomSplit([0.8, 0.2])
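Note that randomSplit produces a different split on each run; if reproducibility matters, you can pass a seed (42 below is an arbitrary choice):
# Optional: fix the seed so the split is reproducible
(train, test) = ratings.randomSplit([0.8, 0.2], seed=42)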
Counting the Number of Records
We can also count the number of records in the entire dataset, as well as in the training and test sets, to understand the data distribution.
# Count total ratings
ratings.count()
# Count and display the number of training records
print(train.count())
train.show()
# Count and display the number of test records
print(test.count())
test.show()
Building the Recommendation Model
Configuring the ALS Model
The ALS (Alternating Least Squares) algorithm is a popular matrix factorization method used in collaborative filtering. We will configure the ALS model by specifying the columns for users, items, and ratings, along with additional parameters:
- nonnegative=True: Constrains the learned latent factors to be non-negative, which suits ratings that cannot be negative.
- implicitPrefs=False: Uses explicit feedback (actual user ratings) rather than implicit signals such as clicks or views.
- coldStartStrategy="drop": Drops predictions for users or items not seen during training, so NaN values do not distort evaluation.
from pyspark.ml.recommendation import ALS
# Configure the ALS model
als = ALS(userCol="userId",
itemCol="movieId",
ratingCol="rating",
nonnegative=True,
implicitPrefs=False,
coldStartStrategy="drop")
Hyperparameter Tuning
To find the best model parameters, we will use grid search for hyperparameter tuning. We specify ranges for rank (the number of latent factors) and regParam (the regularization parameter).
from pyspark.ml.tuning import ParamGridBuilder
# Set up hyperparameter grid
param_grid = ParamGridBuilder() \
.addGrid(als.rank, [10, 50, 100, 150]) \
.addGrid(als.regParam, [.01, .05, .1, .15]) \
.build()
Defining the Evaluator
The evaluator measures the accuracy of our model using a specified metric. Here, we use RMSE (Root Mean Squared Error), the square root of the average squared difference between predicted and actual ratings; lower values indicate more accurate predictions.
from pyspark.ml.evaluation import RegressionEvaluator
# Define evaluator
evaluator = RegressionEvaluator(
metricName="rmse",
labelCol="rating",
predictionCol="prediction")
Performing Cross-Validation
We use cross-validation to automatically train and evaluate the model across the different hyperparameter combinations. With 16 combinations in the grid and 5 folds, this fits 80 models, so the search can take some time on larger datasets.
from pyspark.ml.tuning import CrossValidator
# Set up cross-validation
cv = CrossValidator(
estimator=als,
estimatorParamMaps=param_grid,
evaluator=evaluator,
numFolds=5)
Training the Model
We fit the ALS model using the training data, performing hyperparameter tuning and cross-validation to find the best model.
# Train the model using cross-validation
model = cv.fit(train)
best_model = model.bestModel
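It is often useful to see which hyperparameters cross-validation picked. One way, sketched below, is to pair avgMetrics (one averaged RMSE per combination in param_grid) with the grid itself; the best model's rank is also exposed directly.
import numpy as np
# The grid combination with the lowest average RMSE won the cross-validation
best_idx = int(np.argmin(model.avgMetrics))
print(param_grid[best_idx])
# The selected number of latent factors
print("Best rank:", best_model.rank)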
Evaluating the Model
After training, we evaluate the best model on the test data using RMSE as the performance metric.
# Evaluate the model on the test data
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)
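Beyond the single RMSE number, it helps to eyeball a few predictions next to the actual ratings:
# Compare predicted vs. actual ratings for a few test rows
test_predictions.select("userId", "movieId", "rating", "prediction").show(5)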
Generating Recommendations
Generating Recommendations for All Users
With the trained model, we can generate movie recommendations for all users. In this case, we request the top 5 recommendations per user.
# Generate top 5 movie recommendations for all users
recommendations = best_model.recommendForAllUsers(5)
Displaying Recommendations
To present the recommendations in a more readable format, we can manipulate the DataFrame to explode the recommendations array into separate rows.
from pyspark.sql.functions import col, explode
# Display the recommendations dataframe
display(recommendations)
# Explode the recommendations array into separate rows
df2 = recommendations.withColumn("movieid_rating", explode("recommendations"))
display(df2)
# Select relevant columns for display
display(df2.select("userId", col("movieid_rating.movieId"), col("movieid_rating.rating")))
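Because the exploded recommendations carry only movie IDs, you can join them back to moviesDF to see the movie details for each recommendation. This sketch assumes the usual MovieLens layout for movies.csv; adjust the column names if your file differs.
# Attach movie details to each recommendation by joining back to moviesDF
recs_with_details = (df2
    .select("userId",
            col("movieid_rating.movieId").alias("movieId"),
            col("movieid_rating.rating").alias("predicted_rating"))
    .join(moviesDF, "movieId", "left"))
display(recs_with_details)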
Conclusion
In this article, we demonstrated how to build a collaborative filtering recommendation system using PySpark on Databricks. We covered data loading, preprocessing, model training, evaluation, and generating recommendations. This approach can be extended and refined to create more sophisticated recommendation systems for various applications.
By leveraging the capabilities of PySpark and the ALS algorithm, we can create scalable and effective recommendation systems for real-world scenarios. PySpark’s integration with Databricks provides a powerful platform for big data analytics and machine learning, enabling us to efficiently handle large-scale datasets and perform complex computations. With collaborative filtering, businesses can deliver personalized recommendations that enhance user engagement and satisfaction.