Recommendations for everyone

There’s something oddly addictive about Amazon’s “Customers who viewed this item also viewed…”. That shiny carousel of endless suggestions is there for one reason: to make me buy something. And I can’t stop myself from scrolling through every single one looking for something interesting.

Recommending products to customers is well-trodden ground by now, so I’m not going to write yet another equation-riddled article that amounts to the recommendation engine equivalent of ‘hello world’. Instead I’ll give a little of the background and high-level explanation, and I’ll link to a good tutorial at the end.

Here’s the problem statement: find groups of related products or content in order to make suggestions to customers. As well as retail websites, we can imagine a vast array of other applications, such as recommending recipes, food, or exercise plans. With some modification the same idea works for advertisements too, as in “people who clicked this advert also clicked these other adverts”.

There are three data sources we’ll end up using:

Information about the content - products, articles, videos, etc. - that we’re going to recommend.
Information about the customers. At the very least, the ability to identify a customer uniquely.
Website activity data. This tells us what pages a person - whether logged in or anonymous - looks at.

Getting activity data from Google Analytics

Activity data is simply data that tells us which pages a user has looked at. We can get this from Google Analytics. In Google Analytics the activity is grouped by a session ID, which is generated by Google for each unique browser session.
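To make the shape of this data concrete, here is a minimal sketch in Python that groups page views by session. The rows and the field names (`session_id`, `page_path`) are made up for illustration; a real Google Analytics export will use its own schema.

```python
from collections import defaultdict

# Hypothetical activity rows: (session_id, page_path).
# These are illustrative values, not the real Google Analytics schema.
ACTIVITY = [
    ("s1", "/products/kettle"),
    ("s1", "/products/toaster"),
    ("s2", "/products/kettle"),
    ("s1", "/products/kettle"),  # repeat view within the same session
    ("s2", "/products/mug"),
]


def pages_by_session(rows):
    """Return {session_id: [pages in first-seen order, repeats dropped]}."""
    sessions = defaultdict(list)
    for session_id, page in rows:
        if page not in sessions[session_id]:
            sessions[session_id].append(page)
    return dict(sessions)


if __name__ == "__main__":
    for session_id, pages in pages_by_session(ACTIVITY).items():
        print(session_id, pages)
```

Each session becomes a small basket of pages, which is all a simple recommendation model needs as input.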

We’ll want to bring this data into Google BigQuery so we can combine it with other data sources and train a recommendation model. To make this easier we’ve written a little script which imports data from Google Analytics into a BigQuery table using the Analytics API - GitHub repository here.

Customer data (or logged-in users)

We don’t strictly need to know about customers. We could build a recommendation model based solely on anonymous user browsing activity, because the activity data at least tells us about multiple page visits for a unique browser session.

But if we can associate logged-in users with the activity data, all the better, because this enables us to track activity across multiple devices and over longer time periods. It also means we can use other information besides activity to gain more insight later on. With Google Analytics it’s possible to enrich activity data with a custom user ID - here’s how.
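As a rough illustration of why the user ID helps, the sketch below merges sessions that share a user ID and falls back to the session ID for anonymous visitors. The row layout and the `user:` / `session:` key prefixes are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical rows: (session_id, user_id or None, page_path).
ROWS = [
    ("s1", "u42", "/products/kettle"),   # logged in on a laptop
    ("s2", "u42", "/products/toaster"),  # same customer, different device
    ("s3", None, "/products/mug"),       # anonymous visitor
]


def pages_by_visitor(rows):
    """Group pages by user ID when known, otherwise by session ID."""
    visitors = defaultdict(set)
    for session_id, user_id, page in rows:
        if user_id is not None:
            key = f"user:{user_id}"
        else:
            key = f"session:{session_id}"
        visitors[key].add(page)
    return dict(visitors)


if __name__ == "__main__":
    for visitor, pages in pages_by_visitor(ROWS).items():
        print(visitor, sorted(pages))
```

Here the two sessions for `u42` collapse into one browsing history, while the anonymous session stays on its own.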

Content data

The recommendation engine needs to answer the question: people who looked at this page looked at which other pages?
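The simplest possible answer to that question is to count how often two pages turn up in the same session. The sketch below does exactly that on made-up sessions. It is a baseline for illustration, not the model we would necessarily deploy, and the function names are our own.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sessions: each one is the set of pages viewed together.
SESSIONS = [
    {"/kettle", "/toaster", "/mug"},
    {"/kettle", "/mug"},
    {"/toaster", "/blender"},
]


def co_view_counts(sessions):
    """For each page, count how often every other page shares a session with it."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for page, other in permutations(pages, 2):
            counts[page][other] += 1
    return counts


def also_viewed(counts, page, top_n=3):
    """Return up to top_n pages most often viewed alongside `page`."""
    # Sort by descending count, then by path so ties are deterministic.
    ranked = sorted(counts[page].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


if __name__ == "__main__":
    counts = co_view_counts(SESSIONS)
    print(also_viewed(counts, "/kettle"))  # ['/mug', '/toaster']
```

Raw counts favour popular pages, which is one reason real systems usually normalise them in some way.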

We can train a model to answer that question using only the activity data which we got from Google Analytics. But we also want a recommendation carousel that includes product names, photos and descriptions, and so we need to know something about what’s on the pages we’re recommending.

The kind of data we need for content depends on the nature of the content. For videos, we probably want a preview, and for clothing we certainly want a picture and price.

The big(query) picture

There are a few approaches to actually training the model itself. In particular there are lots of variations of collaborative filtering being used. The details of these don’t matter very much, because before any model can be useful, we need to set up some infrastructure to get data in and results out.
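For a flavour of what one of these variations looks like, here is a small sketch of item-to-item similarity using cosine similarity over sessions. It is only one of many possible approaches, the data is invented, and a production system would more likely lean on a library than on hand-rolled code like this.

```python
import math
from collections import defaultdict

# Hypothetical input: {session_id: pages viewed in that session}.
SESSIONS = {
    "s1": {"/kettle", "/toaster", "/mug"},
    "s2": {"/kettle", "/mug"},
    "s3": {"/toaster", "/blender"},
}


def item_sessions(sessions):
    """Invert {session: pages} into {page: sessions that viewed it}."""
    index = defaultdict(set)
    for session_id, pages in sessions.items():
        for page in pages:
            index[page].add(session_id)
    return dict(index)


def cosine(a, b):
    """Cosine similarity of two binary vectors, each represented as a set."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def most_similar(sessions, page, top_n=3):
    """Rank other pages by cosine similarity to `page`, dropping zero scores."""
    index = item_sessions(sessions)
    target = index.get(page, set())
    scores = {
        other: cosine(target, viewers)
        for other, viewers in index.items()
        if other != page
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(other, round(score, 3)) for other, score in ranked[:top_n] if score > 0]


if __name__ == "__main__":
    print(most_similar(SESSIONS, "/kettle"))  # [('/mug', 1.0), ('/toaster', 0.5)]
```

Unlike raw co-view counts, the cosine score divides by how often each page is viewed overall, so a page that appears in every session does not automatically dominate the results.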

An architecture for making recommendations

The major components that go into it are:

Syncing Google Analytics data to BigQuery. This needs to run at a chosen time interval, say every day or every 6 hours.
Adding content and customer data to enrich recommendation results.
Training a recommendation model. Choose between any of the various successful approaches here. Model training occurs as part of a bigger analytics pipeline, so it’s easy to experiment with different methods and see what works best.
Recommendations API. This serves all the data required to generate recommendations as part of a website’s content.
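To show how the last two pieces might fit together, here is a sketch of the body a recommendations API could return: the model output is joined with content data so the website has everything it needs to draw a carousel. The lookup tables, field names, and the `recommendations_payload` function are all hypothetical stand-ins for the earlier pipeline stages.

```python
import json

# Hypothetical model output: page -> recommended pages, best first.
RECOMMENDATIONS = {"/kettle": ["/mug", "/toaster", "/discontinued"]}

# Hypothetical content data for the pages we can display.
CONTENT = {
    "/mug": {"name": "Mug", "image": "/img/mug.jpg", "price": "4.00"},
    "/toaster": {"name": "Toaster", "image": "/img/toaster.jpg", "price": "25.00"},
}


def recommendations_payload(page, limit=5):
    """Build the JSON body a carousel could render for `page`."""
    items = []
    for path in RECOMMENDATIONS.get(page, []):
        details = CONTENT.get(path)
        if details is None:
            continue  # no content data, so there is nothing to show
        items.append({"path": path, **details})
    return json.dumps({"page": page, "items": items[:limit]})


if __name__ == "__main__":
    print(recommendations_payload("/kettle"))
```

Recommended pages without content data are skipped rather than returned half-empty, and an unknown page simply yields an empty list of items.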
