A generative model for music track playlists

Brett Vintch
iHeartRadio Tech

--

Our mission at iHeartRadio is to deliver the music you’ll love when you want it. In the data science group, this often translates to a recommendation task over live stations or track playlists, and we employ a number of powerful behavioral and content-based production systems to deliver high-quality personalized experiences for our users. This post is not about those systems. Rather, this post is about a new, efficient technique for creating playlists from scratch using a compact description of our users’ tastes and listening habits.

Our approach is based on Variational Autoencoders, and views station and playlist creation as a generative process. Although every listener is unique, there are a handful of musical dimensions that describe much of our users’ listening habits and musical tastes, and we can learn these dimensions from real listening patterns. Models trained to generate similar patterns succeed in forming useful playlists, and the outcome can be easily manipulated according to user preference. Together, these results hint at new and powerful ways to discover and explore music.

Variational Autoencoders

Autoencoders are a specific type of multilayer artificial neural network in which the output is trained to reconstruct the input. This is accomplished with an objective that minimizes the reconstruction error between the two outermost layers. The central, latent layer is typically smaller than the input and output, and it splits the network into two symmetric sub-networks: the encoder, which transforms the input into the latent space, and the decoder, which generates outputs from latent space representations (Figure 1).

Figure 1. Variational Autoencoder
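As a rough sketch, a minimal autoencoder in Keras might look like the following (the layer sizes here are illustrative, not a description of any particular production architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_items = 5000    # one input/output node per item
latent_dim = 2    # the central, smaller latent layer

# encoder: transforms the input into the latent space
inputs = keras.Input(shape=(n_items,))
encoded = layers.Dense(512, activation="relu")(inputs)
latent = layers.Dense(latent_dim)(encoded)

# decoder: generates a reconstruction of the input from the latent layer
decoded = layers.Dense(512, activation="relu")(latent)
outputs = layers.Dense(n_items, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
# the objective minimizes the error between the two outermost layers
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```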

Variational Autoencoders (VAEs) use the same architecture as an Autoencoder, but they make additional assumptions about the distribution of latent node activations. Specifically, a VAE treats each latent activation as a sample from a normal distribution, and training maximizes a variational lower bound on the likelihood of the data; in practice, the objective combines reconstruction error with a KL divergence penalty that keeps the learned latent distribution close to a unit-normal prior. This penalty constrains the latent space to be smooth, so that similar vectors in the input space map to nearby locations in the latent space (though this is common in linear systems, it is by no means guaranteed in multilayer, nonlinear neural networks), and it encourages the latent space to carry as much information about the input domain as possible. The smooth representation also aids in generation: sampling points along a continuous manifold in the latent space generates a series of novel outputs that are perceived as a logical, continuous transformation from the starting point.
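A minimal VAE sketch in Keras adds the reparameterization trick and the KL penalty on top of the autoencoder above (again, the layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

n_items, latent_dim = 5000, 2

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        # KL divergence from the unit-normal prior; this penalty is what
        # keeps the latent space smooth and centered
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# encoder: maps an input to the parameters of a normal distribution
inputs = keras.Input(shape=(n_items,))
h = layers.Dense(512, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# decoder layers are shared so the decoder can also run on its own
dec_hidden = layers.Dense(512, activation="relu")
dec_output = layers.Dense(n_items, activation="sigmoid")
outputs = dec_output(dec_hidden(z))

vae = keras.Model(inputs, outputs)
# total objective = reconstruction error + the KL term added above
vae.compile(optimizer="adam", loss="binary_crossentropy")

# stand-alone encoder and decoder models for later use
encoder = keras.Model(inputs, z_mean)
latent_inputs = keras.Input(shape=(latent_dim,))
decoder = keras.Model(latent_inputs, dec_output(dec_hidden(latent_inputs)))
```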

The input domain for autoencoders is arbitrary, but common applications stick to narrow classes of images, such as handwritten digits and faces. Due to their popularity, there are many example implementations in Python neural network packages like TensorFlow and Keras (here and here), the latter of which runs on backends such as TensorFlow and Theano.

Variational Autoencoders as playlist generators

Autoencoders are often used for dimensionality reduction, and when viewed from this angle they have a lot in common with recommender system approaches like Matrix Factorization and Restricted Boltzmann Machines. In these scenarios, each input/output node represents an item, and the value of the node for an input vector represents the item’s relevance to a specific user. Indeed, Autoencoders have been used for recommendation systems before; reconstructions are never exact, and reconstructed items with high false-positive errors may be items that a user would like but has not yet been exposed to.
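As a toy sketch of this recommendation-by-reconstruction idea (the track indices are made up, and `autoencoder` stands in for a trained model like the one sketched earlier):

```python
import numpy as np

# one user's interaction vector: 1 = thumbed track, 0 = no feedback
x = np.zeros((1, 5000), dtype="float32")
x[0, [12, 87, 344]] = 1.0            # hypothetical thumbed tracks

x_hat = autoencoder.predict(x)[0]    # the reconstruction is never exact
x_hat[x[0] > 0] = -np.inf            # ignore items the user already has
recommendations = np.argsort(x_hat)[::-1][:10]  # high "false positives"
```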

Extending this analogy to Variational Autoencoders suggests that a VAE trained on user feedback for items could learn a low-dimensional representation of those items, and that this representation could be used to generate novel item lists. Although classic recommender systems can also be used to create coherent lists, there are a few advantages to using a VAE for this purpose. First, the model is explicitly generative, so we do not need to create ad-hoc rules on top of the recommender. Second, the latent space that VAEs create is highly informative and smooth, unlike those created by matrix factorization approaches; this makes sampling from the space intuitive, and makes it easier to traverse and interpolate. Finally, if the encoder and decoder in the VAE are artificial neural networks with at least one hidden layer, then they are theoretically able to learn arbitrarily complex relationships between items across users. This is again in contrast to linear matrix factorization methods, which cannot find these potentially useful nonlinear connections.
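To make the second point concrete: interpolating between two users’ latent vectors should yield a series of coherent playlists that blend their tastes. A minimal sketch, assuming the `encoder` and `decoder` models from the sketch above and two hypothetical user thumb vectors `x_a` and `x_b` of shape (1, 5000):

```python
import numpy as np

z_a = encoder.predict(x_a)[0]
z_b = encoder.predict(x_b)[0]

# walk the straight line between the two latent points; because the
# space is smooth, each intermediate playlist is a plausible blend
for alpha in np.linspace(0.0, 1.0, 5):
    z = (1.0 - alpha) * z_a + alpha * z_b
    scores = decoder.predict(z[None, :])[0]
    playlist = np.argsort(scores)[::-1][:10]
```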

Results

In this post we illustrate playlists that are generated from VAE models trained on our users’ track thumb history. The data set consists of a selection of 100,000 active iHeartRadio users across a sample of 5,000 popular tracks. Each input vector is one user’s thumb-up count across these tracks, normalized so that the maximum value is one (users can thumb a track in multiple contexts, which can result in un-normalized counts that exceed one). The VAEs are constructed with TensorFlow and Keras, use a 90%/10% train/test split, and are trained until reconstruction error on the holdout set stops improving (usually 10–15 epochs). Once the model is trained, playlists are generated by decoding samples from the latent space and sorting the output vector to find the tracks with the highest values. For illustration purposes, we select the top 10 tracks from this list.
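A sketch of this pipeline, with `X` standing in for the (hypothetical) user-by-track thumb count matrix and `vae`, `decoder`, and `latent_dim` as in the VAE sketch above; the training hyperparameters here are illustrative:

```python
import numpy as np
from tensorflow import keras

# normalize each user's thumb counts so the maximum value is one
# (guarding against users with all-zero rows)
X = X / np.maximum(X.max(axis=1, keepdims=True), 1.0)

# 90%/10% split; stop when holdout reconstruction error stops improving
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)
vae.fit(X, X, validation_split=0.1, epochs=50, batch_size=256,
        callbacks=[early_stop])

# generate a playlist: decode a latent sample, sort, keep the top 10
z = np.random.normal(size=(1, latent_dim))
scores = decoder.predict(z)[0]
playlist = np.argsort(scores)[::-1][:10]
```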

Exploring the latent space of a VAE model is particularly convenient when the latent space is limited to only two dimensions. In this case, the effects of each latent dimension can be visualized by plotting results over a grid in the latent space. We use a 3×3 grid and generate playlists for each of the resulting nine points (Figure 2). The central playlist can be thought of as the most likely playlist for our user base, as it is generated from the mode of the latent distribution. Moving horizontally or vertically from the center demonstrates the effect of each latent dimension independent of the other. There appears to be a smooth transition from one extreme to the other, but we’ll leave it to the reader to provide their own musical narrative for these transformations. The corners of the grid show the interactions, both positive and negative, between the two axes.

Figure 2. Generated playlists for a 2-D latent space model
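Generating the grid is a matter of decoding each of the nine latent points; a sketch, assuming the `decoder` model from above (the grid extent of ±2 is an arbitrary choice):

```python
import numpy as np

# (0, 0) is the mode of the unit-normal prior, so the central playlist
# is the "most likely" one for the user base
for z2 in (2.0, 0.0, -2.0):          # rows, top to bottom
    for z1 in (-2.0, 0.0, 2.0):      # columns, left to right
        scores = decoder.predict(np.array([[z1, z2]]))[0]
        playlist = np.argsort(scores)[::-1][:10]
        print((z1, z2), playlist)
```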

Two-dimensional latent spaces are easy to visualize, but these networks may be too constrained to account for the full complexity of user tastes and behaviors. Such a failure would show up as poor reconstruction fidelity in the trained network, and in fact we do see some evidence for this. The left panel of Figure 3 shows the input thumb record along with the output reconstruction for three users, where the top, middle, and bottom rows depict the worst, median, and best reconstructions, respectively. The right panel shows a similar plot for a 10-dimensional latent space model. On average, the more complex model performs 15% better in terms of mean-squared reconstruction error. The downside is that, while they may be more powerful, higher-dimensional models are also more difficult to visualize and explain.

Figure 3. Input reconstruction for 2-D and 10-D models (top row: worst reconstruction; middle row: median; bottom row: best)
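The per-user errors behind Figure 3 can be computed directly from the reconstructions; a sketch, with `X_test` standing in for the holdout set and `vae` for either trained model:

```python
import numpy as np

# per-user mean-squared reconstruction error on the holdout set;
# run once for the 2-D model and once for the 10-D model to compare
X_hat = vae.predict(X_test)
mse = ((X_test - X_hat) ** 2).mean(axis=1)

order = np.argsort(mse)
worst, median, best = order[-1], order[len(order) // 2], order[0]
print("mean reconstruction MSE:", mse.mean())
```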

To assist in exploring the 10-dimensional model, we build a linear classifier to predict track genre from the latent space representation. From a generative perspective, the classifier’s axes then serve as rails upon which we can slide a latent space vector to generate playlists for any particular genre combination. We start by building a data set for the genre classifier by projecting each track into the latent space by itself. That is, for each track we create an artificial input vector that has a value of one for that track and zeros elsewhere. This vector has the same format as the user thumb input that was used to train the VAE, where each element in the vector corresponds to a unique track. The new input is run through the encoder, and the resulting latent space vector is labelled with the track’s genre(s). Finally, with all tracks labelled in this manner, we train the classifier (Figure 4).

Figure 4. Method for creating a classifier for genre in the model latent space
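A sketch of this procedure, assuming the `encoder` model from above and a hypothetical `track_genres` array holding one genre label per track (for simplicity, a single genre rather than multiple):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_tracks = 5000

# one artificial input per track: all zeros except a one for that track
# (an identity matrix, one row per track)
track_inputs = np.eye(n_tracks, dtype="float32")
track_latents = encoder.predict(track_inputs)   # (n_tracks, latent_dim)

# a linear classifier predicts genre from the latent representation
genre_clf = LogisticRegression(max_iter=1000)
genre_clf.fit(track_latents, track_genres)
```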

The simple feed-forward structure of the decoder means that generating new playlists is fast: any random point in the latent space can be passed through the decoder to generate a new playlist on the fly. For the 10-dimensional model we pick points in the latent space according to the genre classifier. That is, we move a point in the latent space along the “rails” defined by the linear genre classifier. If we want a playlist to be a little more Country, we add the Country classification vector to the current latent space point. Figure 5 shows the real-time generation of new playlists, demonstrating both the speed of generation and the ability to morph playlists according to genre. Again, we’ll leave it to the reader to assess the qualitative aspects of these playlists.

Figure 5. Playlist generator based upon a latent space genre classifier
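A sketch of sliding along one such rail, assuming the `genre_clf` and `decoder` models from above (the step size and the “Country” label name are illustrative):

```python
import numpy as np

# each row of the linear classifier's coefficients is a direction in the
# latent space that increases one genre's score: a "rail"
country_idx = list(genre_clf.classes_).index("Country")
rail = genre_clf.coef_[country_idx]
rail = rail / np.linalg.norm(rail)   # unit-length direction

z = np.zeros((1, 10))                # start at the center of the prior
for step in range(5):
    z += 0.5 * rail                  # a little more Country each step
    scores = decoder.predict(z)[0]
    playlist = np.argsort(scores)[::-1][:10]
```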

Future work

This post shows a simple demonstration of VAEs over a small user thumb data set, but it illustrates the larger potential of generative approaches to creating playlists and radio stations. Models can be trained with different types of data, such as user-created playlists and radio spin logs. The genre classifier that we demonstrated here can also be replaced with any number of other musical tags, such as mood, era, or format. These techniques can be used to generate playlists from whole cloth, as we’ve done here, or to morph existing playlists to be “a little more country”, or “more like my friend Joe’s”. Giving users direct control of these levers can facilitate music discovery by allowing them to drift from their normal preferences along clearly sign-posted directions. Giving radio or content producers the same tools could allow stations to walk seamlessly between distinct musical tastes over the course of a set, without jarring transitions. We are actively exploring these avenues, and more.
