Sequential Music Track Skips Prediction using Recurrent Neural Networks
How do music streaming companies know which songs we would like to listen to based on our recent listening history? Tune in to learn about a method based on deep learning.
- [00:00:00.900]Hi everyone.
- [00:00:01.800]My name is [inaudible] and I'm a third
year PhD student in the Department of
- [00:00:06.480]Statistics, working under the
supervision of Dr. [inaudible].
- [00:00:11.100]And today I'm going to talk to
you about sequential music track
- [00:00:15.950]skip prediction using recurrent neural
networks. So let's get right to it.
- [00:00:21.420]We all listen to music in our daily lives,
- [00:00:24.690]and we usually go into these
apps and search for our favorite
- [00:00:29.370]songs. After that,
- [00:00:31.560]their recommendation system
just takes over. Success
- [00:00:36.360]for such streaming companies is
if we do not skip over a song
- [00:00:41.250]that they recommend to us,
- [00:00:43.590]and the goal of these companies
is to minimize such
- [00:00:48.120]manual skips by the users,
- [00:00:50.460]so as to improve the user experience.
- [00:00:54.270]The data for this project
actually comes from Spotify's
- [00:00:58.050]own Sequential Skip
Prediction Challenge.
- [00:01:01.650]I took a reasonable sample
of 250,000 sessions,
- [00:01:05.490]which had information on
around 350,000 tracks.
- [00:01:11.070]And it had two main parts.
The first is the user log,
- [00:01:15.270]which had information on
the exact music tracks
- [00:01:19.320]the user listened to, whether they skipped
those tracks or did not, and
- [00:01:24.840]what action led them to the
current song: whether they selected it
- [00:01:29.340]themselves, or whether they shuffled
and ended up at the song.
- [00:01:33.540]And we also have some other features on
the music tracks, such as their release
- [00:01:38.400]year, how popular they are,
the loudness of the songs,
- [00:01:43.170]et cetera.
- [00:01:44.880]The exact prediction task
is best summarized using this
- [00:01:49.290]graphic: suppose a user liked
a track from Pink Floyd,
- [00:01:54.330]liked a track from Guns N' Roses, and after
that did not like a track from the Beatles.
- [00:01:59.670]Can we predict whether they would
like or dislike a track from
- [00:02:04.410]AC/DC? Now, the methods.
- [00:02:08.370]The class of models used in this
work are known as encoder-decoder
- [00:02:13.290]sequence-to-sequence models,
- [00:02:15.300]which are based on recurrent neural
networks, and the architecture is quite
- [00:02:20.280]involved. So I will try to break down each
individual component so that we can
- [00:02:24.840]understand the big picture
of how these models work.
- [00:02:30.690]The very first part of the
model architecture is known
as the track embedding
- [00:02:36.120]network. And the goal of this track
embedding network is to convert each
- [00:02:40.800]track into a numerical
representation in the form of
- [00:02:45.720]a vector. And how it does
- [00:02:47.490]so is: we know there are some
static features for the
- [00:02:51.960]tracks, such as their release
year and how popular the tracks are.
- [00:02:56.860]We take those features for
each track, and we also let the
- [00:03:01.390]model automatically learn a
D-dimensional vector for each
- [00:03:06.910]track. We take both of these, concatenate
them into one single vector, and
- [00:03:11.740]then pass this vector through
a fully connected dense layer,
- [00:03:16.480]the output of which is known as the
embedded representation of this
- [00:03:21.070]particular track.
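As a rough illustration of the step just described, here is a minimal NumPy sketch of a track embedding network. All sizes (embedding dimension 64, 1,000 tracks instead of the full ~350,000, two static features) and the ReLU-activated dense layer are illustrative assumptions, not the talk's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1,000 tracks (the real data had ~350,000),
# learned embedding dimension d = 64, and 2 static features
# (e.g. release year and popularity, standardized).
n_tracks, d, n_static = 1_000, 64, 2

track_table = rng.normal(size=(n_tracks, d))       # learned lookup, trained end to end
static_feats = rng.normal(size=(n_tracks, n_static))

# Fully connected dense layer mapping the concatenated vector back to size d
W = rng.normal(size=(d + n_static, d)) * 0.1
b = np.zeros(d)

def embed_track(track_id: int) -> np.ndarray:
    """Concatenate the learned vector with static features, then apply a dense layer."""
    x = np.concatenate([track_table[track_id], static_feats[track_id]])
    return np.maximum(x @ W + b, 0.0)  # ReLU-activated dense output

print(embed_track(42).shape)  # (64,)
```

In a real model the lookup table and dense-layer weights would be learned by backpropagation; here they are random so the shapes and data flow are the only meaningful part.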
- [00:03:22.840]Next we have the session encoding
network, and the goal of the session
- [00:03:27.430]encoding network is to form a
vector representation of the
- [00:03:32.080]kind of music tracks that
were heard across the session.
- [00:03:36.910]And how it does
- [00:03:37.870]so is that, for each time
step in the sequence,
- [00:03:42.130]it reads the embedding that we learned
from the track embedding network and
- [00:03:47.230]passes it through an LSTM,
- [00:03:49.690]which is a long short-term memory
model. That is a deep learning model
- [00:03:54.160]which is adept at handling
time series or sequential data
- [00:03:58.990]like we are dealing with over here.
- [00:04:01.420]And the LSTM reads this input at each time
- [00:04:05.530]step: it gives an output for the first
time step, which is over here,
- [00:04:10.150]gives an output for the second time
- [00:04:11.830]step, and gives an output
for the last time step.
- [00:04:14.700]Now we have a matrix of different
- [00:04:19.470]outputs corresponding to each time step.
- [00:04:23.310]What the attention mechanism
does is it finds the best
- [00:04:27.810]way to take a weighted
sum of these outputs,
- [00:04:32.490]which we then consider
as the session encoding.
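The attention-weighted sum over the per-step LSTM outputs can be sketched as follows. The scoring scheme (a single learned vector dotted with each output) is one simple, assumed form of attention; the sequence length and hidden size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

T, h = 10, 32                        # 10 time steps, hypothetical LSTM output size 32
outputs = rng.normal(size=(T, h))    # one LSTM output vector per time step

# Hypothetical attention: score each output with a learned vector,
# softmax the scores into weights, then take the weighted sum.
v = rng.normal(size=h)               # learned scoring vector
scores = outputs @ v                 # (T,) one score per time step
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: non-negative, sums to 1

session_encoding = weights @ outputs # (h,) weighted sum over time steps
print(session_encoding.shape)        # (32,)
```

The point is that the model learns the weights rather than, say, always keeping only the last output, so informative time steps can contribute more to the session encoding.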
- [00:04:37.170]This session encoding, along
with some meta features, such as
- [00:04:41.820]whether this session was on a
premium Spotify subscription or not,
- [00:04:47.310]instantiates a stack of two
LSTMs in the playback
- [00:04:51.930]encoding network.
- [00:04:53.070]The goal of this network is to
read over the first half of the
- [00:04:57.750]sequence and form an encoded
representation, and how it does.
- [00:05:02.250]So is: again, for each time step,
- [00:05:06.090]it takes the embedding from
the track embedding network,
- [00:05:09.330]along with some other playback features,
which, as we know, are features such as
- [00:05:14.340]how the user ended up at this
particular song.
- [00:05:18.750]It passes it through the stack of LSTMs,
- [00:05:22.350]which then passes that output to the next
- [00:05:27.060]iteration in the sequence.
- [00:05:29.790]And it also passes the embedding for the
- [00:05:34.770]second track in the sequence,
along with its playback features.
- [00:05:38.460]So we can think of this LSTM as
- [00:05:42.750]processing the relevant
output at each time step,
- [00:05:46.260]while keeping in memory
what has happened so far,
- [00:05:49.470]and that is the real benefit of LSTMs.
- [00:05:51.900]And after it passes over
the entire sequence,
- [00:05:54.660]we have this encoded representation,
- [00:05:57.560]which is then used to fire up
- [00:06:00.880]another stack of two LSTMs for the second
half of the sequence. And remember,
- [00:06:05.920]we are trying to get a skip or a
no-skip prediction for the second half.
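The encoder loop just described can be sketched with a plain recurrent cell standing in for the LSTM (a real LSTM adds gating to preserve long-range memory, which this sketch omits); all sizes and weights here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

T, d, h = 10, 64, 32       # 10 first-half tracks, embedding dim 64, hidden size 32
embeddings = rng.normal(size=(T, d))   # per-track embeddings from the embedding network

Wx = rng.normal(size=(h, d)) * 0.1     # input-to-hidden weights
Wh = rng.normal(size=(h, h)) * 0.1     # hidden-to-hidden weights (the "memory")

state = np.zeros(h)
for t in range(T):
    # Each step combines the current track's embedding with the running state,
    # so the final state summarizes the whole first half of the session.
    state = np.tanh(Wx @ embeddings[t] + Wh @ state)

encoded_representation = state         # handed to the second-half decoder
print(encoded_representation.shape)    # (32,)
```

Using the final state to initialize the decoder's stack of LSTMs is what makes this an encoder-decoder architecture: the second-half predictions are conditioned on everything heard in the first half.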
- [00:06:11.500]And how it does
- [00:06:12.160]so is: it reads the embedding from
the track embedding network, passes
- [00:06:17.020]it through the stack of LSTMs, and
gets some processed output
- [00:06:22.000]from the LSTM, which is passed
- [00:06:24.250]to a fully connected network
with one neuron activated by the
- [00:06:29.200]sigmoid activation to make the
one-or-zero skip prediction.
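The final prediction neuron can be sketched like this; the hidden size, the random stand-in for the decoder LSTM's output, and the 0.5 decision threshold are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

h = 32                               # hypothetical decoder LSTM hidden size
lstm_output = rng.normal(size=h)     # processed output for one second-half track

# Final fully connected layer: a single neuron with a sigmoid activation
w = rng.normal(size=h) * 0.1
b = 0.0

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

p_skip = sigmoid(w @ lstm_output + b)   # skip probability, strictly in (0, 1)
prediction = int(p_skip >= 0.5)         # 1 = skip, 0 = no skip
print(p_skip, prediction)
```

The sigmoid squashes the neuron's raw score into a probability, which is what lets the model be trained with the binary cross-entropy loss mentioned next.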
- [00:06:33.430]And you can see all of
these are interconnected:
- [00:06:35.740]the track embedding flows
into the session encoding,
- [00:06:38.560]which flows into the playback encoding
that is connected to the prediction
- [00:06:42.760]network. And this insight allows us to
train this model end to end using
- [00:06:47.410]the binary cross-entropy loss. We
use average accuracy as the
- [00:06:51.970]evaluation metric. For example, if
we get five out of ten in a sequence
- [00:06:56.890]right, the average accuracy is 0.5.
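The training loss and the evaluation metric just mentioned are standard and can be written out directly; the toy labels and probabilities below are made up to reproduce the five-out-of-ten example:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 skip labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def average_accuracy(y_true, p_pred):
    """Fraction of tracks in a session whose thresholded skip prediction is correct."""
    return float(np.mean((p_pred >= 0.5).astype(int) == y_true))

# Toy session: 10 second-half tracks, 5 of 10 predictions correct -> accuracy 0.5
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.6, 0.3, 0.7, 0.4, 0.2])
print(average_accuracy(y, p))        # 0.5
print(binary_cross_entropy(y, p))
```

The cross-entropy is what gradients flow through during end-to-end training; the average accuracy is only used afterwards to judge the model on held-out sessions.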
- [00:06:59.320]We use a validation set of 20% to
monitor when to stop the training,
- [00:07:04.090]depending upon the improvements
on the validation set.
- [00:07:07.570]We use an Adam optimizer,
- [00:07:09.280]which is an improved version
of stochastic gradient descent.
- [00:07:12.580]And we see that whether we use a
stacked LSTM of depth two or a stacked
- [00:07:17.530]LSTM of depth three,
- [00:07:19.780]the average accuracy on the
validation set is around 66% for both.
- [00:07:25.090]And the conclusions:
- [00:07:26.290]the learned embeddings can actually be
visualized to understand what the model
- [00:07:30.970]is learning. If we project
these track embeddings
- [00:07:35.200]onto a lower-dimensional space,
- [00:07:36.280]we would expect tracks from similar
artists to cluster in about the
- [00:07:41.260]same areas. And such models are very
useful and can be used for many
- [00:07:46.150]use cases, such as how people make
decisions after a natural disaster,
- [00:07:50.830]or what classes to recommend to a
student depending upon their prior
- [00:07:54.980]performance. And that's all I had.
- [00:07:58.660]Thank you so much for your
time. If you have any questions,
- [00:08:01.390]reach out to me on my email, or feel
free to connect with me on LinkedIn.
- [00:08:05.500]And these are some of the
references that I used in this work.
- [00:08:09.100]Thank you so much again,