Sequential Music Track Skips Prediction using Recurrent Neural Networks

Ved Piyush Author

04/05/2021 Added


Description

How do music streaming companies know which songs we would like to listen to based on our recent listening history? Tune in to learn about a method based on deep learning.

Searchable Transcript


[00:00:00.900]Hi everyone.
[00:00:01.800]My name is [inaudible] and I'm a third-year PhD student in the Department of
[00:00:06.480]Statistics, working under the supervision of Dr. [inaudible].
[00:00:11.100]Today I'm going to talk to you about sequential music track
[00:00:15.950]skips prediction using recurrent neural networks. So let's get right to it.
[00:00:21.420]We all listen to music in our daily lives,
[00:00:24.690]and we usually go into these apps and search for our favorite
[00:00:29.370]songs. After that,
[00:00:31.560]their recommendation system kind of just takes over. Success
[00:00:36.360]for such streaming companies is if we do not skip over a song
[00:00:41.250]that they recommend to us,
[00:00:43.590]and the goal of these companies is to minimize such
[00:00:48.120]manual skips by the users,
[00:00:50.460]so as to improve their user experience.
[00:00:54.270]The data for this project actually comes from Spotify's
[00:00:58.050]own Sequential Skip Prediction Challenge.
[00:01:01.650]I took a reasonable sample of 250,000 sessions,
[00:01:05.490]which had information on around 350,000 tracks.
[00:01:11.070]It had two main parts. The first is the user log,
[00:01:15.270]which had information on the exact music tracks
[00:01:19.320]the user listened to, whether they skipped those tracks or did not skip those
[00:01:24.090]tracks,
[00:01:24.840]and what action led them to the current song: whether they selected it
[00:01:29.340]themselves, or they shuffled and ended up at the song.
[00:01:33.540]We also have some other features on the music tracks, such as their release
[00:01:38.400]year, how popular they are, and the loudness of the songs,
[00:01:43.170]et cetera.
[00:01:44.880]The exact prediction task is best summarized using this
[00:01:49.290]graphic. Suppose a user liked a track from Pink Floyd,
[00:01:54.330]liked a track from Guns N' Roses, and after that did not like a track from the Beatles.
[00:01:59.670]Can we predict whether they would like or dislike a track from
[00:02:04.410]AC/DC? Now, the methods.
[00:02:08.370]The class of models used in this work is known as encoder-decoder
[00:02:13.290]sequence-to-sequence models,
[00:02:15.300]which are based on recurrent neural networks, and the architecture is quite
[00:02:19.680]complex.
[00:02:20.280]So I will try to break down each individual component so that we can
[00:02:24.840]understand the big picture of how these models work.
[00:02:30.690]The very first part of the model architecture is known as the track embedding
[00:02:35.520]network.
[00:02:36.120]The goal of this track embedding network is to convert each
[00:02:40.800]track into a numerical representation in the form of
[00:02:45.720]a vector. How it does
[00:02:47.490]so is this: we know there are some static features for the
[00:02:51.960]tracks, such as their release year and how popular the tracks are.
[00:02:56.860]We take those features for each track, and we also let the
[00:03:01.390]model automatically learn a D-dimensional vector for each
[00:03:06.160]track.
[00:03:06.910]We take both of these, concatenate them into one single vector, and
[00:03:11.740]then pass this vector through a fully connected dense layer,
[00:03:16.480]the output of which is then known as the embedded representation of this
[00:03:21.070]particular track.
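The track embedding step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the speaker's implementation: the catalogue size, the three static features, the dimensions, the random initial weights, and the tanh activation are all assumptions made for the example. In a real model the lookup table and dense-layer weights would be learned during training.

```python
import math
import random

random.seed(0)

NUM_TRACKS = 5     # hypothetical catalogue size
NUM_STATIC = 3     # e.g. scaled release year, popularity, loudness (assumed)
LEARNED_DIM = 4    # the "D" of the D-dimensional learned vector (assumed)
OUTPUT_DIM = 3     # size of the final track embedding (assumed)

# Hypothetical static features, one row per track.
static_features = [[random.random() for _ in range(NUM_STATIC)]
                   for _ in range(NUM_TRACKS)]

# Lookup table of per-track vectors; randomly initialised here,
# learned by the model in practice.
learned_vectors = [[random.gauss(0, 0.1) for _ in range(LEARNED_DIM)]
                   for _ in range(NUM_TRACKS)]

# Fully connected dense layer applied to the concatenated vector.
INPUT_DIM = NUM_STATIC + LEARNED_DIM
dense_weights = [[random.gauss(0, 0.1) for _ in range(INPUT_DIM)]
                 for _ in range(OUTPUT_DIM)]
dense_bias = [0.0] * OUTPUT_DIM


def embed_track(track_id):
    """Concatenate static and learned vectors, then apply the dense layer."""
    concatenated = static_features[track_id] + learned_vectors[track_id]
    return [
        math.tanh(sum(w * x for w, x in zip(row, concatenated)) + b)
        for row, b in zip(dense_weights, dense_bias)
    ]


track_embedding = embed_track(2)
print(track_embedding)
```

Calling `embed_track` once per track in a session yields the sequence of vectors that the later networks read.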
[00:03:22.840]Next we have the session encoding network. The goal of the session
[00:03:27.430]encoding network is to form a vector representation for the
[00:03:32.080]kind of music tracks that were heard across the session.
[00:03:36.910]How it does
[00:03:37.870]so is that for each time step in the sequence,
[00:03:42.130]it reads the embedding that we learned from the track embedding network and
[00:03:47.230]passes it through an LSTM,
[00:03:49.690]which is a long short-term memory model. That is a deep learning model
[00:03:54.160]which is adept at handling time series or sequential data
[00:03:58.990]like we are dealing with over here.
[00:04:01.420]The LSTM reads this input at each time
[00:04:05.530]step: it gives an output for the first time step, which is over here,
[00:04:10.150]gives an output for the second time
[00:04:11.830]step, and gives an output for the last time step.
[00:04:14.700]Now we have a matrix of different
[00:04:19.470]outputs corresponding to each time step.
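To make the "one output per time step" idea concrete, here is a small sketch of a standard LSTM cell run over a short session, written in plain Python. The dimensions, the random weights, and the four-track session are assumptions for illustration only; the talk does not give these details, and a real model would use a deep learning library with trained weights.

```python
import math
import random

random.seed(1)

INPUT_DIM = 3    # size of each track embedding (assumed)
HIDDEN_DIM = 2   # size of the LSTM hidden state (assumed)
GATES = ("input", "forget", "output", "candidate")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(matrix, vector, bias):
    """Compute matrix @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


# One weight matrix and bias per gate; each gate sees [input, previous hidden].
weights = {g: [[random.gauss(0, 0.5) for _ in range(INPUT_DIM + HIDDEN_DIM)]
               for _ in range(HIDDEN_DIM)] for g in GATES}
biases = {g: [0.0] * HIDDEN_DIM for g in GATES}


def lstm_step(x, h_prev, c_prev):
    """One LSTM time step: returns the new hidden state and cell state."""
    z = x + h_prev  # list concatenation of input and previous hidden state
    i = [sigmoid(v) for v in affine(weights["input"], z, biases["input"])]
    f = [sigmoid(v) for v in affine(weights["forget"], z, biases["forget"])]
    o = [sigmoid(v) for v in affine(weights["output"], z, biases["output"])]
    g = [math.tanh(v) for v in affine(weights["candidate"], z, biases["candidate"])]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(sequence):
    """Read the sequence step by step and collect one output per time step."""
    h = [0.0] * HIDDEN_DIM
    c = [0.0] * HIDDEN_DIM
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c)
        outputs.append(h)
    return outputs


# Hypothetical session of four track embeddings.
session = [[random.random() for _ in range(INPUT_DIM)] for _ in range(4)]
lstm_outputs = run_lstm(session)
print(lstm_outputs)  # a 4 x HIDDEN_DIM matrix, one row per time step
```

The cell state `c` is what carries information forward from earlier tracks, while each row of `lstm_outputs` is the output for one time step.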
[00:04:23.310]What the attention mechanism does is find the best
[00:04:27.810]way to take a weighted sum of these outputs,
[00:04:32.490]which we then consider as the session encoding.
[00:04:37.170]This session encoding, along with some meta features such as
[00:04:41.820]whether the session was on a premium Spotify subscription or not,
[00:04:47.310]is used to initialize a stack of two LSTMs in the playback
[00:04:51.930]encoding network.
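The weighted sum performed by the attention mechanism can be sketched as below. The talk does not say which attention variant is used, so this example assumes a simple form: each LSTM output is scored by a dot product with a learned scoring vector, and the scores are turned into weights with a softmax. The numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, scoring_vector):
    """Weighted sum of per-time-step outputs (assumed dot-product scoring)."""
    scores = [sum(w * h for w, h in zip(scoring_vector, out)) for out in outputs]
    weights = softmax(scores)
    dim = len(outputs[0])
    pooled = [sum(wt * out[k] for wt, out in zip(weights, outputs))
              for k in range(dim)]
    return pooled, weights


# Hypothetical LSTM outputs for three time steps, hidden size two.
lstm_outputs = [[0.1, -0.3], [0.5, 0.2], [-0.2, 0.4]]
scoring_vector = [1.0, -1.0]  # would be learned in a real model

session_encoding, attention_weights = attention_pool(lstm_outputs, scoring_vector)
print(attention_weights)
print(session_encoding)
```

The resulting `session_encoding` has the same size as a single LSTM output, regardless of how many tracks the session contains.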
[00:04:53.070]The goal of this network is to read over the first half of the
[00:04:57.750]sequence and form an encoded representation. How it does
[00:05:02.250]so is that, again, for each time step, we
[00:05:06.090]take the embedding from the track embedding network,
[00:05:09.330]along with some other playback features, which as we know are features such as
[00:05:14.340]how the user ended up at this particular song, and concatenate these.
[00:05:18.750]It passes this through the stack of LSTMs,
[00:05:22.350]which then passes that output to the next
[00:05:27.060]iteration in the sequence.
[00:05:29.790]It also passes in the embedding for the
[00:05:34.770]second track in the sequence, along with its playback features.
[00:05:38.460]So we can think of this LSTM as
[00:05:42.750]producing the relevant output at each time step
[00:05:46.260]while keeping in memory what has happened so far,
[00:05:49.470]and that is the real benefit of LSTMs.
[00:05:51.900]After it passes over the entire sequence,
[00:05:54.660]we have this encoded representation,
[00:05:57.560]which is then used to fire up
[00:06:00.880]another stack of two LSTMs for the second half of the sequence. Remember,
[00:06:05.920]we are trying to get a skip or a no-skip prediction for the second half.
[00:06:11.500]How it does
[00:06:12.160]so is it reads the embedding from the track embedding network, passes
[00:06:17.020]it through the stack of LSTMs, and gets some processed output
[00:06:22.000]from the LSTMs, which is passed
[00:06:24.250]to a fully connected network with one neuron activated by the
[00:06:29.200]sigmoid activation to make the one-or-zero skip prediction.
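The final prediction step, a single neuron with a sigmoid activation, can be sketched as follows. The decoder output, the weights, the bias, and the 0.5 decision threshold are assumed values for illustration; in the trained model the weights and bias would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_skip(decoder_output, weights, bias, threshold=0.5):
    """Single-neuron dense layer followed by a sigmoid.

    Returns the skip probability and the hard 0/1 prediction.
    """
    logit = sum(w * h for w, h in zip(weights, decoder_output)) + bias
    probability = sigmoid(logit)
    return probability, int(probability >= threshold)


# Hypothetical processed LSTM output for one track in the second half.
decoder_output = [0.2, -0.4, 0.7]
weights = [1.5, -0.5, 2.0]   # assumed values, learned in practice
bias = -0.3

probability, skip = predict_skip(decoder_output, weights, bias)
print(probability, skip)
```

The same function would be applied at every time step of the second half of the session, giving one skip or no-skip prediction per track.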
[00:06:33.430]You can see all of these are interconnected.
[00:06:35.740]The track embedding flows into the session encoding,
[00:06:38.560]which flows into the playback encoding, which is connected to the prediction
[00:06:41.980]network.
[00:06:42.760]This insight allows us to train the model end to end using
[00:06:47.410]binary cross-entropy loss. We use average accuracy as the
[00:06:51.970]evaluation metric. For example, if we get five out of 10 in a sequence
[00:06:56.890]right, the average accuracy is 0.5.
[00:06:59.320]We use a validation set of 20% to monitor when to stop the training,
[00:07:04.090]depending upon the improvements on the validation set, and we
[00:07:07.570]use the Adam optimizer,
[00:07:09.280]which is an improved version of stochastic gradient descent.
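As a small worked example of the loss and the metric mentioned above, the sketch below computes binary cross-entropy and the fraction of correct skip predictions for one hypothetical sequence of ten tracks. The labels and predicted probabilities are invented so that five of the ten predictions are right, matching the example in the talk; the 0.5 threshold is an assumption.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def average_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of tracks in the sequence whose skip label is predicted correctly."""
    predictions = [int(p >= threshold) for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical sequence of ten tracks: 1 = skipped, 0 = not skipped.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.8, 0.2, 0.45]

loss = binary_cross_entropy(labels, probabilities)
accuracy = average_accuracy(labels, probabilities)
print(loss, accuracy)
```

The loss is what the optimizer minimizes during training, while the accuracy is the easier-to-read number that is monitored on the validation set.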
[00:07:12.580]We see that whether we use a stacked LSTM of depth two or a stacked
[00:07:17.530]LSTM of depth three,
[00:07:19.780]the average accuracy on the validation set is around 66% for both.
[00:07:25.090]Now, the conclusions.
[00:07:26.290]The learned embeddings can actually be visualized to understand what the model
[00:07:30.970]is learning. If we project these track embeddings
[00:07:35.200]onto a lower-dimensional space,
[00:07:36.280]we would expect tracks from similar artists to cluster in about the
[00:07:41.260]same areas. Such models are very useful and can be used for many
[00:07:46.150]use cases, such as how people make decisions after a natural disaster, or
[00:07:50.830]what classes to recommend to a student depending upon their prior
[00:07:54.980]performance. And that's all I had.
[00:07:58.660]Thank you so much for your time. If you have any questions,
[00:08:01.390]reach out to me by email, and feel free to connect with me on LinkedIn.
[00:08:05.500]These are some of the references that I used in this work.
[00:08:09.100]Thank you so much again.


Tags:
Deep Learning
Machine Learning
Recommendation Systems
Artificial Intelligence
Python Programming
Recurrent Neural Networks
