Sequential Music Track Skips Prediction using Recurrent Neural Networks
Author: Ved Piyush
Added: 04/05/2021
Plays: 109
Description
How do music streaming companies know which songs we would like to listen to based on our recent listening history? Tune in to learn about a method based on deep learning.
Transcript
- [00:00:00.900]Hi everyone.
- [00:00:01.800]My name is Ved Piyush, and I'm a third-year PhD student in the department of
- [00:00:06.480]statistics working under the supervision of Dr. Souparno Ghosh.
- [00:00:11.100]And today I'm going to talk to you about sequential music track
- [00:00:15.950]skips prediction using recurrent neural networks. So let's get right to it.
- [00:00:21.420]We all listen to music in our daily lives,
- [00:00:24.690]and we usually go into these apps and search for our favorite
- [00:00:29.370]songs. After that,
- [00:00:31.560]the recommendation system just takes over. Success
- [00:00:36.360]for such streaming companies is when we do not skip over a song
- [00:00:41.250]that they recommend to us,
- [00:00:43.590]and the goal of these companies is to minimize such
- [00:00:48.120]manual skips by the users,
- [00:00:50.460]so as to improve the user experience.
- [00:00:54.270]The data for this project actually comes from Spotify's
- [00:00:58.050]own Sequential Skip Prediction Challenge.
- [00:01:01.650]I took a reasonable sample of 250,000 sessions,
- [00:01:05.490]which had information on around 350,000 tracks.
- [00:01:11.070]And it had two main parts. The first is the user log,
- [00:01:15.270]which had information on the exact music tracks
- [00:01:19.320]the user listened to, whether they skipped those tracks or did not skip those
- [00:01:24.090]tracks,
- [00:01:24.840]and what action led them to the current song: whether they selected it
- [00:01:29.340]themselves, or shuffled and ended up at the song.
- [00:01:33.540]And we also have some other features on the music tracks, such as their release
- [00:01:38.400]year, how popular they are, the loudness of the songs,
- [00:01:43.170]et cetera.
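To make the two parts of the data concrete, here is a hypothetical record of each type; the field names and values are illustrative assumptions, not Spotify's exact schema.

Python

# a made-up user-log entry and track-feature record, mirroring the two
# data parts described above (all field names and values are assumptions)
session_log_entry = {
    "session_id": "s_000001",
    "position": 3,                 # place of this track in the session
    "track_id": "t_000042",
    "skipped": True,               # the skip / no-skip label
    "reason_start": "clickrow",    # action that led to this track
    "premium": True,               # session meta feature used later on
}
track_features = {
    "track_id": "t_000042",
    "release_year": 1987,          # static track features
    "popularity": 0.81,
    "loudness_db": -7.4,
}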
- [00:01:44.880]The exact prediction task is best summarized using this
- [00:01:49.290]graphic: suppose a user liked a track from Pink Floyd,
- [00:01:54.330]liked a track from Guns N' Roses, and after that did not like a track from the Beatles.
- [00:01:59.670]Can we predict whether they would like or dislike a track from
- [00:02:04.410]AC/DC? Now for the methods.
- [00:02:08.370]The class of models used in this work are known as encoder-decoder
- [00:02:13.290]sequence-to-sequence models,
- [00:02:15.300]which are based on recurrent neural networks, and the architecture is quite
- [00:02:19.680]complex.
- [00:02:20.280]So I will try to break down each individual component so that we can
- [00:02:24.840]understand the big picture of how these models work.
- [00:02:30.690]The very first part of the model architecture is known as the track embedding
- [00:02:35.520]network.
- [00:02:36.120]And the goal of this track embedding network is to convert each
- [00:02:40.800]track into a numerical representation in the form of
- [00:02:45.720]a vector. And how it does
- [00:02:47.490]so is: we have some static features for the
- [00:02:51.960]tracks, such as their release year and how popular the tracks are.
- [00:02:56.860]We take those features for each track, and we also let the
- [00:03:01.390]model automatically learn a d-dimensional vector for each
- [00:03:06.160]track.
- [00:03:06.910]We take both of these, concatenate them into one single vector, and
- [00:03:11.740]then pass this vector through a fully connected dense layer,
- [00:03:16.480]whose output is then known as the embedded representation of this
- [00:03:21.070]particular track.
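A minimal sketch of this track embedding network in Keras; the vocabulary size, embedding dimension, and number of static features are assumptions, since the talk does not give exact values.

Python

import tensorflow as tf

NUM_TRACKS = 350_000   # roughly the number of tracks in the sample
EMBED_DIM = 64         # assumed d; the talk does not state the dimension
NUM_STATIC = 3         # e.g. release year, popularity, loudness

track_id = tf.keras.Input(shape=(), dtype=tf.int32, name="track_id")
static_feats = tf.keras.Input(shape=(NUM_STATIC,), name="static_features")

# let the model automatically learn a d-dimensional vector per track
learned = tf.keras.layers.Embedding(NUM_TRACKS, EMBED_DIM)(track_id)

# concatenate the learned vector with the static features, then pass the
# result through a fully connected dense layer: the track embedding
concat = tf.keras.layers.Concatenate()([learned, static_feats])
track_embedding = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(concat)

track_embedder = tf.keras.Model([track_id, static_feats], track_embedding)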
- [00:03:22.840]Next we have the session encoding network, and the goal of the session
- [00:03:27.430]encoding network is to form a vector representation of the
- [00:03:32.080]kind of music tracks that were heard across the session.
- [00:03:36.910]And how it does
- [00:03:37.870]so is that for each time step in the sequence,
- [00:03:42.130]it reads the embedding that we learned from the track embedding network
- [00:03:47.230]and passes it through an LSTM,
- [00:03:49.690]which is a long short-term memory model. That is a deep learning model
- [00:03:54.160]which is adept at handling the time series or sequential data
- [00:03:58.990]that we are dealing with over here.
- [00:04:01.420]The LSTM reads this input at each time
- [00:04:05.530]step and gives an output for that particular time step:
- [00:04:10.150]an output for the first time step, an output for the second time
- [00:04:11.830]step, and so on up to an output for the last time step.
- [00:04:14.700]Now we have a matrix of different
- [00:04:19.470]outputs corresponding to each time step.
- [00:04:23.310]What the attention mechanism does is find the best
- [00:04:27.810]way to take a weighted sum of these outputs,
- [00:04:32.490]which we then consider as the session encoding.
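A minimal sketch of the session encoding network with attention pooling, here implemented as a simple learned score per time step; the sequence length and hidden size are assumptions.

Python

import tensorflow as tf

SESSION_LEN = 20   # assumed session length
EMBED_DIM = 64     # matches the track embedding sketch above
HIDDEN = 128       # assumed LSTM width

track_embeds = tf.keras.Input(shape=(SESSION_LEN, EMBED_DIM))

# the LSTM emits one output vector per time step
outputs = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(track_embeds)

# attention: score each time step, softmax over time, take the weighted sum
scores = tf.keras.layers.Dense(1)(outputs)          # (batch, T, 1)
weights = tf.keras.layers.Softmax(axis=1)(scores)   # weights over time steps
session_encoding = tf.keras.layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1)
)([weights, outputs])                                # (batch, HIDDEN)

session_encoder = tf.keras.Model(track_embeds, session_encoding)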
- [00:04:37.170]This session encoding, along with some meta features such as
- [00:04:41.820]whether this session was on a premium Spotify subscription or not,
- [00:04:47.310]instantiates a stack of two LSTMs in the playback
- [00:04:51.930]encoding network.
- [00:04:53.070]The goal of this network is to read over the first half of the
- [00:04:57.750]sequence and form an encoded representation. And how it does
- [00:05:02.250]so is that, again for each time step,
- [00:05:06.090]it takes the embedding from the track embedding network,
- [00:05:09.330]along with some other playback features, which as we know are features such as
- [00:05:14.340]how the user ended up at this particular song, concatenates them,
- [00:05:18.750]and passes the result through the stack of LSTMs,
- [00:05:22.350]which then passes that output to the next
- [00:05:27.060]step in the sequence.
- [00:05:29.790]And it also passes in the embedding for the
- [00:05:34.770]second track in the sequence, along with its playback features, and so on.
- [00:05:38.460]So we can think of this LSTM as
- [00:05:42.750]processing the relevant output at each time step
- [00:05:46.260]while keeping in memory what has happened so far,
- [00:05:49.470]and that is the real benefit of LSTMs.
- [00:05:51.900]And after it passes over the entire sequence,
- [00:05:54.660]we have this encoded representation,
- [00:05:57.560]which is then used to fire up
- [00:06:00.880]another stack of two LSTMs for the second half of the sequence. And remember,
- [00:06:05.920]we are trying to get a skip or no-skip prediction for the second half.
- [00:06:11.500]And how it does
- [00:06:12.160]so is: it reads the embedding from the track embedding network, passes
- [00:06:17.020]it through the stack of LSTMs, and the processed output
- [00:06:22.000]from the LSTM is passed
- [00:06:24.250]to a fully connected network with one neuron activated by the
- [00:06:29.200]sigmoid activation to make the one-or-zero skip prediction.
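A minimal sketch of the playback encoder and the decoder that predicts skips over the second half. The sizes and the state handoff are assumptions; one common choice, shown here, is to fire up the decoder stack with the encoder stack's final states.

Python

import tensorflow as tf

HALF_LEN, EMBED_DIM, HIDDEN = 10, 64, 128  # assumed sizes

# track embeddings (concatenated with playback features upstream) per half
first_half = tf.keras.Input(shape=(HALF_LEN, EMBED_DIM), name="first_half")
second_half = tf.keras.Input(shape=(HALF_LEN, EMBED_DIM), name="second_half")

# playback encoder: a stack of two LSTMs reads over the first half
seq1, h1, c1 = tf.keras.layers.LSTM(
    HIDDEN, return_sequences=True, return_state=True)(first_half)
_, h2, c2 = tf.keras.layers.LSTM(
    HIDDEN, return_sequences=True, return_state=True)(seq1)

# decoder: another stack of two LSTMs, initialized with the encoder's states
x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(
    second_half, initial_state=[h1, c1])
x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(
    x, initial_state=[h2, c2])

# one sigmoid neuron per time step for the skip / no-skip prediction
skip_prob = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(1, activation="sigmoid"))(x)

model = tf.keras.Model([first_half, second_half], skip_prob)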
- [00:06:33.430]And you can see all of these are interconnected:
- [00:06:35.740]the track embedding flows into the session encoding,
- [00:06:38.560]which flows into the playback encoding, which is connected to the prediction
- [00:06:41.980]network.
- [00:06:42.760]And this insight allows us to train this model end to end using
- [00:06:47.410]binary cross-entropy loss. We use an average accuracy
- [00:06:51.970]evaluation metric: for example, if we get five out of ten in a sequence
- [00:06:56.890]right, the average accuracy is 0.5.
- [00:06:59.320]We use a validation set of 20% to monitor when to stop the training,
- [00:07:04.090]depending upon the improvements on the validation set,
- [00:07:07.570]and use an Adam optimizer,
- [00:07:09.280]which is an improved version of stochastic gradient descent.
- [00:07:12.580]And we see that whether we use a stacked LSTM of depth two or a stacked
- [00:07:17.530]LSTM of depth three,
- [00:07:19.780]the average accuracy on the validation set is around 66% for both.
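A minimal sketch of this training setup, assuming the `model` from the sketch above; the random arrays are toy stand-ins for real session data, and per-step binary accuracy averaged over sessions approximates the average-accuracy metric described.

Python

import numpy as np
import tensorflow as tf

# toy stand-ins for real data: embeddings per half plus skip labels
X_first = np.random.rand(1000, 10, 64).astype("float32")
X_second = np.random.rand(1000, 10, 64).astype("float32")
y_skips = np.random.randint(0, 2, size=(1000, 10, 1)).astype("float32")

model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # improved variant of SGD
    loss="binary_crossentropy",            # end-to-end training objective
    metrics=["binary_accuracy"],           # per-step skip accuracy
)

# stop training when the 20% validation split stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

model.fit([X_first, X_second], y_skips,
          validation_split=0.2, epochs=50, callbacks=[early_stop])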
- [00:07:25.090]And the conclusions:
- [00:07:26.290]the learned embeddings can actually be visualized to understand what the model
- [00:07:30.970]is learning. If we project these track embeddings
- [00:07:35.200]onto a lower-dimensional space,
- [00:07:36.280]we would expect tracks from similar artists to cluster in about the
- [00:07:41.260]same areas (a sketch of such a projection follows the transcript). And such models
- [00:07:46.150]are very useful and can be applied to many
- [00:07:50.830]use cases, such as how people make decisions after a natural disaster,
- [00:07:54.980]or what classes to recommend to a student depending upon their prior performance. And that's all I had.
- [00:07:58.660]Thank you so much for your time. If you have any questions,
- [00:08:01.390]reach out to me on my email, feel free to connect with me on LinkedIn.
- [00:08:05.500]And these are some of the references that I used in this work.
- [00:08:09.100]Thank you so much again,
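As a minimal sketch of the embedding projection mentioned in the conclusions, here is one way to do it with PCA (t-SNE or UMAP are common alternatives); the random vectors stand in for the real learned track embeddings.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# stand-in for the learned track embedding matrix (tracks x dimensions)
embeddings = np.random.rand(500, 64)

# project onto two dimensions and plot; tracks from similar artists would
# be expected to cluster in roughly the same areas
coords = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("Track embeddings projected to 2-D")
plt.show()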
- Tags:
- Deep Learning
- Machine Learning
- Recommendation Systems
- Artificial Intelligence
- Python Programming
- Recurrent Neural Networks