Improving Accuracy of Rare Classes in Machine Learning Classifiers
Eylon Caplan
Author
08/03/2020
Description
This video summarizes my work and findings in simple machine learning models and in techniques that improve the classification of rare classes.
Searchable Transcript
- [00:00:00.000]Hi, my name is Eylon Caplan
- [00:00:04.178]and I did my UCARE project
- [00:00:06.710]on machine learning
- [00:00:08.201]and specifically the problem that we
- [00:00:10.411]tackled was improving the accuracy of
- [00:00:12.861]certain rare classes in machine learning
- [00:00:16.274]classification.
- [00:00:18.053]And I did my project with Dr. Hasan.
- [00:00:22.700]So I'll first introduce machine learning
- [00:00:24.971]in general. So, the classifiers that we're
- [00:00:27.947]talking about are some sort of algorithm
- [00:00:30.018]that is able to take as input a large
- [00:00:32.823]training set of data. And then, by some
- [00:00:36.745]means, learn to distinguish between
- [00:00:39.740]different classes. And then given a
- [00:00:41.958]certain input, it's supposed to be able
- [00:00:45.111]to classify that into one of the classes,
- [00:00:47.705]hopefully with a high accuracy. And so,
- [00:00:51.226]specifically, we were talking about
- [00:00:53.269]image classification, but the problem
- [00:00:55.754]can be expanded to any classification,
- [00:00:58.933]really. So one goal was just to explore
- [00:01:04.092]different models, find out their strengths
- [00:01:06.842]and weaknesses, find ways to evaluate how
- [00:01:09.271]good a model is, and then, of course, to
- [00:01:12.021]improve the accuracy of certain classes
- [00:01:14.593]that are rare during training. So I first
- [00:01:18.705]started out with some simple regression,
- [00:01:20.834]which is one of the simplest forms of
- [00:01:22.757]machine learning. The goal in regression
- [00:01:24.973]is just to predict a continuous value. So in
- [00:01:28.289]this case, we used the California dataset,
- [00:01:30.935]which is a dataset that has all these
- [00:01:33.539]statistics about housing, including the
- [00:01:37.681]target value, which is median house value.
- [00:01:39.825]And then the model was fed this
- [00:01:42.086]information and then tried to guess the
- [00:01:44.603]median house value of certain districts
- [00:01:47.751]or regions in California. As you can see,
- [00:01:51.654]because the underlying relationship is in
- [00:01:54.093]reality pretty complicated, the linear
- [00:01:56.476]regression-type model was underfitting, but I did
- [00:01:59.657]learn quite a bit about being able to use
- [00:02:02.838]this in Python and quite a bit about
- [00:02:05.237]visualizing the data
- [00:02:07.101]in order to figure
- [00:02:08.024]out what kind of model you want to
- [00:02:09.672]fit it with.
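As a rough illustration of the regression step described above, here is a minimal sketch of fitting a linear model to the California housing data with scikit-learn. The loader, split, and metric below are assumptions about the setup, not necessarily what was actually used in the project.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# District-level statistics are the features; the target is median house value.
data = fetch_california_housing()
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

model = LinearRegression().fit(Xh_train, yh_train)

# High error on both splits would point to underfitting, as described above.
print("train MSE:", mean_squared_error(yh_train, model.predict(Xh_train)))
print("test MSE:", mean_squared_error(yh_test, model.predict(Xh_test)))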
- [00:02:13.854]The main dataset that I used throughout the summer was the wine
- [00:02:17.225]quality dataset, which is about 5,000
- [00:02:20.876]wines which were chemically analyzed
- [00:02:24.865]with all of the things that you can see
- [00:02:26.721]in front of you, along with the quality,
- [00:02:29.337]which was determined by some
- [00:02:31.484]professional wine tasters. And the goal
- [00:02:35.684]was to turn this into a binary
- [00:02:38.993]classification--so either good wine or
- [00:02:41.007]bad wine. And the way I did that was I
- [00:02:43.019]just used some threshold of the quality,
- [00:02:45.276]which originally ranged, I think, from 1 to 7. And so I
- [00:02:48.284]was able to turn it into either good or
- [00:02:50.307]bad wine. As you can see, the ratio of good
- [00:02:54.113]wines was higher than the ratio of bad
- [00:02:56.269]wines, which is what enabled me
- [00:02:59.611]to use this dataset as sort of an
- [00:03:02.022]analogy to rare classification.
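To make the thresholding step concrete, here is a minimal sketch of loading the wine quality data, binarizing the quality score, and checking the class ratio. The file name, column name, and threshold of 6 are assumptions for illustration; the actual threshold used in the project may have been different.

import pandas as pd

# Assumed file and separator for the UCI wine quality CSV; adjust to the actual data.
wine = pd.read_csv("winequality-white.csv", sep=";")

# Binarize: quality at or above the (assumed) threshold counts as good wine.
THRESHOLD = 6
wine["good"] = (wine["quality"] >= THRESHOLD).astype(int)

# The class ratio shows the imbalance; the minority class acts as the rare class.
print(wine["good"].value_counts(normalize=True))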
- [00:03:07.879]Again, I started here with logistic regression.
- [00:03:09.977]This is like the linear regression except
- [00:03:13.857]it is for classification. And in this
- [00:03:17.210]case, as you can see with two features,
- [00:03:19.401]it simply draws a line that maximizes
- [00:03:22.817]the separation between the classes. And
- [00:03:26.102]in this case it doesn't perform very well
- [00:03:29.102]because 0.668 is barely above, or maybe even
- [00:03:34.973]below, what you would get by always guessing the more common class.
- [00:03:38.726]But still, you can use this visualization
- [00:03:41.092]technique to see whether it would even
- [00:03:43.166]be possible to draw something that could
- [00:03:45.206]distinguish between the two classes. This
- [00:03:49.559]is logistic regression again, this time
- [00:03:51.878]using two different features than before.
- [00:03:53.900]And these ones, as you can see, are more
- [00:03:55.802]separable than the pair before it. But
- [00:03:57.571]still, the accuracy isn't that good.
- [00:04:00.379]You still have a higher accuracy than
- [00:04:02.320]guessing randomly every time, which is
- [00:04:05.054]better. Something else to note is that
- [00:04:07.759]you can use just two features at a time,
- [00:04:09.754]but you can also use all of
- [00:04:11.959]the features. So there were about 12
- [00:04:13.762]features in the wine quality dataset. And
- [00:04:16.490]using all of the features, there might be
- [00:04:18.629]some more complex way to classify the good
- [00:04:24.440]vs. bad wine, but it's a lot harder to
- [00:04:26.777]visualize because we have that
- [00:04:29.040]dimensionality problem with being unable
- [00:04:31.437]to graph all of the features at once. So
- [00:04:33.743]sometimes it was useful to graph two
- [00:04:35.771]features at once, even though they weren't
- [00:04:37.811]totally separable.
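Continuing the sketch above, this is one hedged way the two-features-versus-all-features comparison could look with scikit-learn's logistic regression. The specific feature pair is an assumption, chosen only for illustration.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Separate the features and the binary label from the earlier sketch.
X = wine.drop(columns=["quality", "good"])
y = wine["good"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Two features at a time (easy to plot as a 2-D scatter with the decision line)...
pair = ["alcohol", "volatile acidity"]  # assumed pair of features
clf_pair = LogisticRegression(max_iter=1000).fit(X_train[pair], y_train)
print("two features:", clf_pair.score(X_test[pair], y_test))

# ...versus all of the features at once, which is harder to visualize.
clf_all = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("all features:", clf_all.score(X_test, y_test))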
- [00:04:42.682]The next thing we tried was polynomial regression, which is actually
- [00:04:44.939]the same thing as logistic regression
- [00:04:46.790]except it adds some polynomial
- [00:04:49.539]combinations of the features, which is
- [00:04:52.824]really expensive and grows really quickly
- [00:04:55.220]if you have a lot of features. So I really
- [00:04:59.020]only did this with a few features at a
- [00:05:01.694]time and as you can see this is the same
- [00:05:03.928]set of features from before and it does
- [00:05:06.118]fit it a little bit better, but still
- [00:05:08.997]there is a lot of overlap here and it's
- [00:05:11.036]very difficult to find separation with
- [00:05:12.860]these kinds of models.
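Here is a minimal sketch of that polynomial idea, reusing the split and feature pair from the sketch above: the same logistic regression, but with polynomial combinations of a couple of features added in front. The degree is an assumption.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Polynomial combinations of a few features; the number of generated features
# grows quickly with the degree, which is why only a couple of features are used.
poly_clf = make_pipeline(
    PolynomialFeatures(degree=3),  # assumed degree
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
poly_clf.fit(X_train[pair], y_train)
print("polynomial features:", poly_clf.score(X_test[pair], y_test))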
- [00:05:17.406]The next one that we tried is a really powerful technique
- [00:05:19.669]called KNN, or K-nearest neighbors.
- [00:05:21.720]This is a very simple idea, but it can
- [00:05:25.523]fit very complex data and the way that it
- [00:05:28.452]does it is it just mimics what its
- [00:05:30.410]neighbors are doing. So you pick a data
- [00:05:32.378]point and it'll look around at its
- [00:05:34.106]neighbors and form some sort of
- [00:05:36.157]consensus and decide what's more likely.
- [00:05:39.149]The hyperparameters there are how you
- [00:05:42.714]measure distance between the points, and
- [00:05:44.970]also how many points you're looking
- [00:05:46.797]at around where the specific input is.
- [00:05:50.796]But you can see here there is some severe
- [00:05:53.526]overfitting, because what you have here
- [00:05:55.710]is a high training accuracy and a lower
- [00:05:58.190]test accuracy, which means that it's not
- [00:06:02.125]generalizing well. But still, this was
- [00:06:06.605]the highest accuracy that we'd gotten
- [00:06:09.771]so far.
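A hedged sketch of the K-nearest neighbors step, reusing the wine split from above. It shows the two hyperparameters mentioned, the distance metric and the number of neighbors, and the train/test gap that signals overfitting; the particular values are assumptions.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scaling matters for KNN because it is purely distance-based.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),  # assumed values
)
knn.fit(X_train, y_train)

# A large gap between these two numbers is the overfitting described above.
print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:", knn.score(X_test, y_test))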
- [00:06:12.373]The next one was a support vector machine. This one is also capable of very
- [00:06:16.738]complex classification.
- [00:06:20.161]And it also had very
- [00:06:21.161]good results. Again, I always did one
- [00:06:23.227]feature vs. another and then all of the
- [00:06:25.091]features at once because one vs. another
- [00:06:27.444]helps you see how it's distinguishing and
- [00:06:30.905]whether it's doing it properly. And using all of
- [00:06:33.568]the features might allow for more
- [00:06:35.709]complex ways of classifying it.
- [00:06:42.456]And the good thing about the support
- [00:06:45.825]vector machine is that it's much, much
- [00:06:48.520]cheaper than, say, polynomial regression,
- [00:06:52.160]but it can fit very complicated models
- [00:06:55.794]here. This one also had, I think,
- [00:06:58.708]the highest accuracy of the models that
- [00:07:00.761]we had until that point.
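A minimal sketch of the support vector machine step, again reusing the wine split. The RBF kernel and its settings are assumptions; the point is that a kernel SVM can fit complicated boundaries without the explicit, and expensive, polynomial feature expansion.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Kernel SVM: complicated decision boundaries without explicitly expanding
# polynomial combinations of the features.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),  # assumed settings
)
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))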
- [00:07:03.664]Here we tried a bunch of different
- [00:07:06.350]balancing techniques, because there were
- [00:07:08.264]more good wines than there were bad wines,
- [00:07:10.463]and there were a few techniques for trying
- [00:07:12.491]to compensate for that. One of them was
- [00:07:14.513]just weighting classes, which is just
- [00:07:16.309]adding more importance to the rarer
- [00:07:18.188]classes. But this doesn't perform very
- [00:07:20.687]well. And it didn't perform very well for
- [00:07:23.345]us either. And then there is
- [00:07:25.260]undersampling, which is
- [00:07:27.176]just feeding fewer examples of the
- [00:07:29.122]overrepresented class, and oversampling,
- [00:07:32.130]which is the opposite. And then there was
- [00:07:34.526]also an unsupervised learning method
- [00:07:37.103]called one-class SVM, which tries to
- [00:07:38.922]detect outliers, but this method is mostly
- [00:07:41.650]effective if classification can be made
- [00:07:47.734]via determining outliers, which isn't
- [00:07:49.996]always the case. So we see here with
- [00:07:54.610]sampling and no sampling, the accuracy
- [00:07:57.790]went down when we tried to do the sampling.
- [00:08:00.690]But the interesting thing was that the
- [00:08:02.866]precision and recall tradeoff, which are
- [00:08:05.029]two measures of the types of errors that
- [00:08:07.109]are being made, could be more equalized
- [00:08:09.090]using these techniques. So with no
- [00:08:11.590]sampling, there in the middle, you see
- [00:08:14.295]that it has a very high recall but a much
- [00:08:16.562]lower precision. And with both of the
- [00:08:18.467]sampling techniques, you see that those
- [00:08:20.467]recall and precision got a little closer
- [00:08:22.406]to each other, which is often the thing
- [00:08:24.294]that you want to do when trying to solve
- [00:08:26.853]this rare classification problem.
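To make the balancing comparison concrete, here is one hedged sketch of class weighting, undersampling, and oversampling, each scored by precision and recall on the rare class. The resampling below is simple random resampling of the training set, and the rare class is assumed to be the bad wines; the exact methods used in the project may have differed.

import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score
from sklearn.utils import resample

def report(name, clf, Xtr, ytr):
    # Fit on the (possibly resampled) training data; score on the untouched test set.
    clf.fit(Xtr, ytr)
    pred = clf.predict(X_test)
    # pos_label=0 treats bad wine as the rare class of interest (an assumption).
    print(name,
          "precision:", precision_score(y_test, pred, pos_label=0),
          "recall:", recall_score(y_test, pred, pos_label=0))

# 1) Class weighting: give the rarer class more importance in the loss.
report("weighted", SVC(class_weight="balanced"), X_train, y_train)

# 2) Undersampling: feed fewer examples of the overrepresented class.
maj, mino = X_train[y_train == 1], X_train[y_train == 0]
maj_down = resample(maj, n_samples=len(mino), replace=False, random_state=42)
X_under = pd.concat([maj_down, mino])
report("undersampled", SVC(), X_under, y_train.loc[X_under.index])

# 3) Oversampling: repeat examples of the rare class instead.
min_up = resample(mino, n_samples=len(maj), replace=True, random_state=42)
X_over = pd.concat([maj, min_up])
report("oversampled", SVC(), X_over, y_train.loc[X_over.index])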
- [00:08:29.605]This is just a list of the models that we
- [00:08:31.570]tried. I didn't go over the rest of them
- [00:08:33.761]because there wasn't too much time, but
- [00:08:35.806]the random forest, boosting and bagging,
- [00:08:38.077]and the ensemble classifier were also tried.
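For completeness, a short sketch of what those ensemble-style models could look like on the same wine split: a random forest plus a simple soft-voting ensemble over a few of the earlier classifiers. The particular members and settings are assumptions.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Random forest on the same wine split as above.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("random forest:", rf.score(X_test, y_test))

# Soft-voting ensemble over a few of the models tried earlier (assumed members).
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ],
    voting="soft",
)
vote.fit(X_train, y_train)
print("voting ensemble:", vote.score(X_test, y_test))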
- [00:08:40.606]The idea of this project was to utilize
- [00:08:46.046]these techniques that I tried and to
- [00:08:48.827]build a basis in these machine learning
- [00:08:53.505]models and the techniques used
- [00:08:55.691]here. The idea would be to use them on the
- [00:09:00.735]Snapshot Serengeti dataset, which is a
- [00:09:03.645]dataset of camera-trap images of animals
- [00:09:08.337]and the goal would be to classify the
- [00:09:10.940]animals and also what action the animals
- [00:09:15.226]are doing--if they're sleeping or if
- [00:09:18.792]they're eating or drinking or something
- [00:09:20.923]like that. And as you can see in this
- [00:09:23.760]graph there is a huge imbalance in the
- [00:09:26.095]dataset; these are the pictures with
- [00:09:28.125]animals in them. Zebras and wildebeest
- [00:09:31.212]make up over 50% of the pictures,
- [00:09:35.310]which means that these less common
- [00:09:38.445]animals are much more difficult to
- [00:09:40.325]classify and have higher rates of error.
- [00:09:42.753]So the goal would be to utilize the
- [00:09:45.929]techniques that I studied in order to
- [00:09:48.341]improve the accuracy--or precision-recall
- [00:09:51.712]tradeoff in this dataset. Thank you.