The Effects of Applying Multiple Constraints in Reinforcement Learning
Serigne Toure
Author
08/03/2020
Description
Video Poster Presentation 2020
Transcript
- [00:00:00.770]Hello, my name is Serigne Toure,
- [00:00:02.590]and I am studying the effects
- [00:00:04.020]of applying multiple constraints in reinforcement learning
- [00:00:06.800]with Dr. Stephen Scott, my research advisor,
- [00:00:12.060]and Eleanor Quint, the graduate student
- [00:00:17.210]who is helping me with this research.
- [00:00:20.720]So, as an introduction, this research
- [00:00:23.070]studies the relationship between reinforcement learning
- [00:00:25.830]agents and the constraints we apply to those agents.
- [00:00:30.360]Constraints are very important,
- [00:00:32.110]because they add a safety factor
- [00:00:34.060]to any reinforcement learning agent.
- [00:00:36.220]They can also potentially reduce the learning time,
- [00:00:38.350]save money, and improve the agent's learning.
- [00:00:43.450]And the purpose of this research
- [00:00:45.470]is first to make RL agents learn better,
- [00:00:52.530]which will eventually save on GPU hours and GPU time,
- [00:00:56.760]and also to improve safety in AI-controlled devices.
- [00:01:03.350]The motivation is that when we add
- [00:01:07.500]one or multiple constraints on, let's say,
- [00:01:10.810]a car equipped with AI, we want the car to know
- [00:01:13.780]that the action that it's about to take
- [00:01:16.150]is bad before it even takes that action.
- [00:01:20.390]For instance, we want to constrain a car
- [00:01:23.670]from running over a pedestrian, not let it learn
- [00:01:26.970]by experience that it shouldn't run over a pedestrian.
- [00:01:31.460]So, for some background, we define the environment
- [00:01:35.480]as where the agent is placed to learn.
- [00:01:39.970]For instance, if you have a self-driving car,
- [00:01:42.580]the environment would be the road or the city the car is in.
- [00:01:47.770]And the agent is the learner:
- [00:01:50.260]a player inside a game, or a car driving on the road.
- [00:01:55.420]The reward is the bonus the agent gets for performing well.
- [00:01:58.970]So, for instance,
- [00:02:01.620]the reward can be the score if the agent is inside a game,
- [00:02:05.170]and it can be predefined as anything
- [00:02:07.390]if the agent is on a street or anywhere else.
- [00:02:11.380]The constraints are the actions the agent cannot perform.
- [00:02:15.150]So, the agent is forbidden
- [00:02:20.150]from taking those actions.
- [00:02:22.730]And that's where the safety component comes in.
- [00:02:25.770]Because if the agent can't take those actions,
- [00:02:28.520]it just won't do things
- [00:02:30.640]that we already know are unsafe.
- [00:02:32.670]And the augmentation is the process that helps the agent
- [00:02:36.040]take the constraints into account.
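To make that idea concrete, here is a minimal sketch (my own illustration, not the augmentation method used in this project) of masking constrained actions out of an agent's choice; the names q_values and forbidden are hypothetical.

    import numpy as np

    def choose_safe_action(q_values, forbidden):
        """Pick the highest-value action that is not ruled out by a constraint.

        q_values : estimated value per action (hypothetical output of the agent's policy).
        forbidden : set of action indices the constraints mark as unsafe.
        """
        masked = q_values.astype(float)
        masked[list(forbidden)] = -np.inf  # constrained actions can never be selected
        return int(np.argmax(masked))

    # Example: 4 actions, where actions 1 and 3 are known to be unsafe.
    print(choose_safe_action(np.array([0.2, 0.9, 0.5, 1.1]), {1, 3}))  # prints 2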
- [00:02:37.780]Here we have a little diagram that shows
- [00:02:41.320]how the agent and the environment interact.
- [00:02:44.240]So, the agent performs actions on the environment,
- [00:02:47.330]and then the environment gives it information on the state
- [00:02:50.390]and the reward it earned for those actions.
- [00:02:52.670]So, from state zero, the agent performs action zero
- [00:02:57.140]and gets reward zero,
- [00:02:58.380]which leads to state one, where the agent
- [00:03:02.010]performs action one and gets reward one, et cetera.
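That interaction loop can be written in a few lines; this sketch uses the Gymnasium API with a random policy purely for illustration (it assumes the Atari environments are installed) and is not the actual training code from this study.

    import gymnasium as gym

    # Minimal agent-environment interaction loop: state -> action -> reward -> next state.
    env = gym.make("ALE/Breakout-v5")
    state, info = env.reset(seed=0)
    episode_return = 0.0

    for t in range(1000):
        action = env.action_space.sample()  # stand-in for the learned (and constrained) policy
        state, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        if terminated or truncated:
            state, info = env.reset()

    print("return with a random policy:", episode_return)
    env.close()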
- [00:03:05.860]So, for the materials and methods we use for this study,
- [00:03:09.380]we use the baselines library on GitHub
- [00:03:13.540]to run all the necessary tests and algorithms.
- [00:03:16.000]For the environments, we want to use Breakout and SeaQuest,
- [00:03:19.960]but for these results, we only have results for Breakout.
- [00:03:26.270]Those are the actual Atari games Breakout and SeaQuest.
- [00:03:30.590]As hyperparameters, we have a hyperparameter
- [00:03:34.340]called the R-value that randomizes
- [00:03:36.330]the learning process of the agent.
- [00:03:38.020]So, we use four different R-values:
- [00:03:40.090]0.01, 0.0025, 0.005, and 0.001.
- [00:03:46.930]And for each one of them, we ran 100 different agents
- [00:03:50.480]and tested 100 different agents for every single one.
- [00:03:55.662]And for the tests, for each environment,
- [00:03:57.370]we tested every R-value on that environment.
- [00:04:00.830]For each R-value, we generated 10 different seeds,
- [00:04:04.940]and each seed was tested 10 separate times,
- [00:04:09.010]which makes 100 runs per hyperparameter.
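As an illustrative sketch of that test grid (the function and environment names below are my own placeholders, not the project's actual scripts), the loop over R-values, seeds, and repeated runs could look like this:

    import itertools
    import random

    # Hypothetical sketch of the test grid described above:
    # 4 R-values x 10 seeds x 10 repeated runs = 100 runs per R-value.
    R_VALUES = [0.01, 0.0025, 0.005, 0.001]
    NUM_SEEDS = 10
    RUNS_PER_SEED = 10

    def run_agent(env_name, r_value, seed, rep):
        """Stand-in for training and evaluating one agent; returns a fake episode reward."""
        rng = random.Random(hash((env_name, r_value, seed, rep)))
        return rng.uniform(0, 400)

    def evaluate(env_name):
        """Average reward per R-value over every (seed, repetition) pair."""
        results = {}
        for r in R_VALUES:
            rewards = [run_agent(env_name, r, seed, rep)
                       for seed, rep in itertools.product(range(NUM_SEEDS), range(RUNS_PER_SEED))]
            results[r] = sum(rewards) / len(rewards)
        return results

    print(evaluate("Breakout"))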
- [00:04:13.496]And for the results, the early results we got show that
- [00:04:18.090]the baseline is the base library run,
- [00:04:20.430]where we get the base reward
- [00:04:22.190]with no constraints and no augmentation.
- [00:04:24.630]Here, with only the dithering constraint,
- [00:04:28.726]dithering being moving back and forth
- [00:04:31.130]again and again, we get around 270 reward
- [00:04:37.220]inside of the Breakout game.
- [00:04:39.740]With only the actuation constraint, actuation being
- [00:04:43.660]when the agent performs one thing again and again and again,
- [00:04:48.070]we get around 280 reward.
- [00:04:50.760]But with the combination of actuation
- [00:04:53.050]and dithering, the learning results
- [00:04:55.070]show that we go all the way up to 390
- [00:04:58.490]in the average reward gained, which is a big leap
- [00:05:01.180]from any of the rewards we got before.
- [00:05:03.150]So, this shows that it both adds safety
- [00:05:07.750]and also helps in the learning process.
- [00:05:12.100]And a combination of constraints that would help the agent
- [00:05:18.570]does in fact help the agent learn better,
- [00:05:22.910]while a combination that hurts the agent
- [00:05:25.860]might not make any difference
- [00:05:27.440]in the performance of the agent.
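For a concrete feel of the two constraints, here is one assumed way to flag them from the recent action history; the window length k and the helper names are illustrative only, not the implementation used in this study.

    def violates_actuation(history, k=4):
        """Actuation check: the same action repeated k times in a row (illustrative threshold)."""
        return len(history) >= k and len(set(history[-k:])) == 1

    def violates_dithering(history, k=4):
        """Dithering check: bouncing back and forth between two actions, e.g. A, B, A, B."""
        recent = history[-k:]
        return (len(recent) == k
                and len(set(recent)) == 2
                and all(recent[i] != recent[i + 1] for i in range(k - 1)))

    # Example action histories (0 = LEFT, 1 = RIGHT, 2 = FIRE):
    print(violates_actuation([1, 1, 1, 1]))   # True: one action again and again
    print(violates_dithering([0, 1, 0, 1]))   # True: moving back and forth
    print(violates_dithering([0, 1, 2, 1]))   # False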
- [00:05:29.410]So, in conclusion, we are concluding,
- [00:05:31.870]as of right now, that adding more constraints
- [00:05:34.010]helps in the learning process of the agent,
- [00:05:36.380]but it also helps with safety,
- [00:05:38.900]which is a given just by having the constraints.
- [00:05:42.920]And in the future, we want to design additional constraints
- [00:05:46.160]that are better suited for each environment that we test.
- [00:05:49.070]We want to run more tests in more environments,
- [00:05:52.210]in order to have more concrete results.
- [00:05:54.840]And we want to test with more constraints,
- [00:05:59.040]more than only two constraints,
- [00:06:01.220]just to see if there is a limit
- [00:06:02.910]on how many constraints we can have,
- [00:06:05.830]or if there is a limit
- [00:06:08.570]on how the agent performs after a certain number of constraints.
- [00:06:11.290]As an acknowledgement, I want to thank Dr. Stephen Scott,
- [00:06:14.250]who is my faculty mentor here,
- [00:06:17.910]Eleanor Quint, who has been helping me all this time,
- [00:06:21.020]and the UNL McNair Scholars Program.
- [00:06:23.480]Thank you for listening.