The Effects of Applying Multiple Constraints in Reinforcement Learning
Serigne Toure
Author
08/03/2020
Description
Video Poster Presentation 2020
Transcript
- [00:00:00.770]Hello, my name is Serigne Toure,
- [00:00:02.590]and I am studying the effects
- [00:00:04.020]of applying multiple constraints in reinforcement learning
- [00:00:06.800]with Dr. Stephen Scott, my research advisor,
- [00:00:12.060]and Eleanor Quint, the graduate student
- [00:00:17.210]who is helping me with this research.
- [00:00:20.720]So, as an introduction, this research
- [00:00:23.070]studies the relationship between reinforcement learning
- [00:00:25.830]agents and the constraints we apply to those agents.
- [00:00:30.360]Constraints are very important,
- [00:00:32.110]because they add a safety factor
- [00:00:34.060]to any reinforcement learning agent.
- [00:00:36.220]They can also potentially reduce the learning time,
- [00:00:38.350]save money, and improve the agent's learning.
- [00:00:43.450]And the purpose of this research
- [00:00:45.470]is first to make RL agents learn better,
- [00:00:52.530]which will eventually save on GPU hours and GPU time,
- [00:00:56.760]and also to improve safety in AI-controlled devices.
- [00:01:03.350]The motivation is that when we add
- [00:01:07.500]one or multiple constraints on, let's say,
- [00:01:10.810]a car equipped with AI, we want the car to know
- [00:01:13.780]that the action that it's about to take
- [00:01:16.150]is bad before it even takes that action.
- [00:01:20.390]For instance, we want to constrain a car
- [00:01:23.670]from running over a pedestrian, not let it learn
- [00:01:26.970]by experience that it shouldn't run over a pedestrian.
- [00:01:31.460]So, for some background, we define the environment
- [00:01:35.480]as where the agent is placed to learn.
- [00:01:39.970]For instance, if you have a self-driving car,
- [00:01:42.580]the environment would be the road or the city the car is in.
- [00:01:47.770]And the agent is the learner:
- [00:01:50.260]a player inside a game, or a car driving on the road.
- [00:01:55.420]The reward is the bonus the agent gets for performing well.
- [00:01:58.970]So, for instance,
- [00:02:01.620]the reward can be the score if the agent is inside a game,
- [00:02:05.170]and it can be predefined as anything
- [00:02:07.390]if the agent is on a street or anywhere else.
- [00:02:11.380]The constraints are the actions the agent cannot perform.
- [00:02:15.150]So, the agent is forbidden
- [00:02:20.150]from taking those actions.
- [00:02:22.730]And that's where the safety component comes in.
- [00:02:25.770]Because if the agent can't take those actions,
- [00:02:28.520]it just won't do things
- [00:02:30.640]that we already know are unsafe.
- [00:02:32.670]And the augmentation is the process that helps the agent
- [00:02:36.040]take the constraints into account.
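To make that idea concrete, here is a minimal sketch (my own illustration, not the augmentation method used in this project) of masking constrained actions out of an agent's choice; the names q_values and forbidden are hypothetical.

    import numpy as np

    def choose_safe_action(q_values, forbidden):
        """Pick the highest-value action that is not ruled out by a constraint.

        q_values : estimated value per action (hypothetical output of the agent's policy).
        forbidden : set of action indices the constraints mark as unsafe.
        """
        masked = q_values.astype(float)
        masked[list(forbidden)] = -np.inf  # constrained actions can never be selected
        return int(np.argmax(masked))

    # Example: 4 actions, where actions 1 and 3 are known to be unsafe.
    print(choose_safe_action(np.array([0.2, 0.9, 0.5, 1.1]), {1, 3}))  # prints 2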
- [00:02:37.780]Here we have a little diagram that shows
- [00:02:41.320]how the agent and the environment interact.
- [00:02:44.240]So, the agent performs actions on the environment,
- [00:02:47.330]and then the environment gives it information on the state
- [00:02:50.390]and the reward it earned for those actions.
- [00:02:52.670]So, from state zero, the agent performs action zero
- [00:02:57.140]and gets reward zero,
- [00:02:58.380]which leads to state one, where the agent
- [00:03:02.010]performs action one and gets reward one, et cetera.
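That interaction loop can be written in a few lines; this sketch uses the Gymnasium API with a random policy purely for illustration (it assumes the Atari environments are installed) and is not the actual training code from this study.

    import gymnasium as gym

    # Minimal agent-environment interaction loop: state -> action -> reward -> next state.
    env = gym.make("ALE/Breakout-v5")
    state, info = env.reset(seed=0)
    episode_return = 0.0

    for t in range(1000):
        action = env.action_space.sample()  # stand-in for the learned (and constrained) policy
        state, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        if terminated or truncated:
            state, info = env.reset()

    print("return with a random policy:", episode_return)
    env.close()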
- [00:03:05.860]So, for the materials and methods we use for this study,
- [00:03:09.380]we use the baselines library on GitHub
- [00:03:13.540]to run all the necessary tests and algorithms.
- [00:03:16.000]For the environments, we want to use Breakout and SeaQuest,
- [00:03:19.960]but for these results, we only have results for Breakout.
- [00:03:26.270]Those are the actual Atari games Breakout and SeaQuest.
- [00:03:30.590]As hyperparameters, we have a hyperparameter
- [00:03:34.340]called the R-value that randomizes
- [00:03:36.330]the learning process of the agent.
- [00:03:38.020]So, we use four different R-values:
- [00:03:40.090]0.01, 0.0025, 0.005, and 0.001.
- [00:03:46.930]And for each one of them, we ran 100 different agents
- [00:03:50.480]and tested 100 different agents for every single one.
- [00:03:55.662]And for the tests, for each environment,
- [00:03:57.370]we tested every R-value on that environment.
- [00:04:00.830]For each R-value, we generated 10 different seeds,
- [00:04:04.940]and each seed was tested 10 separate times,
- [00:04:09.010]which makes 100 runs per hyperparameter.
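As an illustrative sketch of that test grid (the function and environment names below are my own placeholders, not the project's actual scripts), the loop over R-values, seeds, and repeated runs could look like this:

    import itertools
    import random

    # Hypothetical sketch of the test grid described above:
    # 4 R-values x 10 seeds x 10 repeated runs = 100 runs per R-value.
    R_VALUES = [0.01, 0.0025, 0.005, 0.001]
    NUM_SEEDS = 10
    RUNS_PER_SEED = 10

    def run_agent(env_name, r_value, seed, rep):
        """Stand-in for training and evaluating one agent; returns a fake episode reward."""
        rng = random.Random(hash((env_name, r_value, seed, rep)))
        return rng.uniform(0, 400)

    def evaluate(env_name):
        """Average reward per R-value over every (seed, repetition) pair."""
        results = {}
        for r in R_VALUES:
            rewards = [run_agent(env_name, r, seed, rep)
                       for seed, rep in itertools.product(range(NUM_SEEDS), range(RUNS_PER_SEED))]
            results[r] = sum(rewards) / len(rewards)
        return results

    print(evaluate("Breakout"))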
- [00:04:13.496]And for the results, the early results we got show that
- [00:04:18.090]the baseline is the base library run,
- [00:04:20.430]where we get the base reward
- [00:04:22.190]with no constraints and no augmentation.
- [00:04:24.630]Here, with only the dithering constraint,
- [00:04:28.726]dithering being moving back and forth
- [00:04:31.130]again and again, we get around 270 reward
- [00:04:37.220]inside of the Breakout game.
- [00:04:39.740]With only the actuation constraint, actuation being
- [00:04:43.660]when the agent performs one thing again and again and again,
- [00:04:48.070]we get around 280 reward.
- [00:04:50.760]But with the combination of actuation
- [00:04:53.050]and dithering, the learning results
- [00:04:55.070]show that we go all the way up to 390
- [00:04:58.490]in the average reward gained, which is a big leap
- [00:05:01.180]from any of the rewards we got before.
- [00:05:03.150]So, this shows that it both adds safety
- [00:05:07.750]and also helps in the learning process.
- [00:05:12.100]And a combination of constraints that would help the agent
- [00:05:18.570]does in fact help the agent learn better,
- [00:05:22.910]while a combination that hurts the agent
- [00:05:25.860]might not make any difference
- [00:05:27.440]in the performance of the agent.
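For a concrete feel of the two constraints, here is one assumed way to flag them from the recent action history; the window length k and the helper names are illustrative only, not the implementation used in this study.

    def violates_actuation(history, k=4):
        """Actuation check: the same action repeated k times in a row (illustrative threshold)."""
        return len(history) >= k and len(set(history[-k:])) == 1

    def violates_dithering(history, k=4):
        """Dithering check: bouncing back and forth between two actions, e.g. A, B, A, B."""
        recent = history[-k:]
        return (len(recent) == k
                and len(set(recent)) == 2
                and all(recent[i] != recent[i + 1] for i in range(k - 1)))

    # Example action histories (0 = LEFT, 1 = RIGHT, 2 = FIRE):
    print(violates_actuation([1, 1, 1, 1]))   # True: one action again and again
    print(violates_dithering([0, 1, 0, 1]))   # True: moving back and forth
    print(violates_dithering([0, 1, 2, 1]))   # False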
- [00:05:29.410]So, in conclusion, we are concluding,
- [00:05:31.870]as of right now, that adding more constraints
- [00:05:34.010]helps in the learning process of the agent,
- [00:05:36.380]but it also helps with safety,
- [00:05:38.900]which is a given just by having the constraints.
- [00:05:42.920]And in the future, we want to design additional constraints
- [00:05:46.160]that are better suited for each environment that we test.
- [00:05:49.070]We want to run more tests in more environments,
- [00:05:52.210]in order to have more concrete results.
- [00:05:54.840]And we want to test with more constraints,
- [00:05:59.040]more than only two constraints,
- [00:06:01.220]just to see if there is a limit
- [00:06:02.910]on how many constraints we can have,
- [00:06:05.830]or if there is a limit
- [00:06:08.570]on how the agent performs after a certain number of constraints.
- [00:06:11.290]As an acknowledgement, I want to thank Dr. Stephen Scott,
- [00:06:14.250]who is my faculty mentor here,
- [00:06:17.910]Eleanor Quint, who has been helping me all this time,
- [00:06:21.020]and the UNL McNair Scholars Program.
- [00:06:23.480]Thank you for listening.