Reinforcement Learning Specialization - Coursera - course 4 - A Complete Reinforcement Learning System (Capstone)

From University of Alberta. My notes on course 4.
reinforcement learning
deepmind
coursera
Published: June 14, 2021

Coursera website: course 4 - A Complete Reinforcement Learning System (Capstone) of Reinforcement Learning Specialization

my notes on course 1 - Fundamentals of Reinforcement Learning, course 2 - Sample-based Learning Methods, course 3 - Prediction and Control with Function Approximation

specialization roadmap - course 4 - A Complete Reinforcement Learning System (Capstone) (syllabus)

Week 1 - Welcome to the Course
Week 2 - Formalize Word Problem as MDP
Week 3 - Choosing The Right Algorithm
Week 4 - Identify Key Performance Parameters
Week 5 - Implement Your Agent
Week 6 - Submit Your Parameter Study!

Course 4 - Week 2 - Formalize Word Problem as MDP

Final Project: Milestone 1

Video Initial Project Meeting with Martha: Formalizing the Problem

Video Andy Barto on What are Eligibility Traces and Why are they so named?

By the end of this video, you’ll understand the origin of the idea of eligibility traces and you’ll actually see that you’ve been using a variant of eligibility traces all along.

Project Resources

Video Let’s Review: Markov Decision Processes

By the end of this video, you’ll be able to understand Markov decision processes, or MDPs, and describe how the dynamics of an MDP are defined.

Video Let’s Review: Examples of Episodic and Continuing Tasks

By the end of this video, you will be able to understand when to formalize a task as episodic or continuing.

Assignment

MoonShot Technologies

notebooks in github

Course 4 - Week 3 - Choosing The Right Algorithm

Weekly Learning Goals

Video Meeting with Niko: Choosing the Learning Algorithm

Project Resources

Video Let’s Review: Expected Sarsa

Video Let’s Review: What is Q-learning?

Video Let’s Review: Average Reward - A New Way of Formulating Control Problems

Video Let’s Review: Actor-Critic Algorithm

Video Csaba Szepesvari on Problem Landscape

Video Andy and Rich: Advice for Students

Course 4 - Week 4 - Identify Key Performance Parameters

Weekly Learning Goals

Video Agent Architecture Meeting with Martha: Overview of Design Choices

Now, let’s discuss the meta-parameter choices that you will have to make to fully implement the agent: the function approximator, the optimizer for updating the action values, and how to do exploration.

Project Resources

Video Let’s Review: Non-linear Approximation with Neural Networks

By the end of this video, you will understand how neural networks do feature construction, and you will understand how neural networks are a non-linear function of state.

Video Drew Bagnell on System ID + Optimal Control

Video Susan Murphy on RL in Mobile Health

Course 4 - Week 5 - Implement Your Agent

Weekly Learning Goals

Video Meeting with Adam: Getting the Agent Details Right

Project Resources

Video Let’s Review: Optimization Strategies for NNs

By the end of this video, you will be able to understand the importance of initialization for neural networks and describe optimization techniques for training neural networks.

One simple yet effective initialization strategy is to randomly sample the initial weights from a normal distribution with small variance. This way, each neuron has a different output from the other neurons within its layer, which provides a more diverse set of potential features. By keeping the variance small, we ensure that the output of each neuron is within the same range as its neighbors. One downside to this strategy is that, as we add more inputs to a neuron, the variance of its output grows. We can get around this issue by scaling the variance of the weights by one over the square root of the number of inputs.
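A minimal NumPy sketch of this initialization scheme (the function name and layer sizes here are illustrative, not taken from the course code):

```python
import numpy as np

def init_layer_weights(n_inputs, n_outputs, rng):
    """Sample initial weights from a normal distribution whose standard
    deviation is scaled by 1 / sqrt(n_inputs), so each neuron's output
    variance stays roughly constant as the number of inputs grows."""
    scale = 1.0 / np.sqrt(n_inputs)
    weights = rng.normal(loc=0.0, scale=scale, size=(n_inputs, n_outputs))
    biases = np.zeros(n_outputs)  # biases are commonly initialized to zero
    return weights, biases

# Example: a hidden layer with 8 state features feeding 32 hidden units.
rng = np.random.default_rng(0)
W, b = init_layer_weights(n_inputs=8, n_outputs=32, rng=rng)
```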

Here’s the stochastic gradient descent update rule, and here’s the update modified to include momentum. Notice that it is similar to the regular stochastic gradient descent update plus an extra term, the momentum M. The momentum term summarizes the history of the gradients using a decaying sum of gradients with decay rate lambda. If recent gradients have all been in similar directions, then we gain momentum in that direction, and we make a larger step in that direction. If recent updates have conflicting directions, then they kill the momentum: the momentum term has little impact on the update and we make a regular gradient descent step. Momentum provably accelerates learning, meaning it reaches a stationary point more quickly.
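A minimal sketch of one common form of this update (heavy-ball momentum); the names are illustrative, and `decay` plays the role of the decay rate lambda:

```python
import numpy as np

def sgd_momentum_step(weights, grad, momentum, step_size=0.01, decay=0.9):
    """One stochastic gradient descent update with momentum.

    The momentum vector is a decaying sum of past gradients: when recent
    gradients point in similar directions it grows and the step gets larger;
    when they conflict it shrinks and the update looks like plain SGD."""
    momentum = decay * momentum + grad        # decaying sum of gradients
    weights = weights - step_size * momentum  # step along the accumulated direction
    return weights, momentum

# Usage: keep `momentum` (same shape as the weights, initialized to zeros)
# across updates and pass it back in on every step.
w = np.zeros(4)
m = np.zeros_like(w)
g = np.array([0.1, -0.2, 0.05, 0.0])  # a gradient estimate from one minibatch
w, m = sgd_momentum_step(w, g, m)
```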

So far, we have only talked about a single global scalar step size. This is well known to be problematic because it can result in updates that are too big for some weights and too small for others. Adapting the step size for each weight, based on statistics about the learning process, results in much better performance in practice. Now, how does the update change? The change is very simple. Instead of updating with a scalar alpha, there is a vector of step sizes, indexed by t to indicate that it can change on each time step. Each dimension of the gradient is scaled by its corresponding step size instead of the global step size. There are a variety of methods for adapting a vector of step sizes; you’ll get to implement one in your assignment.
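As a generic illustration of a vector of step sizes (an RMSProp-style rule, shown only as an example and not necessarily the exact method from the assignment):

```python
import numpy as np

def adaptive_step(weights, grad, avg_sq_grad, step_size=0.001,
                  decay=0.999, epsilon=1e-8):
    """Per-weight step sizes from a running average of squared gradients.

    Weights whose gradients have recently been large get a smaller effective
    step size, and weights with small gradients get a larger one."""
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    step_vector = step_size / (np.sqrt(avg_sq_grad) + epsilon)  # one step size per weight
    weights = weights - step_vector * grad
    return weights, avg_sq_grad
```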

Video Let’s Review: Expected Sarsa with Function Approximation

By the end of this video, you’ll be able to explain the update for expected Sarsa with function approximation, and explain the update for Q-learning with function approximation.

Video Let’s Review: Dyna & Q-learning in a Simple Maze

By the end of this video you will be able to describe how learning from both real and model experience impacts performance. You will also be able to explain how a model allows the agent to learn from fewer interactions with the environment.

Video Meeting with Martha: In-depth on Experience Replay

In Course 3, the agents you implemented updated the value function or policy only once with each sample. This is likely not the most sample-efficient way to use our data. You have actually seen a smarter approach in Course 2, where we talked about Dyna as a way to be more sample efficient, but we only discussed Dyna for the tabular setting.

In this video, we will talk about how to make your agent more sample efficient when using function approximation. We will discuss a simple method called experience replay and how it relates to Dyna. To get some intuition for experience replay, let’s first remember a method that we know well, Dyna-Q. The idea is to learn a model using sample experience. Then simulated experience can be obtained from this model to update the values. This procedure of using simulated experience to improve the value estimates is called planning.

Experience replay is a simple method for getting the advantages of Dyna. The basic idea is to save a buffer of experience and let the data be the model. We sample experience from this buffer and update the value function with those samples, similar to how we sample from the model and update the values in Dyna.
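A minimal sketch of such a buffer (the class below and the agent-step pseudocode in the comments are illustrative; the capstone notebook defines its own version):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, terminal) tuples.

    The stored data plays the role of the model in Dyna: we repeatedly sample
    past transitions and feed them through the usual value-function update."""

    def __init__(self, max_size, minibatch_size, seed=0):
        self.buffer = deque(maxlen=max_size)   # oldest transitions are dropped first
        self.minibatch_size = minibatch_size
        self.rng = random.Random(seed)

    def append(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self):
        return self.rng.sample(list(self.buffer), self.minibatch_size)

    def __len__(self):
        return len(self.buffer)

# Sketch of one agent step: store the real transition, then replay several
# minibatches, each acting like a planning update in Dyna.
#
#   buffer.append(s, a, r, s_next, terminal)
#   if len(buffer) >= buffer.minibatch_size:
#       for _ in range(num_replay_updates):
#           batch = buffer.sample()
#           update_action_values(batch)   # e.g., Expected Sarsa targets
```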

Video Martin Riedmiller on The ‘Collect and Infer’ framework for data-efficient RL

Martin Riedmiller, head of the control team at DeepMind, has been working for more than 20 years on new reinforcement learning agents for the control of dynamical systems.

The control of dynamical systems is an attractive application area for reinforcement learning controllers. They all share the same principal feedback control structure: a controller gets an observation, computes an action, and applies it to the environment. Classical control theory would first model the process as a set of differential equations, from which a control law must then be analytically derived, a tedious job, in particular if the system is complex or highly nonlinear. Reinforcement learning, in contrast, promises to learn the controller autonomously when only the overall control goal is specified, typically by defining the immediate reward. The RL controller then optimizes the expected cumulative sum of rewards over time.

These two steps together form the so-called collect-and-infer framework of reinforcement learning. This perspective keeps us focused on the two main questions of data-efficient RL: infer, which means squeezing the most out of a given set of transition data, and collect, which means sampling the most informative data from the environment.

Assignment

Implement your agent

notebooks in github

Course 4 - Week 6 - Submit Your Parameter Study!

Weekly Learning Goals

Video Meeting with Adam: Parameter Studies in RL

Project Resources

Video Let’s Review: Comparing TD and Monte Carlo

Video Joelle Pineau about RL that Matters

Assignment

Completing the parameter study

notebooks in github

Congratulations!

Video Meeting with Martha: Discussing Your Results

Video Course Wrap-up

Video Specialization Wrap-up