https://rlvs.aniti.fr/

The RLVS schedule is reproduced below.

This condensed schedule does not include class breaks and social events. Times are Central European Summer Time (UTC+2).

| Date | Time | Session | Speaker(s) |
| --- | --- | --- | --- |
| March 25th | 9:00-9:10 | Opening remarks | S. Gerchinovitz |
| | 9:10-9:30 | RLVS Overview | E. Rachelson |
| | 9:30-13:00 | RL fundamentals | E. Rachelson |
| | 14:00-16:00 | Introduction to Deep Learning | D. Wilson |
| | 16:30-17:30 | Reward Processing Biases in Humans and RL Agents | I. Rish |
| | 17:45-18:45 | Introduction to Hierarchical Reinforcement Learning | D. Precup |
| March 26th | 10:00-12:00 | Stochastic bandits | T. Lattimore |
| | 14:00-16:00 | Monte Carlo Tree Search | T. Lattimore |
| | 16:30-17:30 | Multi-armed bandits in clinical trials | D. A. Berry |
| April 1st | 9:00-15:00 | Deep Q-Networks and its variants | B. Piot, C. Tallec |
| | 15:15-16:15 | Regularized MDPs | M. Geist |
| | 16:30-17:30 | Regret bounds of model-based reinforcement learning | M. Wang |
| April 2nd | 9:00-12:30 | Policy Gradients and Actor Critic methods | O. Sigaud |
| | 14:00-15:00 | Pitfalls in Policy Gradient methods | O. Sigaud |
| | 15:30-17:30 | Exploration in Deep RL | M. Pirotta |
| April 8th | 9:00-11:00 | Evolutionary Reinforcement Learning | D. Wilson, J.-B. Mouret |
| | 11:30-12:30 | Evolving Agents that Learn More Like Animals | S. Risi |
| | 14:00-16:00 | Micro-data Policy Search | K. Chatzilygeroudis, J.-B. Mouret |
| | 16:30-17:30 | Efficient Motor Skills Learning in Robotics | D. Lee |
| April 9th | 9:00-13:00 | RL tips and tricks | A. Raffin |
| | 14:30-15:30 | Symbolic representations and reinforcement learning | M. Garnelo |
| | 15:45-16:45 | Leveraging model-learning for extreme generalization | L. P. Kaelbling |
| | 17:00-18:00 | RLVS wrap-up | E. Rachelson |

(4/1/21) - Deep Q-Networks and its variants

The speaker is Bilal Piot.

The Deep Q-Network (DQN) is presented as a solution for making control practicable.

Introduction of the ALE (Arcade Learning Environment).

DQN is (almost) end-to-end: from raw observations to actions. Bilal explains the preprocessing part: frames go from 210x160x3 RGB to 84x84 grayscale, 4 consecutive frames are stacked, and frame skipping downsamples the control rate to 15 Hz.
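
A minimal sketch of that preprocessing (grayscale conversion, resize to 84x84, stack of 4 frames), assuming raw RGB frames from the ALE; the use of OpenCV and the `FrameStacker` helper are my own illustration, not the lecture's code.

```python
from collections import deque

import cv2  # used here only for illustration
import numpy as np


def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Turn a raw 210x160x3 Atari frame into an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)


class FrameStacker:
    """Keep the last 4 preprocessed frames as the agent's observation."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        processed = preprocess_frame(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(self.frames, axis=-1)  # shape (84, 84, 4)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(frame))
        return np.stack(self.frames, axis=-1)
```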

Value Iteration (VI): an iterative algorithm to compute $Q^*$, applying the Bellman optimality operator at each step: $Q_{k+1} = T^* Q_k$.
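
A toy tabular sketch of this update (the 3-state MDP below is made up; note that it assumes the transition probabilities and rewards are known).

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions, known model.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for k in range(200):
    # Bellman optimality operator: (T* Q)(s, a) = R(s, a) + gamma * E_s'[ max_a' Q(s', a') ]
    Q_next = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_next - Q)) < 1e-8:  # stop once the fixed point Q* is reached
        break
    Q = Q_next

print("Q* estimate:\n", Q)
```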

But it is not practical in a real-world case, where the dynamics are unknown and the state space is too large. What we can do instead is use interactions with the real world and estimate $Q^*$ with a regression on the collected transitions.
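
A minimal sketch of this regression view, in the spirit of fitted-$Q$ iteration (the transition format, the scikit-learn regressor and the hyperparameters are my own placeholders).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # any regressor would do


def fitted_q_iteration(transitions, n_actions, gamma=0.99, n_iter=50):
    """transitions: list of (state, action, reward, next_state, done) tuples."""
    s = np.array([t[0] for t in transitions])
    a = np.array([t[1] for t in transitions])
    r = np.array([t[2] for t in transitions])
    s_next = np.array([t[3] for t in transitions])
    done = np.array([t[4] for t in transitions], dtype=float)

    X = np.column_stack([s, a])  # regression inputs: (state, action)
    q = None
    for _ in range(n_iter):
        if q is None:
            targets = r  # first iteration: Q_1 = r
        else:
            # Bellman targets: r + gamma * max_a' Q_k(s', a')
            q_next = np.column_stack(
                [q.predict(np.column_stack([s_next, np.full(len(s_next), b)]))
                 for b in range(n_actions)]
            )
            targets = r + gamma * (1.0 - done) * q_next.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q
```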

It would be interesting to have the slides. I like the link drawn between the regression notation and the VI notation.

From Neural Fitted-$Q$ to DQN. The main differences are data collection (in DQN, interactions are collected with the continually updated policy, which allows exploration) and the size of the architecture.

With DQN we have an acting part and a learning part. Acting is the data collection, using an $\epsilon$-greedy policy.
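
A sketch of the acting side ($\epsilon$-greedy action selection feeding a replay buffer), assuming the classic gym `step` API; `q_network(state)` is a stand-in for the Q-value network.

```python
import random
from collections import deque

import numpy as np

replay_buffer = deque(maxlen=100_000)  # filled by acting, sampled by learning


def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


def act(env, q_network, state, epsilon):
    """One acting step: choose an action, store the transition for the learner."""
    action = epsilon_greedy(q_network(state), epsilon)
    next_state, reward, done, _ = env.step(action)  # classic gym API
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, done
```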

Hands-on session based on the DQN tutorial notebook.

Had to `export LD_LIBRARY_PATH=/home/explore/miniconda3/envs/aniti/lib/`.

Nice introduction to JAX and Haiku. Haiku plays a role similar to PyTorch modules, and can turn a neural network into a pure function, which is what JAX needs.
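
A minimal example of what "turning a network into a pure function" looks like with `hk.transform` (the MLP sizes and observation shape are arbitrary).

```python
import haiku as hk
import jax
import jax.numpy as jnp


def q_network(obs):
    """A small MLP mapping an observation to one Q-value per action."""
    return hk.nets.MLP([64, 64, 4])(obs)


# hk.transform turns the stateful-looking function above into a pair of
# pure functions: init (creates the parameters) and apply (uses them).
net = hk.transform(q_network)

rng = jax.random.PRNGKey(0)
dummy_obs = jnp.zeros((1, 8))      # arbitrary observation size
params = net.init(rng, dummy_obs)  # pure: parameters are returned, not stored
q_values = net.apply(params, rng, dummy_obs)
```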

Overview of the literature.

(4/2/21) - From Policy Gradients to Actor Critic methods

Olivier Sigaud is the speaker.

He has pre-recorded his lecture as videos. I missed the start, so I will have to watch them later.

Policy Gradient in practice

Don’t become an alchemist ;)

Among stochastic policies, the squashed Gaussian is interesting because it handles continuous actions while enforcing bounds.
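
A sketch of how a squashed Gaussian action can be sampled (Gaussian sample, tanh squashing, rescaling to the bounds); the mean/std values below are placeholders, and computing log-probabilities would additionally require the tanh change-of-variables correction.

```python
import numpy as np


def sample_squashed_gaussian(mean, log_std, low, high, rng=np.random.default_rng()):
    """Sample a bounded continuous action from a squashed Gaussian."""
    u = rng.normal(mean, np.exp(log_std))               # unbounded Gaussian sample
    squashed = np.tanh(u)                               # now in (-1, 1)
    return low + 0.5 * (squashed + 1.0) * (high - low)  # rescale to [low, high]


# e.g. a 2-D action bounded in [-2, 2] (placeholder numbers)
action = sample_squashed_gaussian(
    mean=np.zeros(2), log_std=np.full(2, -0.5), low=-2.0, high=2.0
)
```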

Exploration in Deep RL

(4/8/21) - Evolutionary Reinforcement Learning

A PDF version of the slides is available here.

Then: Evolving Agents that Learn More Like Animals (S. Risi).

This morning was more about what we can do when we have (nearly) unlimited computing power and data.

The afternoon will be the opposite: what can be done with only a handful of trials.

Most policy search algorithms require thousands of training episodes to find an effective policy, which is often infeasible when experiments take time or are expensive (for instance, with a physical robot or with an aerodynamics simulator). This class focuses on the extreme other end of the spectrum: how can an algorithm adapt a policy with only a handful of trials (a dozen) and a few minutes? By analogy with the term “big data”, we refer to this challenge as “micro-data reinforcement learning”. We will describe two main strategies: (1) leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators), and (2) create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or of the dynamical model (e.g., model-based policy search), so that the policy optimizer queries the model instead of the real system. Most of the examples will be about robotic systems, but the principles apply to any other expensive setup.
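
A toy sketch of strategy (2): a Gaussian-process surrogate of the expected return, queried instead of the real system. The 1-D policy parameter, the UCB acquisition rule and `evaluate_on_robot` are made up for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def evaluate_on_robot(theta: float) -> float:
    """Placeholder for one expensive episode on the real system."""
    return -(theta - 0.3) ** 2 + 0.05 * np.random.randn()


thetas = [0.0, 0.5, 1.0]                        # a handful of initial trials
returns = [evaluate_on_robot(t) for t in thetas]

for trial in range(10):                         # micro-data budget: ~a dozen episodes
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(thetas).reshape(-1, 1), returns)

    # Query the cheap surrogate on a grid instead of the robot (UCB acquisition).
    grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    theta_next = float(grid[np.argmax(mu + sigma)])

    thetas.append(theta_next)
    returns.append(evaluate_on_robot(theta_next))  # only this line touches the real system

print("best policy parameter found:", thetas[int(np.argmax(returns))])
```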

all material: https://rl-vs.github.io/rlvs2021/micro-data.html

(4/9/21) - RL in Practice: Tips and Tricks and Practical Session With Stable-Baselines3

Abstract: The aim of the session is to help you do reinforcement learning experiments. The first part covers general advice about RL, tips and tricks and details three examples where RL was applied on real robots. The second part will be a practical session using the Stable-Baselines3 library.

Pre-requisites: Python programming, RL basics (recommended: a Google account for the practical session, in order to use Google Colab).

Additional material:

- Website: https://github.com/DLR-RM/stable-baselines3
- Doc: https://stable-baselines3.readthedocs.io/en/master/

Outline:

Part I: RL Tips and Tricks / The Challenges of Applying RL to Real Robots

  1. Introduction (3 minutes)
  2. RL Tips and tricks (45 minutes)
    1. General Nuts and Bolts of RL experimentation (10 minutes)
    2. RL in practice on a custom task (custom environment) (30 minutes)
    3. Questions? (5 minutes)
  3. The Challenges of Applying RL to Real Robots (45 minutes)
    1. Learning to control an elastic robot - DLR David Neck Example (15 minutes)
    2. Learning to drive in minutes and learning to race in hours - Virtual and real racing car (15 minutes)
    3. Learning to walk with an elastic quadruped robot - DLR bert example (10 minutes)
    4. Questions? (5 minutes+)

Part II: Practical Session with Stable-Baselines3

  1. Stable-Baselines3 Overview (20 minutes)
  2. Questions? (5 minutes)
  3. Practical Session - Code along (1h+)

action space

When using a continuous action space, you need to normalize! (normalized action space in [-1, 1])

There is a checker for that in Stable-Baselines3.
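
A sketch of how the checker can be used on a custom environment with a normalized, symmetric action space (the environment below is a made-up minimal example).

```python
import gym
import numpy as np
from stable_baselines3.common.env_checker import check_env


class MyRobotEnv(gym.Env):
    """Made-up minimal custom env with a normalized, symmetric action space."""

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)

    def reset(self):
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        obs = np.zeros(4, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info


check_env(MyRobotEnv())  # warns, among other things, about non-normalized action spaces
```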

reward

start with reward shaping.
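
A tiny illustration of the idea, contrasting a sparse success reward with a shaped, distance-based one (the goal-reaching setting is made up).

```python
import numpy as np


def sparse_reward(position, goal, tol=0.05):
    """1 only when the goal is reached: a very rare learning signal."""
    return 1.0 if np.linalg.norm(position - goal) < tol else 0.0


def shaped_reward(position, goal):
    """Dense signal: the closer to the goal, the higher the reward."""
    return -float(np.linalg.norm(position - goal))
```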

termination condition

early stopping makes learning faster (and safer for robots)

for hyperparameter tuning, Antonin recommends Optuna.
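
A minimal Optuna sketch for tuning two hyperparameters of a Stable-Baselines3 agent (the algorithm, environment, search ranges and budgets are placeholders).

```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    # Sample hyperparameters (ranges are placeholders).
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.95, 0.999)

    model = PPO("MlpPolicy", "CartPole-v1", learning_rate=lr, gamma=gamma, verbose=0)
    model.learn(total_timesteps=10_000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=5)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best hyperparameters:", study.best_params)
```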

About the Henderson et al. paper, “Deep Reinforcement Learning that Matters” (on reproducibility and the variance of results in deep RL).

The controller then uses a latent representation, plus the current speed and a short history, as its observation space.

Learning to drive then takes about 10 minutes, and learning to race about 2 hours.

Hands-on

slides: https://araffin.github.io/slides/rlvs-sb3-handson/

notebook: https://github.com/araffin/rl-handson-rlvs21

RL zoo: https://github.com/DLR-RM/rl-baselines3-zoo

Documentation for SB3, useful for completing the exercises: https://stable-baselines3.readthedocs.io/en/master/
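
For reference, a minimal Stable-Baselines3 train-and-evaluate snippet of the kind the exercises build on (PPO and CartPole are arbitrary choices here, not necessarily what the notebook uses).

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train a small agent (PPO and CartPole chosen arbitrarily for this sketch).
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=20_000)

# Evaluate the trained policy.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```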

https://excalidraw.com/