A Whirlwind Tour of ReinforcementLearning.jl

Welcome to the world of reinforcement learning in Julia. Now let's get started in 3 lines!

Prepare

First things first, download and install the latest stable version of Julia. ReinforcementLearning.jl is tested on all platforms, so just choose the one you are familiar with. If you already have Julia installed, please make sure it is 1.5.4 or above.

Another useful tool is TensorBoard. You don't need to install the whole TensorFlow to use TensorBoard; behind the scenes, ReinforcementLearning.jl uses TensorBoardLogger.jl to write data in a format that TensorBoard recognizes. You can install TensorBoard with the Python package installer pip via pip install tensorboard.
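If you are curious what TensorBoardLogger.jl does under the hood, here is a minimal, self-contained sketch of writing scalars that TensorBoard can display (the log directory name and the logged values are made up for illustration):

using TensorBoardLogger, Logging

# Write a few dummy scalars into ./tb_log; view them with `tensorboard --logdir tb_log`.
lg = TBLogger("tb_log")
with_logger(lg) do
    for step in 1:100
        @info "training" loss=1/step reward=step/10
    end
end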

Get Started

Run julia from the command line (or double-click the Julia executable) to start an interactive session (also known as a read-eval-print loop, or "REPL"). Then execute the following code:

] add ReinforcementLearning

using ReinforcementLearning

run(E`JuliaRL_BasicDQN_CartPole`)

So what's happening here?

  1. In the first line, typing ] brings you into the Pkg mode, and add ReinforcementLearning installs the latest version of ReinforcementLearning.jl. Remember to press backspace or ^C afterwards to return to the normal mode. All examples on this website are built with ReinforcementLearning version 0.8.0. Note that you may sometimes end up with an older version installed. The reason is that some packages in your current Julia environment have an outdated dependency, which forces a downgraded install of ReinforcementLearning.jl. You can confirm this by trying to install the latest master branch with ] add ReinforcementLearning#master. To work around the problem, create a temporary directory and activate a fresh Julia environment there with ] activate /path/to/tmp/dir.

  2. using ReinforcementLearning brings the names exported by ReinforcementLearning into the global scope. If this is your first time running it, you'll see precompiling ReinforcementLearning, which may take a while.

  3. The third line runs a predefined Experiment named JuliaRL_BasicDQN_CartPole. E`JuliaRL_BasicDQN_CartPole` is a handy command literal that instantiates a prebuilt experiment.

CartPole is considered one of the simplest environments for testing DRL (Deep Reinforcement Learning) algorithms. The state of the CartPole environment can be described with 4 numbers, and the actions are two integers (1 and 2). Before the game terminates, the agent gains a reward of +1 for each step. By default, the game is forced to terminate after 200 steps, so the maximum reward of an episode is 200.
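You can quickly check these numbers in the REPL (a short sketch; the concrete state values will differ on every run):

using ReinforcementLearning

env = CartPoleEnv()
state(env)    # a vector of 4 numbers describing the current state
actions(env)  # Base.OneTo(2), i.e. the two valid actions 1 and 2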

While the experiment is running, you'll see the following information and a progress bar. The information may be slightly different depending on your platform and your current working directory. Note that the first run will be slow. On a modern computer, the experiment should finish in about a minute.

This experiment uses three dense layers to approximate the Q value.
The testing environment is CartPoleEnv.

You can view the runtime logs with `tensorboard --logdir /home/runner/work/JuliaReinforcementLearning.github.io/JuliaReinforcementLearning.github.io/checkpoints/JuliaRL_BasicDQN_CartPole_2021_03_15_01_31_12/tb_log`.
Some useful statistics are stored in the `hook` field of this experiment.

Follow the instructions above and run tensorboard --logdir /the/path/shown/above; a link will be printed (typically http://YourHost:6006/). Open it in your browser, and you'll see a webpage similar to the following one:

Here two important variables are logged: training loss per update and total reward of each episode during training. As you can see, our agent can reach the maximum reward after training for about 4k steps.

Exercise

Now that you know how to run the experiment of the BasicDQN algorithm with the CartPole environment, you are encouraged to try some other experiments to compare the performance of different algorithms. For the full list of supported algorithms, please visit ReinforcementLearningZoo.jl.
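For instance, a few experiments that follow the same JuliaRL_<Algorithm>_<Environment> naming convention are listed below. These names are given only as examples and may vary between releases, so consult the list in ReinforcementLearningZoo.jl for your installed version:

run(E`JuliaRL_DQN_CartPole`)            # DQN with a separate target network
run(E`JuliaRL_PrioritizedDQN_CartPole`) # DQN with prioritized experience replay
run(E`JuliaRL_A2C_CartPole`)            # Advantage Actor-Critic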

Basic Components

Now let's take a closer look at what's in an experiment.

ReinforcementLearningCore.Experiment
├─ policy => ReinforcementLearningCore.Agent
│  ├─ policy => ReinforcementLearningCore.QBasedPolicy
│  │  ├─ learner => ReinforcementLearningZoo.BasicDQNLearner
│  │  │  ├─ approximator => ReinforcementLearningCore.NeuralNetworkApproximator
│  │  │  │  ├─ model => Flux.Chain
│  │  │  │  │  └─ layers
│  │  │  │  │     ├─ 1
│  │  │  │  │     │  └─ Flux.Dense
│  │  │  │  │     │     ├─ W => 128×4 Array{Float32,2}
│  │  │  │  │     │     ├─ b => 128-element Array{Float32,1}
│  │  │  │  │     │     └─ σ => typeof(NNlib.relu)
│  │  │  │  │     ├─ 2
│  │  │  │  │     │  └─ Flux.Dense
│  │  │  │  │     │     ├─ W => 128×128 Array{Float32,2}
│  │  │  │  │     │     ├─ b => 128-element Array{Float32,1}
│  │  │  │  │     │     └─ σ => typeof(NNlib.relu)
│  │  │  │  │     └─ 3
│  │  │  │  │        └─ Flux.Dense
│  │  │  │  │           ├─ W => 2×128 Array{Float32,2}
│  │  │  │  │           ├─ b => 2-element Array{Float32,1}
│  │  │  │  │           └─ σ => typeof(identity)
│  │  │  │  └─ optimizer => Flux.Optimise.ADAM
│  │  │  │     ├─ eta => 0.001
│  │  │  │     ├─ beta
│  │  │  │     │  ├─ 1
│  │  │  │     │  │  └─ 0.9
│  │  │  │     │  └─ 2
│  │  │  │     │     └─ 0.999
│  │  │  │     └─ state => IdDict
│  │  │  ├─ loss_func => typeof(Flux.Losses.huber_loss)
│  │  │  ├─ γ => 0.99
│  │  │  ├─ sampler => ReinforcementLearningCore.BatchSampler
│  │  │  │  └─ batch_size => 32
│  │  │  ├─ min_replay_history => 100
│  │  │  ├─ rng => StableRNGs.LehmerRNG
│  │  │  └─ loss => 0.0
│  │  └─ explorer => ReinforcementLearningCore.EpsilonGreedyExplorer
│  │     ├─ ϵ_stable => 0.01
│  │     ├─ ϵ_init => 1.0
│  │     ├─ warmup_steps => 0
│  │     ├─ decay_steps => 500
│  │     ├─ step => 1
│  │     ├─ rng => StableRNGs.LehmerRNG
│  │     └─ is_training => true
│  └─ trajectory => ReinforcementLearningCore.Trajectory
│     └─ traces => NamedTuple
│        ├─ state => 4×0 CircularArrayBuffers.CircularArrayBuffer{Float32,2}
│        ├─ action => 0-element CircularArrayBuffers.CircularArrayBuffer{Int64,1}
│        ├─ reward => 0-element CircularArrayBuffers.CircularArrayBuffer{Float32,1}
│        └─ terminal => 0-element CircularArrayBuffers.CircularArrayBuffer{Bool,1}
├─ env => ReinforcementLearningEnvironments.CartPoleEnv
├─ stop_condition => ReinforcementLearningCore.StopAfterStep
│  ├─ step => 10000
│  ├─ cur => 1
│  └─ progress => ProgressMeter.Progress
├─ hook => ReinforcementLearningCore.ComposedHook
│  └─ hooks
│     ├─ 1
│     │  └─ ReinforcementLearningCore.TotalRewardPerEpisode
│     │     ├─ rewards => 0-element Array{Float64,1}
│     │     └─ reward => 0.0
│     ├─ 2
│     │  └─ ReinforcementLearningCore.TimePerStep
│     │     ├─ times => 0-element CircularArrayBuffers.CircularArrayBuffer{Float64,1}
│     │     └─ t => 1153979825633
│     ├─ 3
│     │  └─ ReinforcementLearningCore.DoEveryNStep
│     │     ├─ f => ReinforcementLearningZoo.var"#334#338"
│     │     ├─ n => 1
│     │     └─ t => 0
│     └─ 4
│        └─ ReinforcementLearningCore.DoEveryNEpisode
│           ├─ f => ReinforcementLearningZoo.var"#336#340"
│           ├─ n => 1
│           └─ t => 0
└─ description => "This experiment uses three dense layers to approximate the Q value...."

At the highest level, each experiment contains the following four parts: the policy (an Agent in this case), the env, the stop_condition, and the hook.

The relation between the agent and the env is straightforward: the agent takes in an environment and feeds an action back. This process repeats until the stop condition is met. At each step, the agent needs to improve its policy in order to maximize the expected total reward.

When run(E`JuliaRL_BasicDQN_CartPole`) is executed, the call is dispatched to run(agent, env, stop_condition, hook). So it's just the same as running the following lines:

experiment     = E`JuliaRL_BasicDQN_CartPole`
agent          = experiment.policy
env            = experiment.env
stop_condition = experiment.stop_condition
hook           = experiment.hook

run(agent, env, stop_condition, hook)

Now let's explain these components one by one.

Stop Condition

A stop condition is used to determine when to stop an experiment. Two typical ones are StopAfterStep and StopAfterEpisode. As you may have seen, the above experiment uses StopAfterStep(10_000) as the stop condition. Try to change the stop condition and see if it works as expected.

experiment = E`JuliaRL_BasicDQN_CartPole`
run(experiment.policy, experiment.env, StopAfterEpisode(100), experiment.hook)

At some point, you may need to learn how to write a customized stop condition.
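To give a rough idea of what that involves: the built-in stop conditions such as StopAfterStep are plain callable objects that the run loop invokes as stop_condition(policy, env) after every step, stopping once they return true. Assuming that convention, a condition that stops once a cumulative reward threshold is reached (the threshold of 5_000 is arbitrary) could be sketched as follows:

# A sketch of a customized stop condition. Returning true stops the experiment.
mutable struct StopAfterTotalReward
    threshold::Float64
    total_reward::Float64
end

StopAfterTotalReward(threshold) = StopAfterTotalReward(threshold, 0.0)

function (s::StopAfterTotalReward)(policy, env)
    s.total_reward += reward(env)  # accumulate the reward observed so far
    s.total_reward >= s.threshold
end

experiment = E`JuliaRL_BasicDQN_CartPole`
run(experiment.policy, experiment.env, StopAfterTotalReward(5_000), experiment.hook)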

Hook

The concept of a hook in ReinforcementLearning.jl is mainly inspired by the two-way callbacks in FastAI:

A callback should be available at every single point that code can be run during training, so that a user can customise every single detail of the training method;

Every callback should be able to access every piece of information available at that stage in the training loop, including hyper-parameters, losses, gradients, input and target data, and so forth;

In fact, we extend the first kind of callback even further in ReinforcementLearning.jl. Thanks to multiple dispatch in Julia, we can easily customize the behavior of every detail in the training, testing, and evaluating stages.

You can check the list of provided hooks here. Two common hooks are TotalRewardPerEpisode and StepsPerEpisode.

experiment = E`JuliaRL_BasicDQN_CartPole`
hook = TotalRewardPerEpisode()
run(experiment.policy, experiment.env, experiment.stop_condition, hook)

using Plots  # plot is provided by Plots.jl
plot(hook.rewards)
Total reward of each episode during training.

Still wondering how the TensorBoard logging data is generated? Learn how to use TensorBoard and how to write a customized hook.
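To give a flavor of what a customized hook looks like: the built-in hooks are subtypes of AbstractHook whose methods receive a stage marker (for example PostEpisodeStage), the policy, and the environment at the corresponding points of the run loop. Assuming that interface, a hook that simply counts finished episodes could be sketched as:

# A sketch of a customized hook that counts how many episodes have finished.
Base.@kwdef mutable struct EpisodeCounter <: AbstractHook
    n::Int = 0
end

# Called by the run loop at the end of every episode.
(h::EpisodeCounter)(::PostEpisodeStage, policy, env) = h.n += 1

experiment = E`JuliaRL_BasicDQN_CartPole`
hook = EpisodeCounter()
run(experiment.policy, experiment.env, experiment.stop_condition, hook)
@show hook.n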

Agent

An agent is an instance of AbstractPolicy. It is a functional object which takes in an environment and returns an action.

action = agent(env)

In the above experiment, the agent is of type Agent, which is one of the most common policies in this package. We'll study how to create, modify, and update an agent in detail later. Suppose now we want to apply another policy to the cart pole environment, say a simple random policy. We can simply replace the first argument with RandomPolicy([1, 2]). Here [1, 2] are the valid actions of the CartPoleEnv.

using ReinforcementLearning

experiment = E`JuliaRL_BasicDQN_CartPole`

run(RandomPolicy([1,2]), experiment.env, experiment.stop_condition, experiment.hook)

println(experiment.description)

Just as you did above, you can now view the result based on the description of the experiment.

Environment

We've been using the CartPoleEnv for all the experiments above. But what does it look like? By printing it in the REPL, we can see a lot of information about it. Each of these pieces of information is clearly described in interface.jl.

env = CartPoleEnv()
# CartPoleEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`ReinforcementLearningBase.Space{Array{IntervalSets.Interval{:closed,:closed,Float64},1}}(IntervalSets.Interval{:closed,:closed,Float64}[-4.8..4.8, -1.0e38..1.0e38, -0.41887902047863906..0.41887902047863906, -1.0e38..1.0e38])`

## Action Space

`Base.OneTo(2)`

## Current State

```
[0.03159768120213327, -0.009896066548274705, 0.04003921212414911, -0.01918661021140875]
```

Some people coming from the Python world may be familiar with the APIs defined in OpenAI Gym. Ours are very similar for simple environments:

reset!(env)              # reset env to the initial state
state(env)               # get the state from environment, usually it's a tensor
reward(env)              # get the reward since last interaction with environment
is_terminated(env)       # check if the game is terminated or not
actions(env)             # valid actions
env(rand(actions(env)))  # update the environment's internal state given an action
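With just this handful of functions you can already write your own interaction loop. The following sketch (purely for illustration; run does the same job with many more extension points) plays one episode of CartPoleEnv with random actions and returns the total reward:

using ReinforcementLearning

# Play one episode with random actions and return the accumulated reward.
function random_episode(env)
    reset!(env)
    total_reward = 0.0
    while !is_terminated(env)
        env(rand(actions(env)))      # apply a random valid action
        total_reward += reward(env)  # accumulate the +1 per-step reward
    end
    return total_reward
end

random_episode(CartPoleEnv())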

However, our package has a more ambitious goal: to support much more complicated environments. You may take a look at ReinforcementLearningEnvironments.jl for more built-in examples. If you are interested in applying the algorithms in this package to your own problems, you may also read the detailed description of how to write a customized environment.

What's Next?

We have introduced the four main concepts in the ReinforcementLearning.jl package. I hope you have a better understanding of them now.

Corrections

If you see mistakes or want to suggest changes, please create an issue in the source repository.