First things first, download and install the latest stable version of Julia. ReinforcementLearning.jl is tested on all platforms, so just choose the one you are familiar with. If you already have Julia installed, please make sure that it is version 1.5.3 or above.
Another useful tool is TensorBoard. (You don't need to install the whole TensorFlow to use TensorBoard; behind the scenes, ReinforcementLearning.jl uses TensorBoardLogger.jl to write data in a format that TensorBoard recognizes.) You can install it with the Python package installer pip via pip install tensorboard.
Run julia
in the command line (or double-click the Julia executable) and now you are in an interactive session (also known as a read-eval-print loop or "REPL"). Then execute the following code:
] add ReinforcementLearning
using ReinforcementLearning
run(E`JuliaRL_BasicDQN_CartPole`)
So what's happening here?
In the first line, typing ] brings you to the Pkg mode, and add ReinforcementLearning installs the latest version of ReinforcementLearning.jl for you. Then remember to press backspace or ^C to get back to the normal mode. All examples on this website are built with ReinforcementLearning version 0.8.0. Note that you may sometimes end up with an older version installed. The usual reason is that some packages in your current Julia environment have an outdated dependency, which forces a downgraded install of ReinforcementLearning.jl. You can confirm this by trying to install the latest master branch with ] add ReinforcementLearning#master. To solve the problem, create a temporary directory and activate a Julia environment there with ] activate /path/to/tmp/dir.
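If you prefer doing this programmatically, the Pkg API equivalent of those commands is roughly the following (a minimal sketch; the temporary directory is created on the fly):
```
using Pkg

Pkg.activate(mktempdir())              # activate a fresh, empty environment
Pkg.add("ReinforcementLearning")       # now resolves to the latest release
Pkg.status("ReinforcementLearning")    # confirm which version got installed
```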
using ReinforcementLearning brings the names exported from ReinforcementLearning into the global scope. If this is your first time running it, you'll see a message about precompiling ReinforcementLearning, and it may take a while.
The third line means: run a predefined Experiment named JuliaRL_BasicDQN_CartPole. (The E`JuliaRL_BasicDQN_CartPole` syntax is a handy command literal that instantiates a prebuilt experiment.)
CartPole is considered one of the simplest environments for testing DRL (deep reinforcement learning) algorithms. The state of the CartPole environment can be described with 4 numbers, and the actions are two integers (1 and 2). Before the game terminates, the agent gains a reward of +1 at each step. By default, the game is forced to terminate after 200 steps, so the maximum reward of an episode is 200.
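You can check these properties yourself in the REPL. The snippet below is a quick sketch using the environment interface functions introduced later on this page:
```
using ReinforcementLearning

env = CartPoleEnv()
length(state(env))   # 4 numbers describe the state
actions(env)         # the two valid actions, 1 and 2
```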
While the experiment is running, you'll see the following information and a progress bar. The information may differ slightly depending on your platform and current working directory. Note that the first run will be slow; on a modern computer, the experiment should finish within a minute.
This experiment uses three dense layers to approximate the Q value.
The testing environment is CartPoleEnv.
You can view the runtime logs with `tensorboard --logdir /home/runner/work/JuliaReinforcementLearning.github.io/JuliaReinforcementLearning.github.io/checkpoints/JuliaRL_BasicDQN_CartPole_2021_02_22_02_44_54/tb_log`.
Some useful statistics are stored in the `hook` field of this experiment.
Follow the instructions above and run tensorboard --logdir /the/path/shown/above; a link will then be printed (typically http://YourHost:6006/). Open it in your browser and you'll see a webpage similar to the following one:
Here two important variables are logged: the training loss per update and the total reward of each episode during training. As you can see, our agent reaches the maximum reward after training for about 4k steps. Now that you know how to run the BasicDQN experiment with the CartPole environment, try some of the other experiments below to compare the performance of different algorithms (a short example follows the list). For the full list of supported algorithms, please visit ReinforcementLearningZoo.jl:
E`JuliaRL_BasicDQN_CartPole`
E`JuliaRL_DQN_CartPole`
E`JuliaRL_PrioritizedDQN_CartPole`
E`JuliaRL_Rainbow_CartPole`
E`JuliaRL_IQN_CartPole`
E`JuliaRL_A2C_CartPole`
E`JuliaRL_A2CGAE_CartPole`
E`JuliaRL_PPO_CartPole`
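Each of these runs with the same one-liner as before. For example, to compare DQN and PPO against BasicDQN you could run the following (a sketch; expect longer training times for some algorithms):
```
using ReinforcementLearning

run(E`JuliaRL_DQN_CartPole`)     # same workflow, different algorithm
run(E`JuliaRL_PPO_CartPole`)
```
If an experiment writes tensorboard logs (as the BasicDQN one above does), you can compare the total-reward curves side by side.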
Now let's take a closer look at what's in an experiment.
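One way is to instantiate the experiment in the REPL and let it display itself; the tree below is what that printed structure looks like (field values such as RNG states and timers will differ on your machine):
```
using ReinforcementLearning

experiment = E`JuliaRL_BasicDQN_CartPole`
experiment   # the REPL display prints the tree below
```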
ReinforcementLearningCore.Experiment
├─ policy => ReinforcementLearningCore.Agent
│ ├─ policy => ReinforcementLearningCore.QBasedPolicy
│ │ ├─ learner => ReinforcementLearningZoo.BasicDQNLearner
│ │ │ ├─ approximator => ReinforcementLearningCore.NeuralNetworkApproximator
│ │ │ │ ├─ model => Flux.Chain
│ │ │ │ │ └─ layers
│ │ │ │ │ ├─ 1
│ │ │ │ │ │ └─ Flux.Dense
│ │ │ │ │ │ ├─ W => 128×4 Array{Float32,2}
│ │ │ │ │ │ ├─ b => 128-element Array{Float32,1}
│ │ │ │ │ │ └─ σ => typeof(NNlib.relu)
│ │ │ │ │ ├─ 2
│ │ │ │ │ │ └─ Flux.Dense
│ │ │ │ │ │ ├─ W => 128×128 Array{Float32,2}
│ │ │ │ │ │ ├─ b => 128-element Array{Float32,1}
│ │ │ │ │ │ └─ σ => typeof(NNlib.relu)
│ │ │ │ │ └─ 3
│ │ │ │ │ └─ Flux.Dense
│ │ │ │ │ ├─ W => 2×128 Array{Float32,2}
│ │ │ │ │ ├─ b => 2-element Array{Float32,1}
│ │ │ │ │ └─ σ => typeof(identity)
│ │ │ │ └─ optimizer => Flux.Optimise.ADAM
│ │ │ │ ├─ eta => 0.001
│ │ │ │ ├─ beta
│ │ │ │ │ ├─ 1
│ │ │ │ │ │ └─ 0.9
│ │ │ │ │ └─ 2
│ │ │ │ │ └─ 0.999
│ │ │ │ └─ state => IdDict
│ │ │ ├─ loss_func => typeof(Flux.Losses.huber_loss)
│ │ │ ├─ γ => 0.99
│ │ │ ├─ sampler => ReinforcementLearningCore.BatchSampler
│ │ │ │ └─ batch_size => 32
│ │ │ ├─ min_replay_history => 100
│ │ │ ├─ rng => StableRNGs.LehmerRNG
│ │ │ └─ loss => 0.0
│ │ └─ explorer => ReinforcementLearningCore.EpsilonGreedyExplorer
│ │ ├─ ϵ_stable => 0.01
│ │ ├─ ϵ_init => 1.0
│ │ ├─ warmup_steps => 0
│ │ ├─ decay_steps => 500
│ │ ├─ step => 1
│ │ ├─ rng => StableRNGs.LehmerRNG
│ │ └─ is_training => true
│ └─ trajectory => ReinforcementLearningCore.Trajectory
│ └─ traces => NamedTuple
│ ├─ state => 4×0 CircularArrayBuffers.CircularArrayBuffer{Float32,2}
│ ├─ action => 0-element CircularArrayBuffers.CircularArrayBuffer{Int64,1}
│ ├─ reward => 0-element CircularArrayBuffers.CircularArrayBuffer{Float32,1}
│ └─ terminal => 0-element CircularArrayBuffers.CircularArrayBuffer{Bool,1}
├─ env => ReinforcementLearningEnvironments.CartPoleEnv
├─ stop_condition => ReinforcementLearningCore.StopAfterStep
│ ├─ step => 10000
│ ├─ cur => 1
│ └─ progress => ProgressMeter.Progress
├─ hook => ReinforcementLearningCore.ComposedHook
│ └─ hooks
│ ├─ 1
│ │ └─ ReinforcementLearningCore.TotalRewardPerEpisode
│ │ ├─ rewards => 0-element Array{Float64,1}
│ │ └─ reward => 0.0
│ ├─ 2
│ │ └─ ReinforcementLearningCore.TimePerStep
│ │ ├─ times => 0-element CircularArrayBuffers.CircularArrayBuffer{Float64,1}
│ │ └─ t => 696140120366
│ ├─ 3
│ │ └─ ReinforcementLearningCore.DoEveryNStep
│ │ ├─ f => ReinforcementLearningZoo.var"#334#338"
│ │ ├─ n => 1
│ │ └─ t => 0
│ └─ 4
│ └─ ReinforcementLearningCore.DoEveryNEpisode
│ ├─ f => ReinforcementLearningZoo.var"#336#340"
│ ├─ n => 1
│ └─ t => 0
└─ description => "This experiment uses three dense layers to approximate the Q value...."
At the highest level, each experiment contains the following four parts: a policy (the agent), an environment, a stop condition, and a hook.
The relation between the agent and the env is simple: the agent takes in an environment and feeds an action back. This process repeats until a stop condition is met, and at each step the agent needs to improve its policy in order to maximize the expected total reward. When executing run(E`JuliaRL_BasicDQN_CartPole`), the call is dispatched to run(agent, env, stop_condition, hook), so it's just the same as running the following lines:
experiment = E`JuliaRL_BasicDQN_CartPole`
agent = experiment.policy
env = experiment.env
stop_condition = experiment.stop_condition
hook = experiment.hook
run(agent, env, stop_condition, hook)
Now let's explain these components one by one.
A stop condition is used to determine when to stop an experiment. Two typical ones are StopAfterStep
and StopAfterEpisode
. As you may have seen, the above experiment uses StopAfterStep(10_000)
as the stop condition. Try to change the stop condition and see if it works as expected.
experiment = E`JuliaRL_BasicDQN_CartPole`
run(experiment.policy, experiment.env, StopAfterEpisode(100), experiment.hook)
At some point, you may need to learn how to write a customized stop condition.
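To give you a taste, here is a minimal sketch. It assumes the convention that a stop condition is simply something callable as (policy, env) that returns true when the experiment should stop; the struct and field names are illustrative, not part of the package.
```
# Stop after the environment has terminated n times.
mutable struct StopAfterNTerminations
    n::Int
    count::Int
end
StopAfterNTerminations(n) = StopAfterNTerminations(n, 0)

function (s::StopAfterNTerminations)(policy, env)
    is_terminated(env) && (s.count += 1)
    s.count >= s.n          # returning true stops the experiment
end
```
You would then pass it in place of StopAfterEpisode(100) above, for example run(experiment.policy, experiment.env, StopAfterNTerminations(50), experiment.hook).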
The concept of a hook in ReinforcementLearning.jl is mainly inspired by the two-way callbacks in FastAI:
A callback should be available at every single point that code can be run during training, so that a user can customise every single detail of the training method;
Every callback should be able to access every piece of information available at that stage in the training loop, including hyper-parameters, losses, gradients, input and target data, and so forth;
In fact, we extend the first kind of callback even further in ReinforcementLearning.jl. Thanks to multiple dispatch in Julia, we can easily customize the behavior of every detail of the training, testing, and evaluation stages.
You can check the list of provided hooks here. Two common hooks are TotalRewardPerEpisode
and StepsPerEpisode
.
experiment = E`JuliaRL_BasicDQN_CartPole`
hook = TotalRewardPerEpisode()
run(experiment.policy, experiment.env, experiment.stop_condition, hook)
using Plots   # assuming Plots.jl provides the plot function used here
plot(hook.rewards)
Total reward of each episode during training.
Still wondering how the tensorboard logging data is generated? Learn how to use tensorboard and how to write a customized hook.
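To give you a flavor of a customized hook, here is a minimal sketch that counts finished episodes. It assumes hooks are functors subtyping AbstractHook and dispatched on stage types such as PostEpisodeStage; treat those exact names as assumptions tied to the package version you have installed.
```
using ReinforcementLearning

Base.@kwdef mutable struct EpisodeCounter <: AbstractHook
    n::Int = 0
end

# Called once at the end of every episode.
(h::EpisodeCounter)(::PostEpisodeStage, policy, env) = h.n += 1

experiment = E`JuliaRL_BasicDQN_CartPole`
counter = EpisodeCounter()
run(experiment.policy, experiment.env, StopAfterEpisode(10), counter)
counter.n   # expected to be 10
```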
An agent is an instance of AbstractPolicy
. It is a functional object which takes in an environment and returns an action.
action = agent(env)
In the above experiment, the agent is of type Agent, which is one of the most common policies in this package. We'll study how to create, modify, and update an agent in detail later. Suppose now we want to apply another policy, a simple random policy, to the cart pole environment. We can simply replace the first argument with RandomPolicy([1, 2]). Here [1, 2] are the valid actions of the CartPoleEnv.
using ReinforcementLearning
experiment = E`JuliaRL_BasicDQN_CartPole`
run(RandomPolicy([1,2]), experiment.env, experiment.stop_condition, experiment.hook)
println(experiment.description)
Just as you did above, you can now check the result by following the description of the experiment.
We've been using the CartPoleEnv for all the experiments above. But what does it look like? By printing it in the REPL, we can see a lot of information about it; each item is clearly described in interface.jl.
env = CartPoleEnv()
# CartPoleEnv
## Traits
| Trait Type | Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle | ReinforcementLearningBase.SingleAgent() |
| DynamicStyle | ReinforcementLearningBase.Sequential() |
| InformationStyle | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle | ReinforcementLearningBase.Stochastic() |
| RewardStyle | ReinforcementLearningBase.StepReward() |
| UtilityStyle | ReinforcementLearningBase.GeneralSum() |
| ActionStyle | ReinforcementLearningBase.MinimalActionSet() |
| StateStyle | ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Any}() |
## Is Environment Terminated?
No
## State Space
`ReinforcementLearningBase.Space{Array{IntervalSets.Interval{:closed,:closed,Float64},1}}(IntervalSets.Interval{:closed,:closed,Float64}[-4.8..4.8, -1.0e38..1.0e38, -0.41887902047863906..0.41887902047863906, -1.0e38..1.0e38])`
## Action Space
`Base.OneTo(2)`
## Current State
```
[0.034057600857863024, -0.0035062727636622423, -0.003784234694028263, 0.04351977371929182]
```
Some people coming from the Python world may be familiar with the APIs defined in OpenAI/Gym. Ours are very similar to them for simple environments:
reset!(env)                 # start a new episode
state(env)                  # observe the current state
reward(env)                 # reward received at the last step
is_terminated(env)          # whether the current episode has ended
actions(env)                # the set of valid actions
env(rand(actions(env)))     # apply a (random) action to the environment
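Put together, these functions are already enough to hand-roll a full interaction loop. The sketch below plays one episode with random actions:
```
using ReinforcementLearning

# Play one episode with random actions and return the total reward.
function random_episode(env)
    reset!(env)
    total = 0.0
    while !is_terminated(env)
        env(rand(actions(env)))   # apply a random action
        total += reward(env)      # accumulate the per-step reward
    end
    total
end

random_episode(CartPoleEnv())
```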
However, our package has the more ambitious goal of supporting much more complicated environments. You may take a look at ReinforcementLearningEnvironments.jl to see some more built-in examples. For users who are interested in applying the algorithms in this package to their own problems, you may also read the detailed description of how to write a customized environment.
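To give a rough idea of what that involves, here is a minimal sketch of a toy environment. It assumes the interface functions documented in interface.jl (action_space, state, state_space, reward, is_terminated, reset!, and calling the env with an action) and the exported RLBase alias; check interface.jl for the exact set in your version, and treat the names here as illustrative only.
```
using ReinforcementLearning

# A toy "guess the number" environment: one step, reward 1 for a correct guess.
mutable struct GuessNumberEnv <: AbstractEnv
    target::Int
    guess::Int   # 0 means "no guess made yet"
end
GuessNumberEnv() = GuessNumberEnv(rand(1:10), 0)

RLBase.action_space(::GuessNumberEnv) = 1:10
RLBase.state(env::GuessNumberEnv) = env.guess
RLBase.state_space(::GuessNumberEnv) = 0:10
RLBase.reward(env::GuessNumberEnv) = env.guess == env.target ? 1.0 : 0.0
RLBase.is_terminated(env::GuessNumberEnv) = env.guess != 0
RLBase.reset!(env::GuessNumberEnv) = (env.target = rand(1:10); env.guess = 0)

(env::GuessNumberEnv)(action) = (env.guess = action)
```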
We have introduced the four main concepts in the ReinforcementLearning.jl
package. I hope you have a better understanding of them now.
For starters who would like to learn reinforcement learning, I'd suggest you start from ReinforcementLearningAnIntroduction.jl. If you are already familiar with traditional tabular reinforcement learning algorithms, go ahead to ReinforcementLearningZoo.jl and explore the DRL-related experiments there. Try to modify the parameters and compare the different results.
For general users who want to apply the existing algorithms in our package to their own environments, first skim through the games defined in ReinforcementLearningEnvironments.jl to learn how to describe the problem you are going to deal with. Then choose an appropriate policy in ReinforcementLearningZoo.jl and tune the hyperparameters. The Guide page may help you understand how each component is connected with the others.
For algorithm designers who want to contribute new algorithms, we suggest reading the blog to understand the design principles and best practices.