# ReinforcementLearningCore.jl

`ReinforcementLearningCore.RLCore`

— Module

ReinforcementLearningCore.jl (**RLCore**) provides some standard and reusable components defined by **RLBase**, in the hope that they are useful for implementing and experimenting with different kinds of algorithms.

`ReinforcementLearningCore.AbstractApproximator`

— Type`(app::AbstractApproximator)(env)`

An approximator is a functional object for value estimation. It serves as a black box that provides an abstraction over different kinds of approximation methods (for example, a DNN provided by Flux or Knet).

`ReinforcementLearningCore.AbstractExplorer`

— Type```
(p::AbstractExplorer)(x)
(p::AbstractExplorer)(x, mask)
```

Define how to select an action based on action values.

`ReinforcementLearningCore.AbstractHook`

— Type

A hook is called at different stages during a `run` to allow users to inject customized runtime logic. By default, an `AbstractHook` will do nothing. One can override the behavior by implementing the following methods:

- `(hook::YourHook)(::PreActStage, agent, env, action)` (note that there's an extra argument of `action`)
- `(hook::YourHook)(::PostActStage, agent, env)`
- `(hook::YourHook)(::PreEpisodeStage, agent, env)`
- `(hook::YourHook)(::PostEpisodeStage, agent, env)`
- `(hook::YourHook)(::PostExperimentStage, agent, env)`
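The dispatch pattern described above can be sketched in plain Julia. The stage and hook types below are hypothetical stand-ins defined locally so the snippet runs on its own (Julia 1.3 or later); `CountingHook` is an illustrative name, not part of the package.

```julia
# Hypothetical stand-ins for the stage types; the real ones come from the package.
abstract type AbstractStage end
struct PreEpisodeStage <: AbstractStage end
struct PreActStage <: AbstractStage end
struct PostActStage <: AbstractStage end
struct PostEpisodeStage <: AbstractStage end

abstract type AbstractHook end
# Fallback: a hook does nothing for any stage it does not handle explicitly.
(hook::AbstractHook)(args...) = nothing

# An illustrative hook that counts steps and finished episodes.
mutable struct CountingHook <: AbstractHook
    steps::Int
    episodes::Int
end
CountingHook() = CountingHook(0, 0)

(h::CountingHook)(::PostActStage, agent, env) = (h.steps += 1; nothing)
(h::CountingHook)(::PostEpisodeStage, agent, env) = (h.episodes += 1; nothing)

h = CountingHook()
for _ in 1:3
    h(PreActStage(), nothing, nothing, 1)   # not overridden, so the no-op fallback runs
    h(PostActStage(), nothing, nothing)
end
h(PostEpisodeStage(), nothing, nothing)
```

Only the stages a hook cares about need a method; everything else falls through to the no-op default.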

`ReinforcementLearningCore.AbstractLearner`

— Type`(learner::AbstractLearner)(env)`

A learner is usually used to estimate state values, state-action values or distributional values based on experiences.

`ReinforcementLearningCore.AbstractTrajectory`

— Type`AbstractTrajectory`

A trajectory is used to record some useful information during the interactions between agents and environments. It behaves similarly to a `NamedTuple`, except that we extend it with some optional methods.

Required Methods:

- `Base.getindex`
- `Base.keys`

Optional Methods:

- `Base.length`
- `Base.isempty`
- `Base.empty!`
- `Base.haskey`
- `Base.push!`
- `Base.pop!`

`ReinforcementLearningCore.ActorCritic`

— Type`ActorCritic(;actor, critic, optimizer=ADAM())`

The `actor` part must return logits (*Do not use softmax in the last layer!*), and the `critic` part must return a state value.

`ReinforcementLearningCore.Agent`

— Type`Agent(;kwargs...)`

A wrapper of an `AbstractPolicy`. Generally speaking, it does nothing but update the trajectory and policy appropriately in different stages.

**Keywords & Fields**

- `policy`::`AbstractPolicy`: the policy to use
- `trajectory`::`AbstractTrajectory`: used to store transitions between an agent and an environment

`ReinforcementLearningCore.Agent`

— Method

Here we extend the definition of `(p::AbstractPolicy)(::AbstractEnv)` in `RLBase` to accept an `AbstractStage` as the first argument. Algorithm designers may customize these behaviors respectively by implementing:

- `(p::YourPolicy)(::AbstractStage, ::AbstractEnv)`
- `(p::YourPolicy)(::PreActStage, ::AbstractEnv, action)`

The default behaviors for `Agent` are:

- Update the inner `trajectory` given the context of `policy`, `env`, and `stage`.
  - By default we do nothing.
  - In the `PreActStage`, we `push!` the current **state** and the **action** into the `trajectory`.
  - In the `PostActStage`, we query the `reward` and `is_terminated` info from `env` and push them into the `trajectory`.
  - In the `PostEpisodeStage`, we push the `state` at the end of an episode and a dummy action into the `trajectory`.
  - In the `PreEpisodeStage`, we pop out the latest `state` and `action` pair (which are dummy ones) from the `trajectory`.
- Update the inner `policy` given the context of `trajectory`, `env`, and `stage`.
  - By default, we only `update!` the `policy` in the `PreActStage`, and it is dispatched to `update!(policy, trajectory, env, stage)`.
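To make the trajectory bookkeeping concrete, here is a minimal sketch under stated assumptions: the stage types are local stand-ins, the trajectory is reduced to a `NamedTuple` of vectors, and `record!` is a hypothetical helper name. The guard against popping from an empty trajectory is also an assumption of this sketch.

```julia
# Hypothetical stand-ins for the stage types; the real ones come from the package.
abstract type AbstractStage end
struct PreEpisodeStage <: AbstractStage end
struct PreActStage <: AbstractStage end
struct PostActStage <: AbstractStage end
struct PostEpisodeStage <: AbstractStage end

# A trajectory reduced to a NamedTuple of plain vectors (an assumption for illustration).
traj = (state = Int[], action = Int[], reward = Float64[], terminal = Bool[])

# `record!` is a hypothetical helper, not a function of the package.
function record!(t, ::PreEpisodeStage)
    # Drop the end-of-episode state and dummy action left by the previous episode.
    if !isempty(t.state)
        pop!(t.state)
        pop!(t.action)
    end
    return t
end
record!(t, ::PreActStage, state, action) = (push!(t.state, state); push!(t.action, action); t)
record!(t, ::PostActStage, reward, done) = (push!(t.reward, reward); push!(t.terminal, done); t)
record!(t, ::PostEpisodeStage, state, dummy) = (push!(t.state, state); push!(t.action, dummy); t)

# One two-step episode: states 1 -> 2 -> 3, action 1 each time, dummy action 0 at the end.
record!(traj, PreEpisodeStage())
record!(traj, PreActStage(), 1, 1)
record!(traj, PostActStage(), 0.0, false)
record!(traj, PreActStage(), 2, 1)
record!(traj, PostActStage(), 1.0, true)
record!(traj, PostEpisodeStage(), 3, 0)
```

After the episode, `state` and `action` hold one more element than `reward` and `terminal`; the extra pair is removed again when the next episode starts.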

`ReinforcementLearningCore.BatchExplorer`

— Type`BatchExplorer(explorer::AbstractExplorer)`

`ReinforcementLearningCore.BatchExplorer`

— Method`(x::BatchExplorer)(values::AbstractMatrix)`

Apply inner explorer to each column of `values`.

`ReinforcementLearningCore.BatchStepsPerEpisode`

— Method`BatchStepsPerEpisode(batch_size::Int; tag = "TRAINING")`

Similar to `StepsPerEpisode`, but is specific to environments which return a `Vector` of rewards (a typical case with `MultiThreadEnv`).

`ReinforcementLearningCore.ComposedHook`

— Type`ComposedHook(hooks::AbstractHook...)`

Compose different hooks into a single hook.

`ReinforcementLearningCore.ComposedStopCondition`

— Type`ComposedStopCondition(stop_conditions...; reducer = any)`

The result of `stop_conditions` is reduced by `reducer`.
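As a rough illustration of what "reduced by `reducer`" means, the sketch below defines a hypothetical `ComposedStop` type (not the package's implementation) that calls each condition with the same arguments and combines the Boolean results with `any` or `all`.

```julia
# Hypothetical stand-in showing the reduce-over-conditions idea.
struct ComposedStop{T<:Tuple,F}
    conditions::T
    reducer::F
end
ComposedStop(conds...; reducer = any) = ComposedStop(conds, reducer)

# Call every condition with the same arguments and reduce the results.
(s::ComposedStop)(args...) = s.reducer(c(args...) for c in s.conditions)

after3 = t -> t >= 3
after5 = t -> t >= 5

either = ComposedStop(after3, after5)                  # stops when any condition fires
both = ComposedStop(after3, after5; reducer = all)     # stops only when all conditions fire
```

With `reducer = any` the composed condition fires as soon as one member does; `reducer = all` waits for every member.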

`ReinforcementLearningCore.DoEveryNEpisode`

— Type`DoEveryNEpisode(f; n=1, t=0)`

Execute `f(t, agent, env)` every `n` episodes. `t` is a counter of episodes.

`ReinforcementLearningCore.DoEveryNStep`

— Type`DoEveryNStep(f; n=1, t=0)`

Execute `f(t, agent, env)` every `n` steps. `t` is a counter of steps.

`ReinforcementLearningCore.DoOnExit`

— Type`DoOnExit(f)`

Call the lambda function `f` at the end of an `Experiment`.

`ReinforcementLearningCore.EmptyHook`

— Type

Do nothing.

`ReinforcementLearningCore.EpsilonGreedyExplorer`

— Type```
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)
```

Epsilon-greedy strategy: The best lever is selected for a proportion `1 - epsilon` of the trials, and a lever is selected at random (with uniform probability) for a proportion `epsilon`. - Multi-armed_bandit

Two kinds of epsilon-decreasing strategy are implemented here (`linear` and `exp`).

Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. - Multi-armed_bandit

**Keywords**

- `T::Symbol`: defines how to calculate the epsilon in the warmup steps. Supported values are `linear` and `exp`.
- `step::Int = 1`: record the current step.
- `ϵ_init::Float64 = 1.0`: initial epsilon.
- `warmup_steps::Int=0`: the number of steps to use `ϵ_init`.
- `decay_steps::Int=0`: the number of steps for epsilon to decay from `ϵ_init` to `ϵ_stable`.
- `ϵ_stable::Float64`: the epsilon after `warmup_steps + decay_steps`.
- `is_break_tie=false`: randomly select an action of the same maximum values if set to `true`.
- `rng=Random.GLOBAL_RNG`: set the internal RNG.
- `is_training=true`: when set to `false` (i.e. not in training mode), `step` will not be updated and `ϵ` will be set to 0.

**Example**

```
s = EpsilonGreedyExplorer{:linear}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="linear epsilon")
```

```
s = EpsilonGreedyExplorer{:exp}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="exp epsilon")
```
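For readers without a plotting backend, the two schedules can be approximated by a small standalone function. This is a sketch under assumptions: `epsilon_at` is a hypothetical helper rather than the package's `get_ϵ`, and the exponential form used here (the gap to `ϵ_stable` shrinking by a factor of `e` every `decay_steps`) is one plausible reading, not a confirmed formula.

```julia
# Hypothetical helper; the package's own schedule may differ in detail.
function epsilon_at(kind::Symbol, step::Integer;
                    ϵ_init = 1.0, ϵ_stable = 0.1, warmup_steps = 0, decay_steps = 0)
    # During warmup, epsilon stays at its initial value.
    step <= warmup_steps && return ϵ_init
    n = step - warmup_steps
    if kind === :linear
        # Straight line from ϵ_init to ϵ_stable over `decay_steps` steps.
        n >= decay_steps && return ϵ_stable
        return ϵ_init + (ϵ_stable - ϵ_init) * n / decay_steps
    elseif kind === :exp
        # Assumed form: exponential approach towards ϵ_stable.
        decay_steps == 0 && return ϵ_stable
        return ϵ_stable + (ϵ_init - ϵ_stable) * exp(-n / decay_steps)
    else
        throw(ArgumentError("unsupported schedule: $kind"))
    end
end

linear_curve = [epsilon_at(:linear, i; ϵ_init = 0.9, ϵ_stable = 0.1,
                           warmup_steps = 100, decay_steps = 100) for i in 1:500]
exp_curve = [epsilon_at(:exp, i; ϵ_init = 0.9, ϵ_stable = 0.1,
                        warmup_steps = 100, decay_steps = 100) for i in 1:500]
```

Under these assumptions the linear curve reaches `ϵ_stable` exactly at `warmup_steps + decay_steps`, while the exponential curve only approaches it.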

`ReinforcementLearningCore.EpsilonGreedyExplorer`

— Method`(s::EpsilonGreedyExplorer)(values; step) where T`

If multiple values share the same maximum value, a random one of them will be returned.

`NaN` will be filtered unless all the values are `NaN`. In that case, a random one will be returned.
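The tie-breaking and `NaN` handling described above can be sketched as follows. `random_argmax` is a hypothetical helper written for this illustration; it is not claimed to match the package's internal code.

```julia
using Random

# Hypothetical helper: index of a maximum value, ignoring NaN, with ties broken at random.
function random_argmax(rng::AbstractRNG, values::AbstractVector{<:Real})
    valid = findall(!isnan, values)
    # If every value is NaN, fall back to a uniformly random index.
    isempty(valid) && return rand(rng, eachindex(values))
    best = maximum(values[valid])
    candidates = [i for i in valid if values[i] == best]
    return rand(rng, candidates)
end

rng = MersenneTwister(123)
choice = random_argmax(rng, [1.0, NaN, 3.0, 3.0])   # either 3 or 4
```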

`ReinforcementLearningCore.Experiment`

— Type`Experiment(policy, env, stop_condition, hook, description)`

These are the essential components in a typical reinforcement learning experiment:

- `policy`, generates an action during the interaction with the `env`. It may update its strategy in the meanwhile.
- `env`, the environment we're going to experiment with.
- `stop_condition`, defines when the experiment terminates.
- `hook`, collects some intermediate data during the experiment.
- `description`, displays some useful information for logging.

`ReinforcementLearningCore.MultiAgentHook`

— Type`MultiAgentHook(player=>hook...)`

`ReinforcementLearningCore.MultiAgentManager`

— Method`MultiAgentManager(player => policy...)`

This is the simplest form of multiagent system. At each step the agents observe the environment from their own perspective and get updated independently. For environments of `SEQUENTIAL` style, agents which are not the current player will observe a dummy action of `NO_OP` in the `PreActStage`.

`ReinforcementLearningCore.NamedPolicy`

— Type`NamedPolicy(name=>policy)`

A policy wrapper to provide a name. Mostly used in multi-agent environments.

`ReinforcementLearningCore.NeuralNetworkApproximator`

— Type`NeuralNetworkApproximator(;kwargs)`

Use a DNN model for value estimation.

**Keyword arguments**

- `model`, a Flux based DNN model.
- `optimizer=nothing`

`ReinforcementLearningCore.NoOp`

— Type

Represent no-operation if it's not the agent's turn.

`ReinforcementLearningCore.QBasedPolicy`

— Type`QBasedPolicy(;learner::Q, explorer::S)`

Use a Q-`learner` to generate estimations of action values. Then an `explorer` is applied on the estimations to select an action.

`ReinforcementLearningCore.RandomPolicy`

— Type`RandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)`

If `action_space` is `nothing`, then it will use the `legal_action_space` at runtime to randomly select an action. Otherwise, a random element within `action_space` is selected.

You should always set `action_space=nothing` when dealing with environments of `FULL_ACTION_SET`.

`ReinforcementLearningCore.RewardsPerEpisode`

— Type`RewardsPerEpisode(; rewards = Vector{Vector{Float64}}())`

Store the reward of each step of every episode in the `rewards` field.

`ReinforcementLearningCore.StackFrames`

— Type`StackFrames(::Type{T}=Float32, d::Int...)`

Use a pre-initialized `CircularArrayBuffer` to store the latest several states specified by `d`. Before processing any observation, the buffer is filled with `zero(T)` by default.

`ReinforcementLearningCore.StepsPerEpisode`

— Type`StepsPerEpisode(; steps = Int[], count = 0)`

Store steps of each episode in the field of `steps`.

`ReinforcementLearningCore.StopAfterEpisode`

— Type`StopAfterEpisode(episode; cur = 0, is_show_progress = true)`

Return `true` after `episode` episodes. If `is_show_progress` is `true`, the `ProgressMeter` will be used to show progress.

`ReinforcementLearningCore.StopAfterNSeconds`

— Type`StopAfterNSeconds`

Stop training after N seconds.

Parameter:

- time budget

`ReinforcementLearningCore.StopAfterNoImprovement`

— Type`StopAfterNoImprovement()`

Stop training when a monitored metric has stopped improving.

Parameters:

- `fn`: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.
  - `() -> reward(env)`
  - `() -> total_reward_per_episode.reward`
- `patience`: number of epochs with no improvement after which training will be stopped.
- `δ`: minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than `δ` will count as no improvement.

Return `true` after the monitored metric has stopped improving.
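The interplay of `fn`, `patience` and `δ` can be sketched with a small callable struct. `PatienceStop` is a hypothetical name and this is only one reasonable reading of the parameters above, not the package's implementation.

```julia
# Hypothetical patience-based stop condition for illustration.
mutable struct PatienceStop{F}
    fn::F
    patience::Int
    δ::Float64
    best::Float64
    stale::Int      # number of consecutive calls without improvement
end
PatienceStop(fn, patience::Integer, δ::Real = 0.0) =
    PatienceStop(fn, Int(patience), Float64(δ), -Inf, 0)

function (s::PatienceStop)()
    score = s.fn()
    if score > s.best + s.δ
        # Improvement larger than δ: remember it and reset the counter.
        s.best = score
        s.stale = 0
    else
        s.stale += 1
    end
    return s.stale >= s.patience
end

scores = [1.0, 2.0, 2.0, 1.5, 1.0]
cursor = Ref(0)
stop = PatienceStop(() -> (cursor[] += 1; scores[cursor[]]), 3)
results = [stop() for _ in 1:5]
```

Here the metric improves twice and then stalls for three calls, so only the fifth call reports that training should stop.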

`ReinforcementLearningCore.StopAfterStep`

— Type`StopAfterStep(step; cur = 1, is_show_progress = true)`

Return `true` after being called `step` times.

`ReinforcementLearningCore.StopSignal`

— Type`StopSignal()`

Create a stop signal initialized with a value of `false`. You can manually set it to `true` by `s[] = true` to stop the running loop at any time.

`ReinforcementLearningCore.StopWhenDone`

— Type`StopWhenDone()`

Return `true` if the environment is terminated.

`ReinforcementLearningCore.SumTree`

— Type`SumTree(capacity::Int)`

Efficiently sample and update weights. For more details, see the post here. Here we use a vector to represent the binary tree. Suppose we will have `capacity` leaves at most. Every time we `push!` a new node into the tree, only the most recent `capacity` nodes and their sums will be updated. The vector is laid out as the parent nodes (size `2^ceil(Int, log2(capacity)) - 1`) followed by the leaves (size `capacity`).

**Example**

```
julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
push!(t, i)
end
julia> t
8-element SumTree:
9.0
10.0
11.0
12.0
13.0
14.0
15.0
16.0
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6 … 1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0 … 9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
7 => 14991
4 => 12019
2 => 10003
3 => 11027
5 => 12971
8 => 16052
6 => 13952
1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
9.0 => 8985
13.0 => 12971
10.0 => 10003
14.0 => 13952
16.0 => 16052
11.0 => 11027
15.0 => 14991
12.0 => 12019
```
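The vector layout described above (parent nodes first, then leaves) can be illustrated with a stripped-down sum tree. `MiniSumTree`, `set_weight!`, `find_leaf` and `sample_leaf` are hypothetical names for this sketch; it omits the circular `push!` behaviour of the real type and is not the package's implementation.

```julia
using Random

# Hypothetical minimal sum tree for illustration; not the package's `SumTree`.
struct MiniSumTree
    nparents::Int
    tree::Vector{Float64}   # [parent nodes..., leaves...]
end

function MiniSumTree(capacity::Int)
    nparents = 2^ceil(Int, log2(capacity)) - 1
    return MiniSumTree(nparents, zeros(nparents + capacity))
end

# Set the weight of leaf `i` and propagate the change up to the root.
function set_weight!(t::MiniSumTree, i::Int, w::Real)
    idx = t.nparents + i
    Δ = w - t.tree[idx]
    while idx >= 1
        t.tree[idx] += Δ
        idx ÷= 2
    end
    return t
end

# Find the leaf whose cumulative weight interval contains `v`, for 0 < v <= total.
function find_leaf(t::MiniSumTree, v::Real)
    idx = 1
    while idx <= t.nparents
        left = 2idx
        leftsum = left <= length(t.tree) ? t.tree[left] : 0.0
        if v <= leftsum
            idx = left
        else
            v -= leftsum
            idx = left + 1
        end
    end
    return idx - t.nparents, t.tree[idx]
end

# Draw a leaf with probability proportional to its weight.
sample_leaf(rng::AbstractRNG, t::MiniSumTree) = find_leaf(t, (1 - rand(rng)) * t.tree[1])

t = MiniSumTree(4)
for (i, w) in enumerate([1.0, 2.0, 3.0, 4.0])
    set_weight!(t, i, w)
end
```

Both updating a weight and locating a leaf touch one node per tree level, which is what makes weighted sampling cheap compared with rescanning all weights.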

`ReinforcementLearningCore.TabularApproximator`

— Type`TabularApproximator(table<:AbstractArray, opt)`

For `table` of 1-d, it will serve as a state value approximator. For `table` of 2-d, it will serve as a state-action value approximator.

For `table` of 2-d, the first dimension is action and the second dimension is state.
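The index order matters when reading or writing the table directly. A small sketch with a plain `Matrix` (the helper names `action_values` and `best_action` are hypothetical):

```julia
n_actions, n_states = 2, 3
table = zeros(n_actions, n_states)   # first dimension: action, second dimension: state

# Write the value of taking action 2 in state 3.
table[2, 3] = 1.5

# All action values of one state are a single column of the table.
action_values(table, state) = view(table, :, state)
best_action(table, state) = argmax(action_values(table, state))
```

Because Julia arrays are column-major, keeping actions in the first dimension means the action values of one state sit next to each other in memory.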

`ReinforcementLearningCore.TabularRandomPolicy`

— Type`TabularRandomPolicy(;table=Dict{Int, Float32}(), rng=Random.GLOBAL_RNG)`

Use a `Dict` to store action distribution.

`ReinforcementLearningCore.TimePerStep`

— Type```
TimePerStep(;max_steps=100)
TimePerStep(times::CircularArrayBuffer{Float64}, t::UInt64)
```

Store time cost of the latest `max_steps` in the `times` field.

`ReinforcementLearningCore.TotalBatchRewardPerEpisode`

— Method`TotalBatchRewardPerEpisode(batch_size::Int; is_display_on_exit=true)`

Similar to `TotalRewardPerEpisode`, but is specific to environments which return a `Vector` of rewards (a typical case with `MultiThreadEnv`). If `is_display_on_exit` is set to `true`, a ribbon plot will be shown to reflect the mean and std of rewards.

`ReinforcementLearningCore.TotalRewardPerEpisode`

— Type`TotalRewardPerEpisode(; rewards = Float64[], reward = 0.0, is_display_on_exit = true)`

Store the total reward of each episode in the field of `rewards`. If `is_display_on_exit` is set to `true`, a unicode plot will be shown at the `PostExperimentStage`.

`ReinforcementLearningCore.Trajectory`

— Type`Trajectory(;[trace_name=trace_container]...)`

A simple wrapper of `NamedTuple`. We define our own type here to avoid type piracy with `NamedTuple`.

`ReinforcementLearningCore.UCBExplorer`

— Method`UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)`

**Arguments**

- `na` is the number of actions, used to create an internal counter.
- `t` is used to store the current time step.
- `c` is used to control the degree of exploration.
- `seed`, set the seed of the inner RNG.
- `is_training=true`, when set to `false` (i.e. not in training mode), the time step and counter will not be updated.
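The roles of `c`, `ϵ`, the time step and the per-action counter can be illustrated with the usual upper-confidence-bound rule. The exact formula used by the package is not stated here, so the bonus term below is an assumption, and `ucb_select` is a hypothetical helper.

```julia
# Hypothetical UCB-style selection; the bonus formula is an assumed, textbook-style form.
function ucb_select(values::AbstractVector, counts::AbstractVector, t::Integer;
                    c = 2.0, ϵ = 1e-10)
    # Rarely tried actions get a larger bonus; ϵ avoids division by zero.
    bonus = c .* sqrt.(log(t + 1) ./ (counts .+ ϵ))
    return argmax(values .+ bonus)
end

# Action 2 has a lower estimate but was tried only once, so it wins the bonus.
picked = ucb_select([1.0, 0.5], [100, 1], 101)
```

A larger `c` widens the bonus and therefore explores more; with equal counts the rule reduces to a plain greedy choice.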

`ReinforcementLearningCore.UploadTrajectoryEveryNStep`

— Type`UploadTrajectoryEveryNStep(;mailbox, n, sealer=deepcopy)`

`ReinforcementLearningCore.VBasedPolicy`

— Type`VBasedPolicy(;learner, mapping=default_value_action_mapping)`

The `learner` must be a value learner. The `mapping` is a function which returns an action given `env` and the `learner`. By default we iterate through all the valid actions and select the one which leads to the maximum state value.

`ReinforcementLearningCore.WeightedExplorer`

— Type`WeightedExplorer(;is_normalized::Bool, rng=Random.GLOBAL_RNG)`

`is_normalized` indicates whether the action values fed in are already normalized to have a sum of `1.0`.

Elements are assumed to be `>=0`.

See also: `WeightedSoftmaxExplorer`
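A sketch of weight-proportional selection, showing why `is_normalized` only saves a summation. `weighted_pick` is a hypothetical helper, not the package's implementation.

```julia
using Random

# Hypothetical helper: pick an index with probability proportional to its weight.
function weighted_pick(rng::AbstractRNG, weights::AbstractVector{<:Real};
                       is_normalized::Bool = false)
    # When the weights already sum to 1, the total does not need to be computed.
    total = is_normalized ? 1.0 : sum(weights)
    r = rand(rng) * total
    acc = 0.0
    for (i, w) in pairs(weights)
        acc += w
        r < acc && return i
    end
    return lastindex(weights)   # guard against floating-point round-off
end

rng = MersenneTwister(42)
always_second = [weighted_pick(rng, [0.0, 3.0, 0.0]) for _ in 1:10]
```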

`Base.push!`

— Method

When pushing a `StackFrames` into a `CircularArrayBuffer` of the same dimension, only the latest frame is pushed. If the `StackFrames` is one dimension lower, then it is treated as a general `AbstractArray` and is pushed in as a frame.

`CUDA.device`

— Method`device(model)`

Detect the suitable running device for the `model`. Return `Val(:cpu)` by default.

`ReinforcementLearningBase.priority`

— Method`get_priority(p::AbstractLearner, experience)`

`ReinforcementLearningBase.prob`

— Method`prob(p::AbstractExplorer, x, mask)`

Similar to `prob(p::AbstractExplorer, x)`, but here only the `mask`ed elements are considered.

`ReinforcementLearningBase.prob`

— Method`prob(p::AbstractExplorer, x) -> AbstractDistribution`

Get the action distribution given action values.

`ReinforcementLearningBase.prob`

— Method```
prob(s::EpsilonGreedyExplorer, values) ->Categorical
prob(s::EpsilonGreedyExplorer, values, mask) ->Categorical
```

Return the probability of selecting each action given the estimated `values` of each action.

`ReinforcementLearningBase.update!`

— Method`update!(a::AbstractApproximator, correction)`

Usually the `correction` is the gradient of inner parameters.

`ReinforcementLearningBase.update!`

— Method`update!(p::TabularRandomPolicy, state => value)`

You should manually check that `value` sums to `1.0`.

`ReinforcementLearningCore.ApproximatorStyle`

— Method

Used to detect what an `AbstractApproximator` is approximating.

`ReinforcementLearningCore._discount_rewards!`

— Method

Assumes `rewards` and `new_rewards` are `Vector`s.

`ReinforcementLearningCore._generalized_advantage_estimation!`

— Method

Assumes `rewards` and `advantages` are `Vector`s.

`ReinforcementLearningCore.check`

— Method

Inject customized checks here by overwriting this function.

`ReinforcementLearningCore.consecutive_view`

— Method`consecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)`

By default, it behaves the same as `select_last_dim(x, inds)`. If `n_stack` is set to an int, then for each frame specified by `inds`, the previous `n_stack` frames (including the current one) are concatenated as a new dimension. If `n_horizon` is set to an int, then for each frame specified by `inds`, the next `n_horizon` frames (including the current one) are concatenated as a new dimension.

**Example**

```
julia> x = collect(1:5)
5-element Array{Int64,1}:
1
2
3
4
5
julia> consecutive_view(x, [2,4]) # just the same with `select_last_dim(x, [2,4])`
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
2
4
julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
1 3
2 4
julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
2 4
3 5
julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2) # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]
[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
1 2
2 3
[:, :, 2] =
3 4
4 5
```

See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.

`ReinforcementLearningCore.discount_rewards`

— Method`discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)`

Calculate the gain starting from the current step with a discount rate of `γ`. `rewards` can be a matrix.

**Keyword arguments**

- `dims=:`, if `rewards` is a `Matrix`, then `dims` can only be `1` or `2`.
- `terminal=nothing`, specifies whether each reward is followed by a terminal. `nothing` means the game is not terminated yet. If `terminal` is provided, then its size must be the same as that of `rewards`.
- `init=nothing`, `init` can be used to provide the reward estimation of the last state.

**Example**
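A minimal sketch of the computation for a plain vector, written independently of the package. `discounted_returns` is a hypothetical helper; using `0.0` in place of `init=nothing` and resetting the running gain at terminal steps are assumptions of this sketch.

```julia
# Hypothetical helper: discounted gain of every step, computed backwards.
function discounted_returns(rewards::AbstractVector{<:Real}, γ::Real;
                            terminal = nothing, init::Real = 0.0)
    returns = zeros(Float64, length(rewards))
    gain = float(init)          # estimated value of the state after the last reward
    for i in length(rewards):-1:1
        # A reward followed by a terminal does not bootstrap from later steps.
        if terminal !== nothing && terminal[i]
            gain = 0.0
        end
        gain = rewards[i] + γ * gain
        returns[i] = gain
    end
    return returns
end

plain = discounted_returns([1.0, 1.0, 1.0], 0.5)
bootstrapped = discounted_returns([1.0, 1.0, 1.0], 0.5; init = 2.0)
split = discounted_returns([1.0, 1.0, 1.0], 0.5; terminal = [false, true, false])
```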

`ReinforcementLearningCore.flatten_batch`

— Method`flatten_batch(x::AbstractArray)`

Merge the last two dimensions.

**Example**

```
julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
1 3
2 4
[:, :, 2] =
5 7
6 8
[:, :, 3] =
9 11
10 12
julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
1 3 5 7 9 11
2 4 6 8 10 12
```

`ReinforcementLearningCore.generalized_advantage_estimation`

— Method`generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)`

Calculate the generalized advantage estimate starting from the current step with a discount rate of `γ` and a lambda for GAE-Lambda of `λ`. `rewards` and `values` can be matrices.

**Keyword arguments**

- `dims=:`, if `rewards` is a `Matrix`, then `dims` can only be `1` or `2`.
- `terminal=nothing`, specifies whether each reward is followed by a terminal. `nothing` means the game is not terminated yet. If `terminal` is provided, then its size must be the same as that of `rewards`.

**Example**
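A minimal sketch of the GAE recursion for plain vectors, written independently of the package. `gae_sketch` is a hypothetical helper, and the convention that `values` carries one extra trailing element (the value of the state after the last reward) is an assumption of this sketch, not a statement about the package's argument shapes.

```julia
# Hypothetical helper: generalized advantage estimation, computed backwards.
function gae_sketch(rewards::AbstractVector{<:Real}, values::AbstractVector{<:Real},
                    γ::Real, λ::Real; terminal = nothing)
    n = length(rewards)
    # Assumed convention: values[n + 1] is the bootstrap value after the last step.
    length(values) == n + 1 || throw(ArgumentError("values must have length(rewards) + 1 elements"))
    advantages = zeros(Float64, n)
    running = 0.0
    for t in n:-1:1
        done = terminal !== nothing && terminal[t]
        next_value = done ? 0.0 : values[t + 1]
        # One-step temporal-difference error.
        δ = rewards[t] + γ * next_value - values[t]
        # Exponentially weighted sum of future TD errors, cut at terminals.
        running = δ + (done ? 0.0 : γ * λ * running)
        advantages[t] = running
    end
    return advantages
end

full = gae_sketch([1.0, 1.0], [0.0, 0.0, 0.0], 1.0, 1.0)
one_step = gae_sketch([1.0, 1.0], [0.0, 0.0, 0.0], 1.0, 0.0)
```

With `λ = 1` the estimate accumulates all future TD errors; with `λ = 0` it reduces to the one-step TD error.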

`ReinforcementLearningCore.normlogpdf`

— Method

GPU automatic differentiable version for the logpdf function of normal distributions. Adding an epsilon value to guarantee numeric stability if sigma is exactly zero (e.g. if relu is used in output layer).
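The idea can be sketched as a scalar function that adds a small epsilon to sigma before dividing and taking the logarithm. `normlogpdf_sketch` and the default `ϵ = 1.0f-8` are assumptions chosen for illustration; the package's own signature and constant may differ.

```julia
# Hypothetical sketch of a numerically guarded normal log-density.
function normlogpdf_sketch(μ, σ, x; ϵ = 1.0f-8)
    # Adding ϵ keeps both the division and the logarithm finite when σ == 0.
    z = (x - μ) / (σ + ϵ)
    return -(z^2 + log(2π)) / 2 - log(σ + ϵ)
end

at_mean = normlogpdf_sketch(0.0, 1.0, 0.0; ϵ = 0.0)
guarded = normlogpdf_sketch(0.0, 0.0, 0.0)
```

Because it is built from elementary operations only, the function can be broadcast over arrays, e.g. `normlogpdf_sketch.(μs, σs, xs)`.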

`StatsBase.sample`

— Method`sample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])`

Here we return a copy instead of a view:

- Each sample is independent of the original `trajectory` so that `trajectory` can be updated async.
- Copy is not always so bad.