An approximator is a functional object for value estimation. It serves as a black box that provides an abstraction over different kinds of approximation methods (for example, a DNN provided by Flux or Knet).


A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can override this behavior by implementing the following methods:

  • (hook::YourHook)(::PreActStage, agent, env, action); note the extra action argument.
  • (hook::YourHook)(::PostActStage, agent, env)
  • (hook::YourHook)(::PreEpisodeStage, agent, env)
  • (hook::YourHook)(::PostEpisodeStage, agent, env)
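
As an illustration, a hook that counts steps and episodes could look like the sketch below. The stage types and AbstractHook are redefined locally as stand-ins so that the snippet runs without the package; StepCounter and run_toy! are hypothetical names, and the driver loop only mimics the order in which stages would be visited.

```julia
# Stand-ins for the package's types, defined here only to keep the sketch self-contained.
abstract type AbstractStage end
struct PreEpisodeStage <: AbstractStage end
struct PreActStage <: AbstractStage end
struct PostActStage <: AbstractStage end
struct PostEpisodeStage <: AbstractStage end

abstract type AbstractHook end
(hook::AbstractHook)(args...) = nothing   # default: do nothing at every stage

# Hypothetical hook that counts steps and episodes.
mutable struct StepCounter <: AbstractHook
    steps::Int
    episodes::Int
end
StepCounter() = StepCounter(0, 0)

(h::StepCounter)(::PostActStage, agent, env) = (h.steps += 1; nothing)
(h::StepCounter)(::PostEpisodeStage, agent, env) = (h.episodes += 1; nothing)

# Toy driver: `nothing` stands in for agent, env and action.
function run_toy!(hook; n_episodes = 2, n_steps = 3)
    for _ in 1:n_episodes
        hook(PreEpisodeStage(), nothing, nothing)
        for _ in 1:n_steps
            hook(PreActStage(), nothing, nothing, nothing)  # extra `action` argument
            hook(PostActStage(), nothing, nothing)
        end
        hook(PostEpisodeStage(), nothing, nothing)
    end
    return hook
end

counter = run_toy!(StepCounter())
println((counter.steps, counter.episodes))
```

Stages that StepCounter does not implement fall through to the do-nothing default, which is why only two methods are needed here.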

A trajectory is used to record useful information during the interactions between agents and environments. It behaves similarly to a NamedTuple, except that we extend it with some optional methods.

Required Methods:

  • Base.getindex
  • Base.keys

Optional Methods:

  • Base.length
  • Base.isempty
  • Base.empty!
  • Base.haskey
  • Base.push!
  • Base.pop!
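
A minimal sketch of a type satisfying this interface is shown below. MiniTrajectory is a hypothetical name, and storing the traces as a NamedTuple of vectors is an assumption made for brevity, not necessarily how the package's own trajectories are laid out.

```julia
# Hypothetical minimal trajectory: a NamedTuple of vectors behind the interface above.
struct MiniTrajectory{T<:NamedTuple}
    traces::T
end
MiniTrajectory(; kwargs...) = MiniTrajectory(values(kwargs))

# Required methods
Base.getindex(t::MiniTrajectory, s::Symbol) = getindex(t.traces, s)
Base.keys(t::MiniTrajectory) = keys(t.traces)

# A few of the optional methods
Base.haskey(t::MiniTrajectory, s::Symbol) = haskey(t.traces, s)
Base.isempty(t::MiniTrajectory) = all(isempty, t.traces)
Base.empty!(t::MiniTrajectory) = (foreach(empty!, t.traces); t)
function Base.push!(t::MiniTrajectory; kwargs...)
    for (k, v) in kwargs
        push!(t[k], v)   # append each value to the trace with the matching name
    end
    return t
end

t = MiniTrajectory(state = Int[], action = Int[], reward = Float64[])
push!(t; state = 1, action = 2, reward = 0.5)
println(keys(t))
```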

ActorCritic(;actor, critic, optimizer=ADAM())

The actor part must return logits (Do not use softmax in the last layer!), and the critic part must return a state value.


A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and policy appropriately in different stages.

Keywords & Fields


Here we extend the definition of (p::AbstractPolicy)(::AbstractEnv) in RLBase to accept an AbstractStage as the first argument. Algorithm designers may customize these behaviors respectively by implementing:

  • (p::YourPolicy)(::AbstractStage, ::AbstractEnv)
  • (p::YourPolicy)(::PreActStage, ::AbstractEnv, action)

The default behaviors for Agent are:

  1. Update the inner trajectory given the context of policy, env, and stage.

     • By default, we do nothing.
     • In the PreActStage, we push! the current state and the action into the trajectory.
     • In the PostActStage, we query the reward and is_terminated info from env and push them into the trajectory.
     • In the PostEpisodeStage, we push the state at the end of an episode and a dummy action into the trajectory.
     • In the PreEpisodeStage, we pop the latest state and action pair (which are dummy ones) out of the trajectory.

  2. Update the inner policy given the context of trajectory, env, and stage.

     • By default, we only update! the policy in the PreActStage, and the call is dispatched to update!(policy, trajectory).
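
The trajectory bookkeeping described above can be traced with a toy simulation. This is only a sketch: plain vectors stand in for the trajectory, the helper names and DUMMY_ACTION are hypothetical, and the emptiness guard before popping is an assumption about how the first episode is handled.

```julia
# Plain vectors stand in for the traces of a trajectory.
states, actions = Int[], Int[]
rewards, terminals = Float64[], Bool[]

DUMMY_ACTION = 0   # hypothetical placeholder action

# PreEpisodeStage: drop the dummy (state, action) pair left by the previous episode.
function pre_episode!()
    if !isempty(states)
        pop!(states)
        pop!(actions)
    end
end
# PreActStage: record the current state and the chosen action.
pre_act!(s, a) = (push!(states, s); push!(actions, a))
# PostActStage: record the reward and the termination flag.
post_act!(r, done) = (push!(rewards, r); push!(terminals, done))
# PostEpisodeStage: record the final state together with a dummy action.
post_episode!(s_final) = (push!(states, s_final); push!(actions, DUMMY_ACTION))

for episode in 1:2
    pre_episode!()
    for step in 1:3
        pre_act!(10 * episode + step, 1)
        post_act!(1.0, step == 3)
    end
    post_episode!(10 * episode + 4)
end

println(states)
println(actions)
```

After each episode the state and action traces are one element longer than the reward and terminal traces, so the final state of the episode stays available until the next episode starts.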

EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: The best lever is selected for a proportion 1 - epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. - Multi-armed_bandit

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. - Multi-armed_bandit


  • T::Symbol: defines how epsilon is decayed after the warmup steps. Supported values are linear and exp.
  • step::Int = 1: record the current step.
  • ϵ_init::Float64 = 1.0: initial epsilon.
  • warmup_steps::Int=0: the number of steps to use ϵ_init.
  • decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
  • ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
  • is_break_tie=false: randomly select an action of the same maximum values if set to true.
  • rng=Random.GLOBAL_RNG: set the internal RNG.
  • is_training=true: when set to false (i.e. not in training mode), step will not be updated and ϵ will be set to 0.
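
To make the interplay of ϵ_init, warmup_steps, decay_steps and ϵ_stable concrete, here is a minimal sketch of the two schedules. epsilon_at is a hypothetical helper, not the package's get_ϵ, and the exact form of the exp decay is an assumption.

```julia
# Hypothetical helper; the formulas are a sketch, not the package's implementation.
function epsilon_at(kind::Symbol, step::Int;
                    ϵ_init = 1.0, ϵ_stable = 0.1, warmup_steps = 0, decay_steps = 0)
    step <= warmup_steps && return ϵ_init      # warmup: hold the initial value
    n = step - warmup_steps                    # number of decay steps taken so far
    if kind === :linear
        n >= decay_steps && return ϵ_stable
        return ϵ_init + (ϵ_stable - ϵ_init) * n / decay_steps
    elseif kind === :exp
        decay_steps == 0 && return ϵ_stable
        # Assumed form: exponential approach to ϵ_stable with time constant decay_steps.
        return ϵ_stable + (ϵ_init - ϵ_stable) * exp(-n / decay_steps)
    else
        throw(ArgumentError("unsupported kind: $kind"))
    end
end

kw = (ϵ_init = 0.9, ϵ_stable = 0.1, warmup_steps = 100, decay_steps = 100)
linear = [epsilon_at(:linear, i; kw...) for i in 1:500]
expo = [epsilon_at(:exp, i; kw...) for i in 1:500]
println((linear[150], expo[200]))
```

Under this assumed exp form, epsilon only approaches ϵ_stable asymptotically, whereas the linear schedule reaches it exactly at warmup_steps + decay_steps.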


s = EpsilonGreedyExplorer{:linear}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="linear epsilon")

s = EpsilonGreedyExplorer{:exp}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="exp epsilon")

(s::EpsilonGreedyExplorer)(values; step) where T

If multiple values share the same maximum, a random one among them will be returned!

NaN will be filtered unless all the values are NaN. In that case, a random one will be returned.
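
The greedy part of this rule (NaN filtering plus random tie-breaking) can be sketched as follows. greedy_index is a hypothetical helper written only for illustration; it ignores the epsilon-driven exploration step.

```julia
using Random

# Hypothetical helper illustrating the selection rule described above.
function greedy_index(rng::AbstractRNG, vals::AbstractVector{<:Real})
    valid = findall(!isnan, vals)
    isempty(valid) && return rand(rng, eachindex(vals))   # all NaN: pick any index
    best = maximum(vals[valid])
    candidates = [i for i in valid if vals[i] == best]
    return rand(rng, candidates)                          # break ties at random
end

rng = MersenneTwister(123)
picks = [greedy_index(rng, [1.0, NaN, 3.0, 3.0]) for _ in 1:100]
println(unique(picks))
```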

MultiAgentManager(player => policy...)

This is the simplest form of a multi-agent system. At each step, the agents observe the environment from their own perspective and get updated independently. For environments of SEQUENTIAL style, agents which are not the current player will observe a dummy action of NO_OP in the PreActStage.

QBasedPolicy(;learner::Q, explorer::S)

Use a Q-learner to generate estimations of action values. Then an explorer is applied to the estimations to select an action.
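
The composition can be pictured with a small stand-in. MiniQPolicy is a hypothetical type, the learner is reduced to a table lookup, and argmax plays the role of a purely greedy explorer; none of this is the package's own implementation.

```julia
# Hypothetical stand-in for a learner + explorer pair.
struct MiniQPolicy{L,E}
    learner::L    # state -> vector of estimated action values
    explorer::E   # vector of action values -> selected action
end
(p::MiniQPolicy)(state) = p.explorer(p.learner(state))

Q = [0.0 1.0; 2.0 0.5]                       # rows: actions, columns: states
policy = MiniQPolicy(s -> Q[:, s], argmax)   # greedy explorer for illustration
println((policy(1), policy(2)))
```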

RandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.


You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.

ResizeImage(img::Array{T, N})
ResizeImage(dims::Int...) -> ResizeImage(Float32, dims...)
ResizeImage(T::Type{<:Number}, dims::Int...)

Uses the BSpline method to resize the state field of an observation to the size of img (or dims).

StopAfterEpisode(episode; cur = 0, is_show_progress = true)

Return true after being called episode times. If is_show_progress is true, ProgressMeter will be used to show progress.



Stop training when a monitored metric has stopped improving.


fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.

  1. () -> reward(env)
  2. () -> total_reward_per_episode.reward

patience: Number of epochs with no improvement after which training will be stopped.

δ: Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than δ will count as no improvement.

Return true after the monitored metric has stopped improving.
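
One plausible reading of the patience and δ semantics is sketched below. EarlyStop is a hypothetical type, and counting consecutive non-improving calls against the best value seen so far is an assumption; the package may track improvement differently.

```julia
# Hypothetical early-stopping condition: stop once `patience` consecutive
# calls fail to beat the best value seen so far by more than δ.
mutable struct EarlyStop{F}
    fn::F
    patience::Int
    δ::Float64
    best::Float64
    stale::Int
end
EarlyStop(fn, patience; δ = 0.0) = EarlyStop(fn, patience, Float64(δ), -Inf, 0)

function (s::EarlyStop)()
    v = s.fn()
    if v > s.best + s.δ
        s.best = v       # improvement: remember it and reset the counter
        s.stale = 0
    else
        s.stale += 1     # no improvement this time
    end
    return s.stale >= s.patience
end

# Toy metric that improves once and then plateaus.
scores = [1.0, 2.0, 2.0, 2.0, 2.0]
cursor = Ref(0)
next_score() = (cursor[] += 1; scores[cursor[]])

stop = EarlyStop(next_score, 3)
results = [stop() for _ in 1:5]
println(results)
```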


Create a stop signal initialized with a value of false. You can manually set it to true by s[] = true to stop the running loop at any time.


Efficiently sample and update weights. For more details, see the post here. Here we use a vector to represent the binary tree. Suppose we will have capacity leaves at most. Every time we push! a new node into the tree, only the most recent capacity nodes and their sums will be updated!

[------------ Parent nodes ------------][---- leaves ----]
[size: 2^ceil(Int, log2(capacity)) - 1 ][ size: capacity ]
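
The idea of storing a binary tree of partial sums in a vector can be sketched with a simplified variant. MiniSumTree, set_priority!, total and find_leaf are hypothetical names; this version assumes a power-of-two number of leaves and omits the circular overwrite of old entries described above.

```julia
# Simplified sum tree: tree[1] is the root, the children of node k are 2k and 2k + 1,
# and the n leaves live at indices n:(2n - 1).
struct MiniSumTree
    n::Int
    tree::Vector{Float64}
end
function MiniSumTree(n::Int)
    ispow2(n) || throw(ArgumentError("n must be a power of two"))
    return MiniSumTree(n, zeros(2n - 1))
end

# Set the priority of leaf i and refresh the sums on the path to the root.
function set_priority!(t::MiniSumTree, i::Int, p::Real)
    k = t.n + i - 1
    t.tree[k] = p
    while k > 1
        k >>= 1
        t.tree[k] = t.tree[2k] + t.tree[2k + 1]
    end
    return t
end

total(t::MiniSumTree) = t.tree[1]

# Walk down from the root to the leaf whose cumulative range contains `mass`.
function find_leaf(t::MiniSumTree, mass::Real)
    k = 1
    while k < t.n
        left = 2k
        if mass <= t.tree[left]
            k = left
        else
            mass -= t.tree[left]
            k = left + 1
        end
    end
    return (k - t.n + 1, t.tree[k])
end

t = MiniSumTree(4)
for (i, p) in enumerate([1.0, 2.0, 3.0, 4.0])
    set_priority!(t, i, p)
end
println(find_leaf(t, 6.5))
```

Drawing a sample proportional to the priorities then amounts to find_leaf(t, rand() * total(t)); both the update and the lookup touch only one root-to-leaf path.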


julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
       push!(t, i)
       end
julia> t
8-element SumTree:
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6  …  1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0  …  9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
  7 => 14991
  4 => 12019
  2 => 10003
  3 => 11027
  5 => 12971
  8 => 16052
  6 => 13952
  1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
  9.0  => 8985
  13.0 => 12971
  10.0 => 10003
  14.0 => 13952
  16.0 => 16052
  11.0 => 11027
  15.0 => 14991
  12.0 => 12019

TabularApproximator(table<:AbstractArray, opt)

For a 1-d table, it serves as a state value approximator. For a 2-d table, it serves as a state-action value approximator.


For a 2-d table, the first dimension is action and the second dimension is state.
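
The two layouts can be illustrated with plain arrays. The variable names are hypothetical and no approximator object is constructed here; the point is only the (action, state) index order of the 2-d table.

```julia
n_states, n_actions = 3, 2

V = zeros(n_states)              # 1-d table: V[s] is the value of state s
Q = zeros(n_actions, n_states)   # 2-d table: Q[a, s], action first, state second

Q[2, 3] = 1.5                    # value of taking action 2 in state 3
action_values = Q[:, 3]          # all action values for state 3 (one column)
best_action = argmax(action_values)
println((action_values, best_action))
```

With action as the first dimension, the action values of one state are contiguous in Julia's column-major memory layout.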

TimePerStep(times::CircularArrayBuffer{Float64}, t::UInt64)

Store the time cost of the latest max_steps steps in the times field.


A simple wrapper of NamedTuple. We define our own type here to avoid type piracy with NamedTuple.

UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)


  • na is the number of actions used to create an internal counter.
  • t is used to store current time step.
  • c is used to control the degree of exploration.
  • seed, set the seed of inner RNG.
  • is_training=true: when set to false (i.e. not in training mode), the time step and counter will not be updated.
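
How c, ϵ, the step and the per-action counter might interact is sketched below. ucb_select is a hypothetical helper, and the score is the common UCB1-style bonus, assumed here rather than taken from the package.

```julia
# Hypothetical helper: value estimate plus an exploration bonus that shrinks
# as an action is tried more often.
function ucb_select(estimates, counts, step; c = 2.0)
    scores = estimates .+ c .* sqrt.(log(step + 1) ./ counts)
    return argmax(scores)
end

na = 3
counts = fill(1e-10, na)        # ϵ keeps the division finite for untried actions
estimates = [0.5, 0.4, 0.3]

picks = Int[]
for step in 1:3
    a = ucb_select(estimates, counts, step)
    push!(picks, a)
    counts[a] += 1              # in training mode the counter is updated
end
println(picks)
```

Because untried actions carry a very large bonus, each action is tried once before the value estimates start to dominate; a larger c keeps the bonus influential for longer.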

VBasedPolicy(;learner, mapping=default_value_action_mapping)

The learner must be a value learner. The mapping is a function which returns an action given env and the learner. By default, we iterate through all the valid actions and select the one that leads to the maximum state value.


When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.


Detect the suitable running device for the model. Return Val(:cpu) by default.

prob(p::AbstractExplorer, x, mask)

Similar to prob(p::AbstractExplorer, x), but here only the masked elements are considered.

prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical

Return the probability of selecting each action given the estimated values of each action.
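
A sketch of the resulting distribution, with and without a mask, is given below. epsilon_greedy_probs is a hypothetical helper that returns a plain vector instead of a Categorical, and it assumes that the exploration mass is spread uniformly over the legal actions and that ties go to the first maximum.

```julia
# Hypothetical helper returning a plain probability vector.
function epsilon_greedy_probs(vals, ϵ; mask = trues(length(vals)))
    legal = findall(mask)
    probs = zeros(length(vals))
    probs[legal] .= ϵ / length(legal)       # exploration mass, uniform over legal actions
    best = legal[argmax(vals[legal])]       # greedy action among the legal ones
    probs[best] += 1 - ϵ
    return probs
end

p = epsilon_greedy_probs([1.0, 3.0, 2.0], 0.1)
pm = epsilon_greedy_probs([1.0, 3.0, 2.0], 0.1; mask = [true, false, true])
println(p)
println(pm)
```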

consecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)

By default, it behaves the same as select_last_dim(x, inds). If n_stack is set to an int, then for each frame specified by inds, the previous n_stack frames (including the current one) are concatenated as a new dimension. If n_horizon is set to an int, then for each frame specified by inds, the next n_horizon frames (including the current one) are concatenated as a new dimension.


julia> x = collect(1:5)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> consecutive_view(x, [2,4])  # the same as `select_last_dim(x, [2,4])`
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
 2
 4

julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
 1  3
 2  4

julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
 2  4
 3  5

julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2)  # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]

[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
 1  2
 2  3

[:, :, 2] =
 3  4
 4  5

See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.

discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the gain starting from the current step with a discount rate of γ. rewards can be a matrix.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specifies whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, then its size must be the same as that of rewards.
  • init=nothing, init can be used to provide the reward estimation of the last state.
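
For the vector case, the computation can be sketched as a backward pass over the rewards. discounted_returns is a hypothetical helper; it assumes that a terminal flag at step t cuts off everything after t and that init bootstraps the value beyond the last reward.

```julia
# Hypothetical helper: G[t] = rewards[t] + γ * G[t + 1], computed backwards.
function discounted_returns(rewards::AbstractVector, γ; terminal = nothing, init = nothing)
    G = zeros(float(eltype(rewards)), length(rewards))
    next = init === nothing ? 0.0 : float(init)   # estimate beyond the last reward
    for t in length(rewards):-1:1
        if terminal !== nothing && terminal[t]
            next = 0.0                            # nothing follows a terminal step
        end
        next = rewards[t] + γ * next
        G[t] = next
    end
    return G
end

println(discounted_returns([1, 1, 1], 0.5))
println(discounted_returns([1, 1, 1], 0.5; init = 2))
println(discounted_returns([1, 1, 1], 0.5; terminal = [false, true, false]))
```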



Merge the last two dimensions.


julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

[:, :, 3] =
  9  11
 10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12

generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate starting from the current step with a discount rate of γ and a GAE-Lambda parameter of λ. rewards and values can be matrices.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specifies whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, then its size must be the same as that of rewards.
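
For the vector case, the estimate can be sketched with the usual backward recursion. gae is a hypothetical helper, and the convention that values holds one more entry than rewards (the estimate of the state after the last step) is an assumption of this sketch.

```julia
# Hypothetical helper using the standard GAE recursion:
#   δ[t] = r[t] + γ * V[t + 1] - V[t]
#   A[t] = δ[t] + γ * λ * A[t + 1]
function gae(rewards::AbstractVector, values::AbstractVector, γ, λ; terminal = nothing)
    length(values) == length(rewards) + 1 ||
        throw(ArgumentError("values needs one more entry than rewards"))
    A = zeros(length(rewards))
    running = 0.0
    for t in length(rewards):-1:1
        # A terminal step cuts off both the bootstrap value and the running sum.
        not_done = (terminal !== nothing && terminal[t]) ? 0.0 : 1.0
        δ = rewards[t] + γ * not_done * values[t + 1] - values[t]
        running = δ + γ * λ * not_done * running
        A[t] = running
    end
    return A
end

println(gae([1.0, 1.0], [0.0, 0.0, 0.0], 1.0, 1.0))
println(gae([1.0, 1.0], [0.0, 0.0, 0.0], 1.0, 0.0))
```

With λ = 0 the estimate reduces to the one-step TD error, and with λ = 1 it becomes the discounted return minus the value baseline.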



A GPU-compatible, automatically differentiable version of the logpdf function of normal distributions. An epsilon value is added to guarantee numeric stability if sigma is exactly zero (e.g. if relu is used in the output layer).
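
The formula behind such a function can be sketched in scalar form. stable_normlogpdf is a hypothetical name and the default ϵ is an assumption; the sketch runs on the CPU and only illustrates where the epsilon enters.

```julia
# Log-density of Normal(μ, σ) at x, with ϵ added to σ so that σ == 0 stays finite.
function stable_normlogpdf(μ, σ, x; ϵ = 1f-8)
    z = (x - μ) / (σ + ϵ)
    return -(z^2 + log(2π)) / 2 - log(σ + ϵ)
end

println(stable_normlogpdf(0.0, 1.0, 0.0))
println(isfinite(stable_normlogpdf(0.0, 0.0, 0.0)))
```

Because the function uses only elementary operations, it can be applied elementwise with broadcasting, e.g. stable_normlogpdf.(μs, σs, xs) for arrays μs, σs and xs of matching shape.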

sample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])

Here we return a copy instead of a view:

  1. Each sample is independent of the original trajectory, so the trajectory can be updated asynchronously.
  2. Copying is not always that expensive.
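
The difference between the two can be seen with plain arrays. This sketch uses only Base indexing and view, with hypothetical variable names; it does not call the package's sample.

```julia
buffer = collect(1:10)
inds = [2, 4, 6]

batch_copy = buffer[inds]         # indexing with a vector allocates a new array
batch_view = view(buffer, inds)   # a view shares memory with `buffer`

buffer[2] = 100                   # simulate a later update of the trajectory

println(batch_copy)
println(collect(batch_view))
```

The copied batch keeps the values it was sampled with, while the view reflects the later write.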