ReinforcementLearningCore.jl

ReinforcementLearningCore.AbstractApproximatorType
(app::AbstractApproximator)(env)

An approximator is a functional object for value estimation. It serves as a black box that abstracts over different kinds of approximation methods (for example, a DNN provided by Flux or Knet).

ReinforcementLearningCore.AbstractHookType

A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can override the behavior by implementing the following methods:

• (hook::YourHook)(::PreActStage, agent, env, action), note that there's an extra argument of action.
• (hook::YourHook)(::PostActStage, agent, env)
• (hook::YourHook)(::PreEpisodeStage, agent, env)
• (hook::YourHook)(::PostEpisodeStage, agent, env)
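
The dispatch pattern behind these methods can be sketched without loading the package. In the sketch below, the stage types and AbstractHook are local stand-ins defined only for illustration (they are assumptions, not the package's definitions), and StepCounter is a hypothetical hook that counts steps and episodes.

```julia
# Stand-in definitions so the sketch runs on its own (assumed, not the package's code).
abstract type AbstractStage end
struct PreActStage <: AbstractStage end
struct PostActStage <: AbstractStage end
struct PreEpisodeStage <: AbstractStage end
struct PostEpisodeStage <: AbstractStage end

abstract type AbstractHook end
# Default behavior: a hook does nothing for any stage (needs Julia 1.3 or later).
(hook::AbstractHook)(args...) = nothing

# Hypothetical hook that overrides two of the stages.
mutable struct StepCounter <: AbstractHook
    steps::Int
    episodes::Int
end
StepCounter() = StepCounter(0, 0)

(h::StepCounter)(::PostActStage, agent, env) = (h.steps += 1; nothing)
(h::StepCounter)(::PostEpisodeStage, agent, env) = (h.episodes += 1; nothing)

# Simulate one episode of three steps; `nothing` stands in for agent and env.
hook = StepCounter()
for _ in 1:3
    hook(PreActStage(), nothing, nothing, 1)   # falls back to the no-op default
    hook(PostActStage(), nothing, nothing)
end
hook(PostEpisodeStage(), nothing, nothing)
```

Only the stages a hook cares about need a method; every other call falls through to the no-op default.
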
ReinforcementLearningCore.AbstractTrajectoryType
AbstractTrajectory

A trajectory is used to record some useful information during the interactions between agents and environments. It behaves similarly to a NamedTuple, except that we extend it with some optional methods.

Required Methods:

• Base.getindex
• Base.keys

Optional Methods:

• Base.length
• Base.isempty
• Base.empty!
• Base.haskey
• Base.push!
• Base.pop!
ReinforcementLearningCore.ActorCriticType
ActorCritic(;actor, critic, optimizer=ADAM())

The actor part must return logits (Do not use softmax in the last layer!), and the critic part must return a state value.

ReinforcementLearningCore.AgentMethod

Here we extend the definition of (p::AbstractPolicy)(::AbstractEnv) in RLBase to accept an AbstractStage as the first argument. Algorithm designers may customize these behaviors respectively by implementing:

• (p::YourPolicy)(::AbstractStage, ::AbstractEnv)
• (p::YourPolicy)(::PreActStage, ::AbstractEnv, action)

The default behaviors for Agent are:

1. Update the inner trajectory given the context of policy, env, and stage.

   1. By default, we do nothing.
   2. In the PreActStage, we push! the current state and the action into the trajectory.
   3. In the PostActStage, we query the reward and is_terminated info from env and push them into the trajectory.
   4. In the PostEpisodeStage, we push the state at the end of an episode and a dummy action into the trajectory.
   5. In the PreEpisodeStage, we pop the latest state and action pair (which are dummy ones) out of the trajectory.

2. Update the inner policy given the context of trajectory, env, and stage.

   1. By default, we only update! the policy in the PreActStage, and the call is dispatched to update!(policy, trajectory).

ReinforcementLearningCore.EpsilonGreedyExplorerType
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: The best lever is selected for a proportion 1 - epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. - Multi-armed_bandit

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. - Multi-armed_bandit

Keywords

• T::Symbol: defines how epsilon decays after the warmup steps. Supported values are linear and exp.
• step::Int = 1: record the current step.
• ϵ_init::Float64 = 1.0: initial epsilon.
• warmup_steps::Int=0: the number of steps to use ϵ_init.
• decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
• ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
• is_break_tie=false: randomly select an action of the same maximum values if set to true.
• rng=Random.GLOBAL_RNG: set the internal RNG.
• is_training=true: when set to false (evaluation mode), step will not be updated and ϵ will be set to 0.

Example

s = EpsilonGreedyExplorer{:linear}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="linear epsilon")

s = EpsilonGreedyExplorer{:exp}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="exp epsilon")
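
To make the roles of warmup_steps and decay_steps concrete, here is a minimal stand-alone sketch of the two schedules. The function name epsilon_at and the exact formulas are assumptions for illustration; the package's own get_ϵ may differ in detail.

```julia
# Hypothetical helper: epsilon at a given step for the :linear and :exp schedules.
function epsilon_at(kind::Symbol, step::Integer; ϵ_init = 1.0, ϵ_stable = 0.1,
                    warmup_steps = 0, decay_steps = 0)
    # During warmup, epsilon stays at its initial value.
    step <= warmup_steps && return ϵ_init
    n = step - warmup_steps              # steps elapsed since the warmup ended
    if kind === :linear
        # Straight line from ϵ_init to ϵ_stable over decay_steps, then constant.
        n >= decay_steps && return ϵ_stable
        return ϵ_init + (ϵ_stable - ϵ_init) * n / decay_steps
    elseif kind === :exp
        # Exponential approach to ϵ_stable with time constant decay_steps (assumed form).
        decay_steps == 0 && return ϵ_stable
        return ϵ_stable + (ϵ_init - ϵ_stable) * exp(-n / decay_steps)
    else
        throw(ArgumentError("unsupported schedule: $kind"))
    end
end

linear_curve = [epsilon_at(:linear, i; ϵ_init = 0.9, ϵ_stable = 0.1,
                           warmup_steps = 100, decay_steps = 100) for i in 1:500]
```

Under these assumptions the linear schedule reaches ϵ_stable exactly at warmup_steps + decay_steps, while the exponential one only approaches it.
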

ReinforcementLearningCore.EpsilonGreedyExplorerMethod
(s::EpsilonGreedyExplorer)(values; step) where T
Note

If multiple values share the same maximum, a random one among them will be returned!

NaN will be filtered unless all the values are NaN. In that case, a random one will be returned.
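
The selection rule described in this note can be sketched in plain Julia as follows. The function epsilon_greedy is a hypothetical stand-in, not the package's method, and it assumes that exploration draws uniformly from the non-NaN entries.

```julia
using Random

# Hypothetical stand-alone version of the selection rule described above.
function epsilon_greedy(rng::AbstractRNG, values::AbstractVector{<:Real}, ϵ::Real)
    valid = findall(!isnan, values)          # indices of the non-NaN entries
    # If every value is NaN, fall back to a uniformly random index.
    isempty(valid) && return rand(rng, eachindex(values))
    # Explore with probability ϵ (assumption: uniformly over the valid entries).
    rand(rng) < ϵ && return rand(rng, valid)
    # Exploit: gather every index that attains the maximum and break ties randomly.
    best = maximum(values[valid])
    ties = [i for i in valid if values[i] == best]
    return rand(rng, ties)
end

rng = MersenneTwister(123)
greedy_pick = epsilon_greedy(rng, [1.0, NaN, 3.0], 0.0)
tie_pick = epsilon_greedy(rng, [2.0, 2.0, 1.0], 0.0)
```
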

ReinforcementLearningCore.MultiAgentManagerMethod
MultiAgentManager(player => policy...)

This is the simplest form of multiagent system. At each step they observe the environment from their own perspective and get updated independently. For environments of SEQUENTIAL style, agents which are not the current player will observe a dummy action of NO_OP in the PreActStage.

ReinforcementLearningCore.QBasedPolicyType
QBasedPolicy(;learner::Q, explorer::S)

Use a Q-learner to generate estimations of action values. Then an explorer is applied on the estimations to select an action.

ReinforcementLearningCore.RandomPolicyType
RandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.

Note

You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.

ReinforcementLearningCore.ResizeImageType
ResizeImage(img::Array{T, N})
ResizeImage(dims::Int...) -> ResizeImage(Float32, dims...)
ResizeImage(T::Type{<:Number}, dims::Int...)

Use the BSpline method to resize the state field of an observation to the size of img (or dims).

ReinforcementLearningCore.StopAfterEpisodeType
StopAfterEpisode(episode; cur = 0, is_show_progress = true)

Return true after being called episode times. If is_show_progress is true, ProgressMeter will be used to show progress.

ReinforcementLearningCore.StopAfterNoImprovementType

StopAfterNoImprovement()

Stop training when a monitored metric has stopped improving.

Parameters:

fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.

1. () -> reward(env)
2. () -> total_reward_per_episode.reward

patience: the number of epochs with no improvement after which training will be stopped.

δ: the minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than δ will count as no improvement.

Return true after the monitored metric has stopped improving.
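
The interplay of fn, patience and δ can be illustrated with a small stand-alone sketch. PatienceStop is a hypothetical type written for this example, and its exact rule (a score counts as an improvement only if it exceeds the best score so far by more than δ) is an assumption rather than the package's confirmed behavior.

```julia
# Hypothetical stop condition illustrating the patience logic.
mutable struct PatienceStop{F}
    fn::F            # closure returning the current score (higher is better)
    patience::Int    # number of non-improving calls tolerated
    δ::Float64       # minimum change that counts as an improvement
    best::Float64    # best score seen so far
    stale::Int       # consecutive calls without improvement
end
PatienceStop(fn; patience, δ = 0.0) = PatienceStop(fn, patience, float(δ), -Inf, 0)

function (s::PatienceStop)()
    score = s.fn()
    if score > s.best + s.δ
        s.best = score
        s.stale = 0
    else
        s.stale += 1
    end
    return s.stale >= s.patience   # true means "stop now"
end

# Feed a fixed sequence of scores through the condition.
scores = [1.0, 2.0, 2.0, 2.0]
cursor = Ref(0)
stop = PatienceStop(() -> (cursor[] += 1; scores[cursor[]]); patience = 2)
decisions = [stop() for _ in 1:4]
```
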

ReinforcementLearningCore.StopSignalType
StopSignal()

Create a stop signal initialized with a value of false. You can manually set it to true by s[] = true to stop the running loop at any time.

ReinforcementLearningCore.SumTreeType
SumTree(capacity::Int)

Efficiently sample and update weights. For more details, see the post here. Here we use a vector to represent the binary tree. Suppose we will have capacity leaves at most. Every time we push! a new node into the tree, only the most recent capacity nodes and their sums are updated!

[––––––Parent nodes––––––][––––leaves––––]
[size: 2^ceil(Int, log2(capacity))-1][size: capacity]

Example

julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
push!(t, i)
end
julia> t
8-element SumTree:
9.0
10.0
11.0
12.0
13.0
14.0
15.0
16.0
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6  …  1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0  …  9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
7 => 14991
4 => 12019
2 => 10003
3 => 11027
5 => 12971
8 => 16052
6 => 13952
1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
9.0  => 8985
13.0 => 12971
10.0 => 10003
14.0 => 13952
16.0 => 16052
11.0 => 11027
15.0 => 14991
12.0 => 12019
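
The counts above are roughly proportional to the weights, which is what a sum tree is for. As a rough illustration of the idea (not the package's implementation, which also handles overwriting the oldest entries), the sketch below stores a binary tree in a flat vector and finds a leaf by descending from the root. SimpleSumTree, set_weight! and find_leaf are hypothetical names.

```julia
# Hypothetical minimal sum tree in a flat vector (1-based heap layout).
struct SimpleSumTree
    nleaves::Int            # number of leaves, rounded up to a power of two
    tree::Vector{Float64}   # tree[1] is the root; leaves start at index nleaves
end

function SimpleSumTree(capacity::Int)
    n = nextpow(2, capacity)
    return SimpleSumTree(n, zeros(2n - 1))
end

# Set the weight of a leaf and propagate the difference up to the root.
function set_weight!(t::SimpleSumTree, leaf::Int, w::Real)
    i = t.nleaves + leaf - 1
    Δ = w - t.tree[i]
    while i >= 1
        t.tree[i] += Δ
        i ÷= 2
    end
    return t
end

total(t::SimpleSumTree) = t.tree[1]

# Find the leaf whose cumulative weight interval contains `mass`.
function find_leaf(t::SimpleSumTree, mass::Real)
    i = 1
    while i < t.nleaves
        left = 2i
        if mass <= t.tree[left]
            i = left                 # target lies in the left subtree
        else
            mass -= t.tree[left]     # skip the left subtree's total weight
            i = left + 1
        end
    end
    return i - t.nleaves + 1, t.tree[i]
end

t = SimpleSumTree(4)
for (i, w) in enumerate([1.0, 2.0, 3.0, 4.0])
    set_weight!(t, i, w)
end
# Weighted sampling: draw a point in [0, total) and look up its leaf.
sampled_leaf, sampled_weight = find_leaf(t, rand() * total(t))
```

Both updating a weight and sampling take time logarithmic in the number of leaves, since each only walks one root-to-leaf path.
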
ReinforcementLearningCore.TabularApproximatorType
TabularApproximator(table<:AbstractArray, opt)

For a 1-d table, it will serve as a state value approximator. For a 2-d table, it will serve as a state-action value approximator.

Warning

For a 2-d table, the first dimension is action and the second dimension is state.
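
A plain-array sketch of this layout (no package types involved; the sizes and variable names are made up for illustration):

```julia
n_actions, n_states = 3, 5
Q = zeros(n_actions, n_states)    # rows are actions, columns are states
Q[2, 4] = 1.0                     # value of action 2 in state 4
action_values = @view Q[:, 4]     # all action values for state 4
best_action = argmax(action_values)
```

Because Julia arrays are column-major, putting the action first keeps the action values of a single state contiguous in memory.
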

ReinforcementLearningCore.TimePerStepType
TimePerStep(;max_steps=100)
TimePerStep(times::CircularArrayBuffer{Float64}, t::UInt64)

Store the time cost of the latest max_steps steps in the times field.

ReinforcementLearningCore.TrajectoryType
Trajectory(;[trace_name=trace_container]...)

A simple wrapper of NamedTuple. We define our own type here to avoid type piracy with NamedTuple.

ReinforcementLearningCore.UCBExplorerMethod
UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)

Arguments

• na is the number of actions used to create an internal counter.
• step is used to store the current time step.
• c is used to control the degree of exploration.
• seed, set the seed of the inner RNG.
• is_training=true: when set to false (evaluation mode), the time step and counter will not be updated.
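
For orientation, a common form of the UCB rule is sketched below: each action's estimate gets a bonus that shrinks as the action is tried more often. The helper names, the exact bonus formula, and the use of argmax instead of random tie-breaking are assumptions for illustration, not the package's confirmed implementation. The counter starts at a small ϵ rather than zero so that the division is well defined.

```julia
# Hypothetical helper: UCB score of every action at a given time step.
ucb_scores(q, counts, step; c = 2.0) = q .+ c .* sqrt.(log(step + 1) ./ counts)

# Pick the highest-scoring action and record that it was taken.
function ucb_select!(counts, q, step; c = 2.0)
    a = argmax(ucb_scores(q, counts, step; c = c))
    counts[a] += 1
    return a
end

q_est = [0.1, 0.5, 0.2]          # current value estimates
n_taken = fill(1e-10, 3)         # near-zero counts force every action to be tried once
picks = [ucb_select!(n_taken, q_est, t) for t in 1:3]
```

A larger c keeps the bonus large for longer and therefore explores more.

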
ReinforcementLearningCore.VBasedPolicyType
VBasedPolicy(;learner, mapping=default_value_action_mapping)

The learner must be a value learner. The mapping is a function which returns an action given env and the learner. By default, we iterate through all the valid actions and select the one that leads to the maximum state value.

Base.push!Method

When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.

CUDA.deviceMethod
device(model)

Detect the suitable running device for the model. Return Val(:cpu) by default.

ReinforcementLearningBase.probMethod
prob(p::AbstractExplorer, x, mask)

Similar to prob(p::AbstractExplorer, x), but here only the masked elements are considered.

ReinforcementLearningBase.probMethod
prob(s::EpsilonGreedyExplorer, values) ->Categorical
prob(s::EpsilonGreedyExplorer, values, mask) ->Categorical

Return the probability of selecting each action given the estimated values of each action.

ReinforcementLearningCore.consecutive_viewMethod
consecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)

By default, it behaves the same as select_last_dim(x, inds). If n_stack is set to an int, then for each frame specified by inds, the previous n_stack frames (including the current one) are concatenated as a new dimension. If n_horizon is set to an int, then for each frame specified by inds, the next n_horizon frames (including the current one) are concatenated as a new dimension.

Example

julia> x = collect(1:5)
5-element Array{Int64,1}:
1
2
3
4
5

julia> consecutive_view(x, [2,4])  # just the same with select_last_dim(x, [2,4])
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
2
4

julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
1  3
2  4

julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
2  4
3  5

julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2)  # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]

[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
1  2
2  3

[:, :, 2] =
3  4
4  5

See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.

ReinforcementLearningCore.discount_rewardsMethod
discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the gain starting from the current step with a discount rate of γ. rewards can be a matrix.

Keyword arguments

• dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
• terminal=nothing, specifies whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as that of rewards.
• init=nothing, init can be used to provide the reward estimation of the last state.

Example
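
A minimal sketch of the underlying recursion G[t] = r[t] + γ * G[t+1] for a single reward vector, written without the package. The function discounted_returns is a hypothetical stand-in; treating terminal[t] as "nothing after step t contributes" and init as the estimate that follows the last reward are assumptions based on the keyword descriptions above.

```julia
# Hypothetical vector-only version of the discounted return computation.
function discounted_returns(rewards::AbstractVector{<:Real}, γ::Real;
                            terminal = nothing, init = nothing)
    G = zeros(Float64, length(rewards))
    # Value that follows the last reward (0 when no estimate is given).
    acc = init === nothing ? 0.0 : float(init)
    for t in length(rewards):-1:1
        # After a terminal step, nothing from the future is carried back.
        done = terminal !== nothing && terminal[t]
        acc = rewards[t] + γ * (done ? 0.0 : acc)
        G[t] = acc
    end
    return G
end

plain = discounted_returns([1.0, 1.0, 1.0], 0.5)
with_terminal = discounted_returns([1.0, 1.0, 1.0], 0.5; terminal = [false, true, false])
bootstrapped = discounted_returns([1.0, 1.0, 1.0], 0.5; init = 2.0)
```
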

ReinforcementLearningCore.flatten_batchMethod
flatten_batch(x::AbstractArray)

Merge the last two dimensions.

Example

julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
1  3
2  4

[:, :, 2] =
5  7
6  8

[:, :, 3] =
9  11
10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
1  3  5  7   9  11
2  4  6  8  10  12
ReinforcementLearningCore.generalized_advantage_estimationMethod
generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate starting from the current step with a discount rate of γ and a GAE-Lambda parameter of λ. rewards and values can be matrices.

Keyword arguments

• dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
• terminal=nothing, specifies whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as that of rewards.

Example
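
A minimal sketch of the standard GAE recursion for a single trajectory, written without the package: δ[t] = r[t] + γ * V[t+1] - V[t] and A[t] = δ[t] + γ * λ * A[t+1]. The function gae_sketch is a hypothetical stand-in, and it assumes that values holds one more entry than rewards (the value of the state after the last step); the package's own argument conventions may differ.

```julia
# Hypothetical vector-only version of generalized advantage estimation.
function gae_sketch(rewards::AbstractVector{<:Real}, values::AbstractVector{<:Real},
                    γ::Real, λ::Real)
    T = length(rewards)
    # Assumption: values[T + 1] is the bootstrap value of the final state.
    length(values) == T + 1 || throw(ArgumentError("values must have length(rewards) + 1 entries"))
    A = zeros(Float64, T)
    acc = 0.0
    for t in T:-1:1
        δ = rewards[t] + γ * values[t + 1] - values[t]   # one-step TD error
        acc = δ + γ * λ * acc                            # discounted sum of TD errors
        A[t] = acc
    end
    return A
end

advantages = gae_sketch([1.0, 1.0], [0.0, 0.0, 0.0], 0.5, 1.0)
td_only = gae_sketch([1.0, 1.0], [0.0, 0.0, 0.0], 0.5, 0.0)
```

With λ = 0 the estimate reduces to the one-step TD error, and with λ = 1 it becomes the discounted return minus the value baseline.
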

ReinforcementLearningCore.normlogpdfMethod

A GPU-compatible, automatically differentiable version of the logpdf function of normal distributions. An epsilon value is added to guarantee numeric stability if sigma is exactly zero (e.g. if relu is used in the output layer).
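
As a rough illustration of the idea, the log density can be written with broadcasting only, so that it applies elementwise to arrays as well as scalars. The name normlogpdf_sketch, the default ϵ, and the choice to add ϵ to σ in both places are assumptions, not the package's confirmed code.

```julia
# Hypothetical broadcast-only log density of a normal distribution.
function normlogpdf_sketch(μ, σ, x; ϵ = 1e-8)
    z = (x .- μ) ./ (σ .+ ϵ)                           # standardized residual, safe when σ == 0
    return -(z .^ 2 .+ log(2π)) ./ 2 .- log.(σ .+ ϵ)
end

at_mean = normlogpdf_sketch(0.0, 1.0, 0.0)
zero_sigma = normlogpdf_sketch(0.0, 0.0, 0.0)
batched = normlogpdf_sketch([0.0, 1.0], [1.0, 1.0], [0.0, 1.0])
```
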

StatsBase.sampleMethod
sample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])
Note

Here we return a copy instead of a view:

1. Each sample is independent of the original trajectory so that the trajectory can be updated asynchronously.
2. Copy is not always so bad.