ReinforcementLearningCore.jl

ReinforcementLearningCore.AbstractApproximatorType
(app::AbstractApproximator)(env)

An approximator is a functional object for value estimation. It serves as a black box that provides an abstraction over different kinds of approximation methods (for example, a deep neural network built with Flux or Knet).
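
For illustration, here is a minimal sketch of a custom approximator wrapping a Flux model (the type name and layer sizes are hypothetical):

using Flux

# A hypothetical Q-value approximator: 4-dimensional state in, 2 action values out.
struct MyQApproximator <: AbstractApproximator
    model::Chain
end

# Calling the approximator returns the estimated values.
(app::MyQApproximator)(s) = app.model(s)

app = MyQApproximator(Chain(Dense(4, 32, relu), Dense(32, 2)))
app(rand(Float32, 4))  # 2 estimated action values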

ReinforcementLearningCore.AbstractHookType

A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can override the behavior by implementing the following methods (a minimal sketch follows the list):

  • (hook::YourHook)(::PreActStage, agent, env, action), note that there's an extra argument of action.
  • (hook::YourHook)(::PostActStage, agent, env)
  • (hook::YourHook)(::PreEpisodeStage, agent, env)
  • (hook::YourHook)(::PostEpisodeStage, agent, env)
  • (hook::YourHook)(::PostExperimentStage, agent, env)
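
For example, a custom hook that records the reward observed after every step (the type and field names are hypothetical):

Base.@kwdef struct StepRewardHook <: AbstractHook
    rewards::Vector{Float64} = Float64[]
end

# Record the reward observed after each action.
(h::StepRewardHook)(::PostActStage, agent, env) = push!(h.rewards, reward(env))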
ReinforcementLearningCore.AbstractTrajectoryType
AbstractTrajectory

A trajectory is used to record useful information during the interactions between agents and environments. It behaves similarly to a NamedTuple, except that it is extended with some optional methods. (A minimal custom trajectory sketch follows the method lists below.)

Required Methods:

  • Base.getindex
  • Base.keys

Optional Methods:

  • Base.length
  • Base.isempty
  • Base.empty!
  • Base.haskey
  • Base.push!
  • Base.pop!
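
For illustration, a minimal custom trajectory backed by a NamedTuple of vectors (the type and trace names are hypothetical):

struct VectorSATrajectory <: AbstractTrajectory
    traces::NamedTuple
end

VectorSATrajectory() = VectorSATrajectory((state = Vector{Float32}[], action = Int[]))

# The two required methods simply forward to the underlying NamedTuple.
Base.getindex(t::VectorSATrajectory, k::Symbol) = t.traces[k]
Base.keys(t::VectorSATrajectory) = keys(t.traces)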
ReinforcementLearningCore.ActorCriticType
ActorCritic(;actor, critic, optimizer=ADAM())

The actor part must return logits (do not use softmax in the last layer!), and the critic part must return a state value.
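
For example, a minimal sketch for a 4-dimensional state and 2 discrete actions (layer sizes are arbitrary):

using Flux

ac = ActorCritic(
    actor = Chain(Dense(4, 32, relu), Dense(32, 2)),   # logits, no softmax
    critic = Chain(Dense(4, 32, relu), Dense(32, 1)),  # state value
)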

ReinforcementLearningCore.AgentType
Agent(;kwargs...)

A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and the policy appropriately at different stages.

Keywords & Fields

  • policy::AbstractPolicy: the policy to use.
  • trajectory::AbstractTrajectory: used to store the transitions between the agent and the environment.
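
A minimal construction sketch (the trajectory configuration is illustrative):

agent = Agent(
    policy = RandomPolicy(),
    trajectory = CircularArraySARTTrajectory(
        capacity = 1000,
        state = Vector{Float32} => (4,),
    ),
)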

ReinforcementLearningCore.AgentMethod

Here we extend the definition of (p::AbstractPolicy)(::AbstractEnv) in RLBase to accept an AbstractStage as the first argument. Algorithm designers may customize these behaviors respectively by implementing:

  • (p::YourPolicy)(::AbstractStage, ::AbstractEnv)
  • (p::YourPolicy)(::PreActStage, ::AbstractEnv, action)

The default behaviors for Agent are (see the sketch after the list):

  1. Update the inner trajectory given the context of policy, env, and stage:
       • By default we do nothing.
       • In the PreActStage, we push! the current state and the action into the trajectory.
       • In the PostActStage, we query the reward and is_terminated info from env and push them into the trajectory.
       • In the PostEpisodeStage, we push the state at the end of an episode and a dummy action into the trajectory.
       • In the PreEpisodeStage, we pop out the latest state and action pair (which are dummy ones) from the trajectory.

  2. Update the inner policy given the context of trajectory, env, and stage:
       • By default, we only update! the policy in the PreActStage, and it's dispatched to update!(policy, trajectory, env, stage).
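
For example, a sketch of a policy that adds custom behavior at the start of each episode (the policy type is hypothetical):

struct MyPolicy <: AbstractPolicy end

# How to act when queried for an action (assuming the action space supports rand).
(p::MyPolicy)(env::AbstractEnv) = rand(action_space(env))

# Extra behavior injected at the start of each episode.
(p::MyPolicy)(::PreEpisodeStage, env::AbstractEnv) = @info "starting a new episode"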

ReinforcementLearningCore.CircularArraySARTTrajectoryMethod
CircularArraySARTTrajectory(;capacity::Int, kw...)

A specialized CircularArrayTrajectory with traces of SART. Note that the capacity of the :state and :action trace is one step longer than the capacity of the :reward and :terminal trace, so that we can reuse the same trace to represent the next state and next action in a typical transition in reinforcement learning.

Keyword arguments

  • capacity::Int, the maximum number of transitions.
  • state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => (),
  • terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => (),

Example

julia> t = CircularArraySARTTrajectory(;
           capacity = 3,
           state = Vector{Int} => (4,),
           action = Int => (),
           reward = Float32 => (),
           terminal = Bool => (),
       )
Trajectory of 4 traces:
:state 4×0 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t[:state]
4×4 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}:
 2  3  4  5
 2  3  4  5
 2  3  4  5
 2  3  4  5

julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
 2
 3
 4
 5

julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
 1.0
 1.5
 2.0

julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
 1
 0
 1
ReinforcementLearningCore.CircularVectorSARTTrajectoryMethod
CircularVectorSARTTrajectory(;capacity, kw::DataType...)

A specialized CircularVectorTrajectory with traces of SART. Note that the capacities of the :state and :action traces are one step longer than those of the :reward and :terminal traces, so that we can reuse the same underlying storage to represent the next state and next action in a typical transition in reinforcement learning.

Keyword arguments

  • capacity::Int
  • state = Int,
  • action = Int,
  • reward = Float32,
  • terminal = Bool,

Example

julia> t = CircularVectorSARTTrajectory(;
           capacity = 3,
           state = Vector{Int},
           action = Int,
           reward = Float32,
           terminal = Bool,
       )
Trajectory of 4 traces:
:state 0-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t[:state]
4-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}:
 [2, 2, 2, 2]
 [3, 3, 3, 3]
 [4, 4, 4, 4]
 [5, 5, 5, 5]

julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
 2
 3
 4
 5

julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
 1.0
 1.5
 2.0

julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
 1
 0
 1
ReinforcementLearningCore.DuelingNetworkType
DuelingNetwork(;base, val, adv)

A dueling network automatically produces separate estimates of the state value and of the advantage of each action. The expected output size of val is 1, and that of adv is the size of the action space.
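
For example, a sketch for a 4-dimensional state and 2 actions (layer sizes are arbitrary):

using Flux

q = DuelingNetwork(
    base = Dense(4, 32, relu),
    val = Dense(32, 1),   # state value head, output size 1
    adv = Dense(32, 2),   # advantage head, one output per action
)
q(rand(Float32, 4, 1))   # a 2×1 matrix of action values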

ReinforcementLearningCore.ElasticSARTTrajectoryMethod
ElasticSARTTrajectory(;kw...)

A specialized ElasticArrayTrajectory with traces of SART.

Keyword arguments

  • state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (), by default it means the state is a scalar of Int.
  • action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => (),
  • terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => (),

Example

julia> t = ElasticSARTTrajectory(;
           state = Vector{Int} => (4,),
           action = Int => (),
           reward = Float32 => (),
           terminal = Bool => (),
       )
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t
Trajectory of 4 traces:
:state 4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}

julia> t[:state]
4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
 1  2  3  4  5
 1  2  3  4  5
 1  2  3  4  5
 1  2  3  4  5

julia> t[:action]
5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}:
 1
 2
 3
 4
 5

julia> t[:reward]
4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}:
 0.5
 1.0
 1.5
 2.0

julia> t[:terminal]
4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}:
 0
 1
 0
 1

julia> empty!(t)

julia> t
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}
ReinforcementLearningCore.EpsilonGreedyExplorerType
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: The best lever is selected for a proportion 1 - epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. (From the Wikipedia article on the multi-armed bandit.)

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. (From the same article.)

Keywords

  • T::Symbol: defines how epsilon decays during the decay steps. Supported values are linear and exp.
  • step::Int = 1: record the current step.
  • ϵ_init::Float64 = 1.0: initial epsilon.
  • warmup_steps::Int=0: the number of steps to use ϵ_init.
  • decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
  • ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
  • is_break_tie=false: if set to true, randomly select one of the actions that share the maximum value.
  • rng=Random.GLOBAL_RNG: set the internal RNG.
  • is_training=true: when is_training is set to false, step will not be updated and ϵ is treated as 0 (pure exploitation).

Example

s = EpsilonGreedyExplorer{:linear}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="linear epsilon")

s = EpsilonGreedyExplorer{:exp}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="exp epsilon")

ReinforcementLearningCore.EpsilonGreedyExplorerMethod
(s::EpsilonGreedyExplorer)(values; step)
Note

If multiple values share the maximum value, a random one among them will be returned!

NaN will be filtered out unless all the values are NaN; in that case, a random one will be returned.
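
A minimal usage sketch (the values are arbitrary):

s = EpsilonGreedyExplorer(0.1)
a = s([1.0, 2.0, 1.5])   # usually index 2; with probability ϵ a uniformly random index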

ReinforcementLearningCore.ExperimentType
Experiment(policy, env, stop_condition, hook, description)

These are the essential components of a typical reinforcement learning experiment (a construction sketch follows the list):

  • policy, generates an action during the interaction with the env. It may also update its strategy along the way.
  • env, the environment we're going to experiment with.
  • stop_condition, defines when the experiment terminates.
  • hook, collects some intermediate data during the experiment.
  • description, displays some useful information for logging.
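
A minimal sketch of assembling and running an experiment (RandomWalk1D is assumed to come from ReinforcementLearningEnvironments; any AbstractEnv works):

# assumes: using ReinforcementLearningCore, ReinforcementLearningEnvironments
ex = Experiment(
    RandomPolicy(),
    RandomWalk1D(),
    StopAfterEpisode(10),
    TotalRewardPerEpisode(),
    "# a toy experiment",
)
run(ex)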
ReinforcementLearningCore.GaussianNetworkType
GaussianNetwork(;pre=identity, μ, logσ, min_σ=0f0, max_σ=Inf32)

Returns μ and logσ when called. Create a distribution to sample from using Normal.(μ, exp.(logσ)). min_σ and max_σ are used to clip the output from logσ.
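
For example, a sketch for a 4-dimensional state and a 2-dimensional action (layer sizes are arbitrary):

using Flux

gn = GaussianNetwork(
    pre = Dense(4, 32, relu),
    μ = Dense(32, 2),
    logσ = Dense(32, 2),
)
μ, logσ = gn(rand(Float32, 4, 1))   # each is a 2×1 matrix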

ReinforcementLearningCore.GaussianNetworkMethod

This function is compatible with a multidimensional action space. When outputting an action, it uses tanh to normalize it.

  • rng::AbstractRNG=Random.GLOBAL_RNG
  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.
  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.MultiAgentManagerMethod
MultiAgentManager(player => policy...)

This is the simplest form of a multi-agent system. At each step the agents observe the environment from their own perspective and are updated independently. For environments of SEQUENTIAL style, agents that are not the current player will observe a dummy action of NO_OP in the PreActStage. For environments of SIMULTANEOUS style, please wrap them with SequentialEnv first.
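
A minimal sketch following the signature above (the player names are illustrative and must match the players of the environment):

m = MultiAgentManager(
    :Cross => RandomPolicy(),
    :Nought => RandomPolicy(),
)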

ReinforcementLearningCore.RandomPolicyType
RandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.

Note

You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.
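
A minimal usage sketch (env is assumed to be any AbstractEnv):

p = RandomPolicy()   # picks from legal_action_space(env) at run time
# a = p(env)         # given some env::AbstractEnv, returns a random legal action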

ReinforcementLearningCore.StopAfterNoImprovementType

StopAfterNoImprovement()

Stop training when a monitored metric has stopped improving.

Parameters:

fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.

  1. () -> reward(env)
  2. () -> total_reward_per_episode.reward

patience: Number of epochs with no improvement after which training will be stopped.

δ: Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than δ will count as no improvement.

Return true after the monitored metric has stopped improving.

ReinforcementLearningCore.SumTreeType
SumTree(capacity::Int)

Efficiently sample and update weights; see the related blog post for more details. Here we use a vector to represent the binary tree. Suppose we will have at most capacity leaves. Every time we push! a new node into the tree, only the most recent capacity nodes and their parent sums will be updated. The layout of the underlying vector is:

[-------- parent nodes --------][------ leaves ------]
[ size: 2^ceil(Int, log2(capacity)) - 1 ][ size: capacity ]

Example

julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
       push!(t, i)
       end
julia> t
8-element SumTree:
  9.0
 10.0
 11.0
 12.0
 13.0
 14.0
 15.0
 16.0
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6  …  1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0  …  9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
  7 => 14991
  4 => 12019
  2 => 10003
  3 => 11027
  5 => 12971
  8 => 16052
  6 => 13952
  1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
  9.0  => 8985
  13.0 => 12971
  10.0 => 10003
  14.0 => 13952
  16.0 => 16052
  11.0 => 11027
  15.0 => 14991
  12.0 => 12019
ReinforcementLearningCore.TabularApproximatorType
TabularApproximator(table<:AbstractArray, opt)

For a 1-d table, it serves as a state value approximator. For a 2-d table, it serves as a state-action value approximator (see the sketch after the warning below).

Warning

For a 2-d table, the first dimension is the action and the second dimension is the state.
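
For example, a 2-action × 3-state Q table and a 3-state value table (the optimizer choice is illustrative):

using Flux: Descent

q_app = TabularApproximator(zeros(Float32, 2, 3), Descent(0.1))   # first dim: action, second: state
v_app = TabularApproximator(zeros(Float32, 3), Descent(0.1))      # state values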

ReinforcementLearningCore.UCBExplorerMethod
UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)

Arguments

  • na is the number of actions, used to create an internal counter.
  • t is used to store the current time step.
  • c is used to control the degree of exploration.
  • seed, set the seed of the inner RNG.
  • is_training=true, when set to false, the time step and counter will not be updated.
ReinforcementLearningCore.VBasedPolicyType
VBasedPolicy(;learner, mapping=default_value_action_mapping)

The learner must be a value learner. The mapping is a function which returns an action given the env and the learner. By default we iterate through all the valid actions and select the one which leads to the maximum state value.

Base.push!Method

When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.

CUDA.deviceMethod
device(model)

Detect the suitable running device for the model. Return Val(:cpu) by default.

ReinforcementLearningBase.probMethod
prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical

Return the probability of selecting each action given the estimated values of each action.
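
A minimal usage sketch (the values are arbitrary):

s = EpsilonGreedyExplorer(0.2)
prob(s, [1.0, 2.0, 1.5])   # a Categorical distribution, ≈ [0.067, 0.867, 0.067] here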

ReinforcementLearningCore.consecutive_viewMethod
consecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)

By default, it behaves the same as select_last_dim(x, inds). If n_stack is set to an Int, then for each frame specified by inds, the previous n_stack frames (including the current one) are concatenated as a new dimension. If n_horizon is set to an Int, then for each frame specified by inds, the next n_horizon frames (including the current one) are concatenated as a new dimension.

Example

julia> x = collect(1:5)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> consecutive_view(x, [2,4])  # just the same with `select_last_dim(x, [2,4])`
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
 2
 4

julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
 1  3
 2  4

julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
 2  4
 3  5

julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2)  # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]

[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
 1  2
 2  3

[:, :, 2] =
 3  4
 4  5

See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.

ReinforcementLearningCore.discount_rewardsMethod
discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the gain starting from the current step with a discount rate of γ. rewards can be a matrix.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
  • init=nothing, init can be used to provide the reward estimation of the last state.

Example
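
A minimal illustration (reward values are arbitrary):

rewards = [1.0, 1.0, 1.0]
γ = 0.5
# The gain is accumulated backwards: g[t] = r[t] + γ * g[t+1],
# so the result here is expected to be [1.75, 1.5, 1.0].
g = discount_rewards(rewards, γ)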

ReinforcementLearningCore.flatten_batchMethod
flatten_batch(x::AbstractArray)

Merge the last two dimensions.

Example

julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

[:, :, 3] =
  9  11
 10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12
ReinforcementLearningCore.generalized_advantage_estimationMethod
generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate starting from the current step with a discount rate of γ and a GAE-λ parameter of λ. rewards and values can be matrices.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.

Example
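
A manual illustration of the recursion being estimated (numbers are arbitrary; not exact library output):

r, V = [1.0, 1.0], [0.5, 0.4, 0.3]            # V includes an estimate for the state after the last step
γ, λ = 0.9, 0.95
δ = [r[t] + γ * V[t+1] - V[t] for t in 1:2]   # TD errors
A2 = δ[2]                                     # advantage of the last step
A1 = δ[1] + γ * λ * A2                        # GAE accumulates backwards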

ReinforcementLearningCore.normlogpdfMethod

GPU-compatible and automatically differentiable version of the logpdf function for normal distributions. An epsilon value is added to guarantee numerical stability when σ is exactly zero (e.g. when relu is used in the output layer).

StatsBase.sampleMethod
sample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])
Note

Here we return a copy instead of a view:

  1. Each sample is independent of the original trajectory, so the trajectory can be updated asynchronously.
  2. Copying is not always that expensive.