ReinforcementLearningCore.jl

ReinforcementLearningCore.AbstractApproximatorType
(app::AbstractApproximator)(env)

An approximator is a functional object for value estimation. It serves as a black box that provides an abstraction over different kinds of approximation methods (for example, a deep neural network built with Flux or Knet).
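
For illustration, here is a minimal sketch of a custom approximator wrapping a Flux model (the type name and layer sizes are hypothetical):

using Flux

# A hypothetical Q-value approximator: 4-dimensional state in, 2 action values out.
struct MyQApproximator <: AbstractApproximator
    model::Chain
end

# Calling the approximator returns the estimated values.
(app::MyQApproximator)(s) = app.model(s)

app = MyQApproximator(Chain(Dense(4, 32, relu), Dense(32, 2)))
app(rand(Float32, 4))  # 2 estimated action values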

ReinforcementLearningCore.AbstractHookType

A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can override the behavior by implementing the following methods (a minimal sketch follows the list):

  • (hook::YourHook)(::PreActStage, agent, env, action), note that there's an extra argument of action.
  • (hook::YourHook)(::PostActStage, agent, env)
  • (hook::YourHook)(::PreEpisodeStage, agent, env)
  • (hook::YourHook)(::PostEpisodeStage, agent, env)
  • (hook::YourHook)(::PostExperimentStage, agent, env)
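
For example, a custom hook that records the reward observed after every step (the type and field names are hypothetical):

Base.@kwdef struct StepRewardHook <: AbstractHook
    rewards::Vector{Float64} = Float64[]
end

# Record the reward observed after each action.
(h::StepRewardHook)(::PostActStage, agent, env) = push!(h.rewards, reward(env))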
ReinforcementLearningCore.AbstractTrajectoryType
AbstractTrajectory

A trajectory is used to record useful information during the interactions between agents and environments. It behaves similarly to a NamedTuple, except that it is extended with some optional methods. (A minimal custom trajectory sketch follows the method lists below.)

Required Methods:

  • Base.getindex
  • Base.keys

Optional Methods:

  • Base.length
  • Base.isempty
  • Base.empty!
  • Base.haskey
  • Base.push!
  • Base.pop!
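
For illustration, a minimal custom trajectory backed by a NamedTuple of vectors (the type and trace names are hypothetical):

struct VectorSATrajectory <: AbstractTrajectory
    traces::NamedTuple
end

VectorSATrajectory() = VectorSATrajectory((state = Vector{Float32}[], action = Int[]))

# The two required methods simply forward to the underlying NamedTuple.
Base.getindex(t::VectorSATrajectory, k::Symbol) = t.traces[k]
Base.keys(t::VectorSATrajectory) = keys(t.traces)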
ReinforcementLearningCore.ActorCriticType
ActorCritic(;actor, critic, optimizer=ADAM())

The actor part must return logits (do not use softmax in the last layer!), and the critic part must return a state value.
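
For example, a minimal sketch for a 4-dimensional state and 2 discrete actions (layer sizes are arbitrary):

using Flux

ac = ActorCritic(
    actor = Chain(Dense(4, 32, relu), Dense(32, 2)),   # logits, no softmax
    critic = Chain(Dense(4, 32, relu), Dense(32, 1)),  # state value
)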

ReinforcementLearningCore.AgentType
Agent(;kwargs...)

A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and the policy appropriately at different stages.

Keywords & Fields

  • policy::AbstractPolicy: the policy to use.
  • trajectory::AbstractTrajectory: used to store the transitions between the agent and the environment.
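
A minimal construction sketch (the trajectory configuration is illustrative):

agent = Agent(
    policy = RandomPolicy(),
    trajectory = CircularArraySARTTrajectory(
        capacity = 1000,
        state = Vector{Float32} => (4,),
    ),
)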

ReinforcementLearningCore.AgentMethod

Here we extend the definition of (p::AbstractPolicy)(::AbstractEnv) in RLBase to accept an AbstractStage as the first argument. Algorithm designers may customize these behaviors respectively by implementing:

  • (p::YourPolicy)(::AbstractStage, ::AbstractEnv)
  • (p::YourPolicy)(::PreActStage, ::AbstractEnv, action)

The default behaviors for Agent are (see the sketch after the list):

  1. Update the inner trajectory given the context of policy, env, and stage:
       • By default we do nothing.
       • In the PreActStage, we push! the current state and the action into the trajectory.
       • In the PostActStage, we query the reward and is_terminated info from env and push them into the trajectory.
       • In the PostEpisodeStage, we push the state at the end of an episode and a dummy action into the trajectory.
       • In the PreEpisodeStage, we pop out the latest state and action pair (which are dummy ones) from the trajectory.

  2. Update the inner policy given the context of trajectory, env, and stage:
       • By default, we only update! the policy in the PreActStage, and it's dispatched to update!(policy, trajectory, env, stage).
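
For example, a sketch of a policy that adds custom behavior at the start of each episode (the policy type is hypothetical):

struct MyPolicy <: AbstractPolicy end

# How to act when queried for an action (assuming the action space supports rand).
(p::MyPolicy)(env::AbstractEnv) = rand(action_space(env))

# Extra behavior injected at the start of each episode.
(p::MyPolicy)(::PreEpisodeStage, env::AbstractEnv) = @info "starting a new episode"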

ReinforcementLearningCore.CircularArraySARTTrajectoryMethod
CircularArraySARTTrajectory(;capacity::Int, kw...)

A specialized CircularArrayTrajectory with traces of SART. Note that the capacity of the :state and :action trace is one step longer than the capacity of the :reward and :terminal trace, so that we can reuse the same trace to represent the next state and next action in a typical transition in reinforcement learning.

Keyword arguments

  • capacity::Int, the maximum number of transitions.
  • state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => (),
  • terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => (),

Example

julia> t = CircularArraySARTTrajectory(;
           capacity = 3,
           state = Vector{Int} => (4,),
           action = Int => (),
           reward = Float32 => (),
           terminal = Bool => (),
       )
Trajectory of 4 traces:
:state 4×0 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t[:state]
4×4 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}:
 2  3  4  5
 2  3  4  5
 2  3  4  5
 2  3  4  5

julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
 2
 3
 4
 5

julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
 1.0
 1.5
 2.0

julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
 1
 0
 1
ReinforcementLearningCore.CircularVectorSARTTrajectoryMethod
CircularVectorSARTTrajectory(;capacity, kw::DataType...)

A specialized CircularVectorTrajectory with traces of SART. Note that the capacities of the :state and :action traces are one step longer than those of the :reward and :terminal traces, so that we can reuse the same underlying storage to represent the next state and next action in a typical transition in reinforcement learning.

Keyword arguments

  • capacity::Int
  • state = Int,
  • action = Int,
  • reward = Float32,
  • terminal = Bool,

Example

julia> t = CircularVectorSARTTrajectory(;
           capacity = 3,
           state = Vector{Int},
           action = Int,
           reward = Float32,
           terminal = Bool,
       )
Trajectory of 4 traces:
:state 0-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t[:state]
4-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}:
 [2, 2, 2, 2]
 [3, 3, 3, 3]
 [4, 4, 4, 4]
 [5, 5, 5, 5]

julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
 2
 3
 4
 5

julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
 1.0
 1.5
 2.0

julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
 1
 0
 1
ReinforcementLearningCore.DuelingNetworkType
DuelingNetwork(;base, val, adv)

A dueling network automatically produces separate estimates of the state value and of the advantage of each action. The expected output size of val is 1, and that of adv is the size of the action space.
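
For example, a sketch for a 4-dimensional state and 2 actions (layer sizes are arbitrary):

using Flux

q = DuelingNetwork(
    base = Dense(4, 32, relu),
    val = Dense(32, 1),   # state value head, output size 1
    adv = Dense(32, 2),   # advantage head, one output per action
)
q(rand(Float32, 4, 1))   # a 2×1 matrix of action values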

ReinforcementLearningCore.ElasticSARTTrajectoryMethod
ElasticSARTTrajectory(;kw...)

A specialized ElasticArrayTrajectory with traces of SART.

Keyword arguments

  • state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (), by default it means the state is a scalar of Int.
  • action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (),
  • reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => (),
  • terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => (),

Example

julia> t = ElasticSARTTrajectory(;
           state = Vector{Int} => (4,),
           action = Int => (),
           reward = Float32 => (),
           terminal = Bool => (),
       )
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}


julia> for i in 1:4
           push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
       end

julia> push!(t;state=ones(Int,4) .* 5, action = 5)

julia> t
Trajectory of 4 traces:
:state 4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}

julia> t[:state]
4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
 1  2  3  4  5
 1  2  3  4  5
 1  2  3  4  5
 1  2  3  4  5

julia> t[:action]
5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}:
 1
 2
 3
 4
 5

julia> t[:reward]
4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}:
 0.5
 1.0
 1.5
 2.0

julia> t[:terminal]
4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}:
 0
 1
 0
 1

julia> empty!(t)

julia> t
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}
ReinforcementLearningCore.EpsilonGreedyExplorerType
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: The best lever is selected for a proportion 1 - epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. (From the Wikipedia article on the multi-armed bandit.)

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. (From the same article.)

Keywords

  • T::Symbol: defines how epsilon decays during the decay steps. Supported values are linear and exp.
  • step::Int = 1: record the current step.
  • ϵ_init::Float64 = 1.0: initial epsilon.
  • warmup_steps::Int=0: the number of steps to use ϵ_init.
  • decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
  • ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
  • is_break_tie=false: if set to true, randomly select one of the actions that share the maximum value.
  • rng=Random.GLOBAL_RNG: set the internal RNG.
  • is_training=true: when is_training is set to false, step will not be updated and ϵ is treated as 0 (pure exploitation).

Example

s = EpsilonGreedyExplorer{:linear}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="linear epsilon")

s = EpsilonGreedyExplorer{:exp}(ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RL.get_ϵ(s, i) for i in 1:500], label="exp epsilon")

ReinforcementLearningCore.EpsilonGreedyExplorerMethod
(s::EpsilonGreedyExplorer)(values; step)
Note

If multiple values share the maximum value, a random one among them will be returned!

NaN will be filtered out unless all the values are NaN; in that case, a random one will be returned.
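
A minimal usage sketch (the values are arbitrary):

s = EpsilonGreedyExplorer(0.1)
a = s([1.0, 2.0, 1.5])   # usually index 2; with probability ϵ a uniformly random index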

ReinforcementLearningCore.ExperimentType
Experiment(policy, env, stop_condition, hook, description)

These are the essential components of a typical reinforcement learning experiment (a construction sketch follows the list):

  • policy, generates an action during the interaction with the env. It may also update its strategy along the way.
  • env, the environment we're going to experiment with.
  • stop_condition, defines when the experiment terminates.
  • hook, collects some intermediate data during the experiment.
  • description, displays some useful information for logging.
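
A minimal sketch of assembling and running an experiment (RandomWalk1D is assumed to come from ReinforcementLearningEnvironments; any AbstractEnv works):

# assumes: using ReinforcementLearningCore, ReinforcementLearningEnvironments
ex = Experiment(
    RandomPolicy(),
    RandomWalk1D(),
    StopAfterEpisode(10),
    TotalRewardPerEpisode(),
    "# a toy experiment",
)
run(ex)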
ReinforcementLearningCore.GaussianNetworkType
GaussianNetwork(;pre=identity, μ, logσ, min_σ=0f0, max_σ=Inf32)

Returns μ and logσ when called. Create a distribution to sample from using Normal.(μ, exp.(logσ)). min_σ and max_σ are used to clip the output from logσ.
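
For example, a sketch for a 4-dimensional state and a 2-dimensional action (layer sizes are arbitrary):

using Flux

gn = GaussianNetwork(
    pre = Dense(4, 32, relu),
    μ = Dense(32, 2),
    logσ = Dense(32, 2),
)
μ, logσ = gn(rand(Float32, 4, 1))   # each is a 2×1 matrix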

ReinforcementLearningCore.GaussianNetworkMethod

This function is compatible with a multidimensional action space. When outputting an action, it uses tanh to normalize it.

  • rng::AbstractRNG=Random.GLOBAL_RNG
  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.
  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.MultiAgentManagerMethod
MultiAgentManager(player => policy...)

This is the simplest form of a multi-agent system. At each step the agents observe the environment from their own perspective and are updated independently. For environments of SEQUENTIAL style, agents that are not the current player will observe a dummy action of NO_OP in the PreActStage. For environments of SIMULTANEOUS style, please wrap them with SequentialEnv first.
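
A minimal sketch following the signature above (the player names are illustrative and must match the players of the environment):

m = MultiAgentManager(
    :Cross => RandomPolicy(),
    :Nought => RandomPolicy(),
)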

ReinforcementLearningCore.RandomPolicyType
RandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.

Note

You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.
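
A minimal usage sketch (env is assumed to be any AbstractEnv):

p = RandomPolicy()   # picks from legal_action_space(env) at run time
# a = p(env)         # given some env::AbstractEnv, returns a random legal action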

ReinforcementLearningCore.StopAfterNoImprovementType

StopAfterNoImprovement()

Stop training when a monitored metric has stopped improving.

Parameters:

fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.

  1. () -> reward(env)
  2. () -> total_reward_per_episode.reward

patience: Number of epochs with no improvement after which training will be stopped.

δ: Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than δ will count as no improvement.

Return true after the monitored metric has stopped improving.

ReinforcementLearningCore.SumTreeType
SumTree(capacity::Int)

Efficiently sample and update weights; see the related blog post for more details. Here we use a vector to represent the binary tree. Suppose we will have at most capacity leaves. Every time we push! a new node into the tree, only the most recent capacity nodes and their parent sums will be updated. The layout of the underlying vector is:

[-------- parent nodes --------][------ leaves ------]
[ size: 2^ceil(Int, log2(capacity)) - 1 ][ size: capacity ]

Example

julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
       push!(t, i)
       end
julia> t
8-element SumTree:
  9.0
 10.0
 11.0
 12.0
 13.0
 14.0
 15.0
 16.0
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6  …  1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0  …  9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
  7 => 14991
  4 => 12019
  2 => 10003
  3 => 11027
  5 => 12971
  8 => 16052
  6 => 13952
  1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
  9.0  => 8985
  13.0 => 12971
  10.0 => 10003
  14.0 => 13952
  16.0 => 16052
  11.0 => 11027
  15.0 => 14991
  12.0 => 12019
ReinforcementLearningCore.TabularApproximatorType
TabularApproximator(table<:AbstractArray, opt)

For a 1-d table, it serves as a state value approximator. For a 2-d table, it serves as a state-action value approximator (see the sketch after the warning below).

Warning

For a 2-d table, the first dimension is the action and the second dimension is the state.
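
For example, a 2-action × 3-state Q table and a 3-state value table (the optimizer choice is illustrative):

using Flux: Descent

q_app = TabularApproximator(zeros(Float32, 2, 3), Descent(0.1))   # first dim: action, second: state
v_app = TabularApproximator(zeros(Float32, 3), Descent(0.1))      # state values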

ReinforcementLearningCore.UCBExplorerMethod
UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)

Arguments

  • na is the number of actions, used to create an internal counter.
  • t is used to store the current time step.
  • c is used to control the degree of exploration.
  • seed, set the seed of the inner RNG.
  • is_training=true, when set to false, the time step and counter will not be updated.
ReinforcementLearningCore.VBasedPolicyType
VBasedPolicy(;learner, mapping=default_value_action_mapping)

The learner must be a value learner. The mapping is a function which returns an action given the env and the learner. By default we iterate through all the valid actions and select the one which leads to the maximum state value.

Base.push!Method

When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.

CUDA.deviceMethod
device(model)

Detect the suitable running device for the model. Return Val(:cpu) by default.

ReinforcementLearningBase.probMethod
prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical

Return the probability of selecting each action given the estimated values of each action.
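
A minimal usage sketch (the values are arbitrary):

s = EpsilonGreedyExplorer(0.2)
prob(s, [1.0, 2.0, 1.5])   # a Categorical distribution, ≈ [0.067, 0.867, 0.067] here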

ReinforcementLearningCore.consecutive_viewMethod
consecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)

By default, it behaves the same as select_last_dim(x, inds). If n_stack is set to an Int, then for each frame specified by inds, the previous n_stack frames (including the current one) are concatenated as a new dimension. If n_horizon is set to an Int, then for each frame specified by inds, the next n_horizon frames (including the current one) are concatenated as a new dimension.

Example

julia> x = collect(1:5)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> consecutive_view(x, [2,4])  # just the same with `select_last_dim(x, [2,4])`
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
 2
 4

julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
 1  3
 2  4

julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
 2  4
 3  5

julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2)  # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]

[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
 1  2
 2  3

[:, :, 2] =
 3  4
 4  5

See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.

ReinforcementLearningCore.discount_rewardsMethod
discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the gain starting from the current step with a discount rate of γ. rewards can be a matrix.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
  • init=nothing, init can be used to provide the reward estimation of the last state.

Example
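
A minimal illustration (reward values are arbitrary):

rewards = [1.0, 1.0, 1.0]
γ = 0.5
# The gain is accumulated backwards: g[t] = r[t] + γ * g[t+1],
# so the result here is expected to be [1.75, 1.5, 1.0].
g = discount_rewards(rewards, γ)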

ReinforcementLearningCore.flatten_batchMethod
flatten_batch(x::AbstractArray)

Merge the last two dimensions.

Example

julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

[:, :, 3] =
  9  11
 10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12
ReinforcementLearningCore.generalized_advantage_estimationMethod
generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate starting from the current step with a discount rate of γ and a GAE-λ parameter of λ. rewards and values can be matrices.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.

Example
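
A manual illustration of the recursion being estimated (numbers are arbitrary; not exact library output):

r, V = [1.0, 1.0], [0.5, 0.4, 0.3]            # V includes an estimate for the state after the last step
γ, λ = 0.9, 0.95
δ = [r[t] + γ * V[t+1] - V[t] for t in 1:2]   # TD errors
A2 = δ[2]                                     # advantage of the last step
A1 = δ[1] + γ * λ * A2                        # GAE accumulates backwards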

ReinforcementLearningCore.normlogpdfMethod

GPU-compatible and automatically differentiable version of the logpdf function for normal distributions. An epsilon value is added to guarantee numerical stability when σ is exactly zero (e.g. when relu is used in the output layer).

StatsBase.sampleMethod
sample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])
Note

Here we return a copy instead of a view:

  1. Each sample is independent of the original trajectory, so the trajectory can be updated asynchronously.
  2. Copying is not always that expensive.