ReinforcementLearningCore.jl

ReinforcementLearningCore.AbstractHookType

A hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook does nothing. One can customize the behavior by implementing the following methods:

  • Base.push!(hook::YourHook, ::PreActStage, agent, env)
  • Base.push!(hook::YourHook, ::PostActStage, agent, env)
  • Base.push!(hook::YourHook, ::PreEpisodeStage, agent, env)
  • Base.push!(hook::YourHook, ::PostEpisodeStage, agent, env)
  • Base.push!(hook::YourHook, ::PostExperimentStage, agent, env)

By convention, Base.getindex(h::YourHook) is implemented to extract the metrics we are interested in. Users can compose different AbstractHooks with +.
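
For example, a custom hook that records the reward after every step might look like the following sketch (RewardRecorder is a hypothetical name, not part of RLCore):

using ReinforcementLearningCore, ReinforcementLearningBase

struct RewardRecorder <: AbstractHook
    rewards::Vector{Float64}
end
RewardRecorder() = RewardRecorder(Float64[])

# Record the environment reward after each action.
Base.push!(h::RewardRecorder, ::PostActStage, agent, env) = push!(h.rewards, reward(env))
# Extract the collected metric.
Base.getindex(h::RewardRecorder) = h.rewards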

source
ReinforcementLearningCore.ActorCriticType
ActorCritic(;actor, critic, optimizer=Adam())

The actor part must return logits (Do not use softmax in the last layer!), and the critic part must return a state value.
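
For illustration, a hypothetical construction for a 4-dimensional state and 2 discrete actions (the layer sizes are arbitrary):

using Flux

ac = ActorCritic(
    actor = Chain(Dense(4 => 32, relu), Dense(32 => 2)),   # outputs logits, no softmax
    critic = Chain(Dense(4 => 32, relu), Dense(32 => 1)),  # outputs a state value
    optimizer = Adam(),
)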

source
ReinforcementLearningCore.AgentType
Agent(;policy, trajectory) <: AbstractPolicy

A wrapper of an AbstractPolicy. Generally speaking, it does nothing but update the trajectory and policy appropriately at different stages. Agent is callable; its call method accepts varargs and keyword arguments that are passed on to the policy.

source
ReinforcementLearningCore.CategoricalNetworkType
CategoricalNetwork(model)([rng,] state::AbstractArray [, mask::AbstractArray{Bool}]; is_sampling::Bool=false, is_return_log_prob::Bool = false)

CategoricalNetwork wraps a model (typically a neural network) that takes a state input and outputs logits for a categorical distribution. The optional argument mask must be an Array of Bool with the same size as state, except that the first dimension must have the length of the action vector. Actions mapped to false by mask have a logit equal to -Inf and/or a zero probability of being sampled.

  • rng::AbstractRNG=Random.default_rng()
  • is_sampling::Bool=false, whether to sample from the obtained categorical distribution (returns a Flux.OneHotArray z).
  • is_return_log_prob::Bool=false, whether to return the logits (i.e. the unnormalized log-probabilities) of getting the sampled actions in the given state. Only applies if is_sampling is true, in which case z, logits are returned.

If is_sampling = false, only the logits obtained by a simple forward pass through model are returned.
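
For example, a sketch wrapping a small Flux model (the sizes are illustrative assumptions):

using Flux, Random

cn = CategoricalNetwork(Dense(4 => 3))   # 4-dimensional state, 3 discrete actions
state = rand(Float32, 4, 1)              # (state_size x batchsize)
logits = cn(state)                       # plain forward pass
z, logits = cn(Random.default_rng(), state; is_sampling = true, is_return_log_prob = true)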

source
ReinforcementLearningCore.CategoricalNetworkMethod
(model::CategoricalNetwork)([rng::AbstractRNG,] state::AbstractArray{<:Any, 3}, [mask::AbstractArray{Bool},] action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). The logits of each action are always returned alongside, in a tensor with the same dimensions. The optional argument mask must be an Array of Bool with the same size as state, except that the first dimension must have the length of the action vector. Actions mapped to false by mask have a logit equal to -Inf and/or a zero probability of being sampled.

source
ReinforcementLearningCore.CovGaussianNetworkType
CovGaussianNetwork(;pre=identity, μ, Σ)

Returns μ and Σ when called, where μ is the mean and Σ is a covariance matrix. Unlike GaussianNetwork, the output is 3-dimensional. μ has dimensions (action_size x 1 x batchsize) and Σ has dimensions (action_size x action_size x batchsize). The Σ head of the CovGaussianNetwork should not directly return a square matrix but a vector of length action_size * (action_size + 1) ÷ 2. This vector contains the elements of the upper triangular Cholesky decomposition of the covariance matrix, which is then reconstructed from it. Sample from MvNormal.(μ, Σ).
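
For example, for a 2-dimensional action space the Σ head must output vectors of length 2 * 3 ÷ 2 = 3. A hypothetical construction (layer sizes are illustrative):

using Flux

action_size = 2
cgn = CovGaussianNetwork(
    pre = Dense(4 => 16, relu),
    μ   = Dense(16 => action_size),
    Σ   = Dense(16 => action_size * (action_size + 1) ÷ 2),  # Cholesky vector head
)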

source
ReinforcementLearningCore.CovGaussianNetworkMethod
(model::CovGaussianNetwork)(state::AbstractArray, action::AbstractArray)

Return the logpdf of the model sampling action when in state. State must be a 3D tensor with dimensions (state_size x 1 x batchsize). Multiple actions may be taken per state, action must have dimensions (action_size x action_samples_per_state x batchsize). Returns a 3D tensor with dimensions (1 x action_samples_per_state x batchsize).

source
ReinforcementLearningCore.CovGaussianNetworkMethod
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions per state in state and return the actions, logpdf(actions). This function is compatible with a multidimensional action space. The outputs are 3D tensors with dimensions (action_size x action_samples x batchsize) and (1 x action_samples x batchsize) for actions and logpdf respectively.

source
ReinforcementLearningCore.CovGaussianNetworkMethod
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}; is_sampling::Bool=false, is_return_log_prob::Bool=false)

This function is compatible with a multidimensional action space. To work with covariance matrices, the outputs are 3D tensors. If sampling, return an actions tensor with dimensions (action_size x action_samples x batchsize) and a logp_π tensor with dimensions (1 x action_samples x batchsize). If not sampling, returns μ with dimensions (action_size x 1 x batchsize) and L, the lower triangular factor of the Cholesky decomposition of the covariance matrix, with dimensions (action_size x action_size x batchsize). The covariance matrices can be retrieved with Σ = stack(map(l -> l*l', eachslice(L, dims=3)); dims=3)

  • rng::AbstractRNG=Random.default_rng()
  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.
  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
source
ReinforcementLearningCore.CovGaussianNetworkMethod
(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractMatrix; is_sampling::Bool=false, is_return_log_prob::Bool=false)

Given a Matrix of states, will return actions, μ and logpdf in matrix format. The batch of Σ remains a 3D tensor.

source
ReinforcementLearningCore.CurrentPlayerIteratorType
CurrentPlayerIterator(env::E) where {E<:AbstractEnv}

CurrentPlayerIterator is an iterator that iterates over the players in the environment, returning the current player for each iteration. This is only necessary for MultiAgent environments. After each iteration, RLBase.next_player! is called to advance the current player. As long as RLBase.next_player! is defined for the environment, this iterator will work correctly in the Base.run function.

source
ReinforcementLearningCore.DuelingNetworkType
DuelingNetwork(;base, val, adv)

A dueling network automatically produces separate estimates of the state value function and the advantage function. The expected output size of val is 1, and that of adv is the size of the action space.
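
A hypothetical construction for a 4-dimensional state and 2 actions (layer sizes are illustrative):

using Flux

dn = DuelingNetwork(
    base = Dense(4 => 16, relu),
    val  = Dense(16 => 1),   # state value head
    adv  = Dense(16 => 2),   # advantage head, one output per action
)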

source
ReinforcementLearningCore.EpsilonGreedyExplorerType
EpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)

Epsilon-greedy strategy: the best lever is selected for a proportion 1 - epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon (see Multi-armed bandit).

Two kinds of epsilon-decreasing strategy are implemented here (linear and exp).

Epsilon-decreasing strategy: similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish (see Multi-armed bandit).

Keywords

  • T::Symbol: defines how to calculate the epsilon in the warmup steps. Supported values are linear and exp.
  • step::Int = 1: record the current step.
  • ϵ_init::Float64 = 1.0: initial epsilon.
  • warmup_steps::Int=0: the number of steps to use ϵ_init.
  • decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
  • ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
  • is_break_tie=false: randomly select an action among those with the same maximum value if set to true.
  • rng=Random.default_rng(): set the internal RNG.

Example

using ReinforcementLearningCore, Plots

s_lin = EpsilonGreedyExplorer(kind=:linear, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RLCore.get_ϵ(s_lin, i) for i in 1:500], label="linear epsilon")
s_exp = EpsilonGreedyExplorer(kind=:exp, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot!([RLCore.get_ϵ(s_exp, i) for i in 1:500], label="exp epsilon")

source
ReinforcementLearningCore.ExperimentType
Experiment(policy::AbstractPolicy, env::AbstractEnv, stop_condition::AbstractStopCondition, hook::AbstractHook)

A struct to hold the information of an experiment. It is used to run an experiment with the given policy, environment, stop condition and hook.

source
ReinforcementLearningCore.FluxApproximatorType
FluxApproximator(model, optimiser)

Wraps a Flux trainable model and implements the RLBase.optimise!(::FluxApproximator, ::Gradient) interface. See the RLCore documentation for more information on proper usage.

source
ReinforcementLearningCore.FluxApproximatorMethod
FluxApproximator(; model, optimiser, usegpu=false)

Constructs a FluxApproximator object for reinforcement learning.

Arguments

  • model: The model used for approximation.
  • optimiser: The optimizer used for updating the model.
  • usegpu: A boolean indicating whether to use GPU for computation. Default is false.

Returns

A FluxApproximator object.
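
A hypothetical construction (the model and optimiser are illustrative):

using Flux

approx = FluxApproximator(
    model = Chain(Dense(4 => 32, relu), Dense(32 => 2)),
    optimiser = Adam(),
    usegpu = false,
)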

source
ReinforcementLearningCore.GaussianNetworkMethod
(model::GaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). The logpdf of each action is always returned alongside.

source
ReinforcementLearningCore.GaussianNetworkMethod

This function is compatible with a multidimensional action space.

  • rng::AbstractRNG=Random.default_rng()
  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.
  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
source
ReinforcementLearningCore.OfflineAgentType
OfflineAgent(policy::AbstractPolicy, trajectory::Trajectory, offline_behavior::OfflineBehavior = OfflineBehavior()) <: AbstractAgent

OfflineAgent is an AbstractAgent that, unlike the usual online Agent, does not interact with the environment during training in order to collect data. Just like Agent, it contains an AbstractPolicy to be trained and a Trajectory that contains the training data. The difference is that the trajectory is filled prior to training and is not updated afterwards. An OfflineBehavior can optionally be provided to supply a second "behavior agent" that will generate the training data at the PreExperimentStage. It does nothing by default.

source
ReinforcementLearningCore.OfflineBehaviorType
OfflineBehavior(; agent::Union{<:Agent, Nothing}, steps::Int, reset_condition)

Used to provide an OfflineAgent with a "behavior agent" that will generate the training data at the PreExperimentStage. If agent is nothing (by default), does nothing. The trajectory of agent should be the same as that of the parent OfflineAgent. steps is the number of data elements to generate, defaults to the capacity of the trajectory. reset_condition is the episode reset condition for the data generation (defaults to ResetIfEnvTerminated()).

The behavior agent will interact with the main environment of the experiment to generate the data.

source
ReinforcementLearningCore.QBasedPolicyType
QBasedPolicy(;learner, explorer)

Wraps a learner and an explorer. The learner is a struct that should predict the Q-value of each legal action of an environment at its current state. It is typically a table or a neural network. QBasedPolicy can be queried for an action with RLBase.plan!, the explorer will affect the action selection accordingly.

source
ReinforcementLearningCore.RandomPolicyType
RandomPolicy(action_space=nothing; rng=Random.default_rng())

If action_space is nothing, then it will use the legal_action_space at runtime to randomly select an action. Otherwise, a random element within action_space is selected.

Note

You should always set action_space=nothing when dealing with environments of FULL_ACTION_SET.

source
ReinforcementLearningCore.SoftGaussianNetworkType
SoftGaussianNetwork(;pre=identity, μ, σ, min_σ=0f0, max_σ=Inf32, squash = tanh)

Like GaussianNetwork but with a differentiable reparameterization trick. Mainly used for SAC. Returns μ and σ when called. Create a distribution to sample from using Normal.(μ, σ). min_σ and max_σ are used to clip the output from σ. pre is a shared body before the two heads of the NN. σ should be > 0. You may enforce this using a softplus output activation. Actions are squashed by a tanh and a correction is applied to the logpdf.
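
A hypothetical construction, using a softplus activation on the σ head to keep it positive (layer sizes are illustrative):

using Flux

sgn = SoftGaussianNetwork(
    pre = Dense(4 => 16, relu),
    μ   = Dense(16 => 2),
    σ   = Dense(16 => 2, softplus),  # keeps σ > 0
)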

source
ReinforcementLearningCore.SoftGaussianNetworkMethod
(model::SoftGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)

Sample action_samples actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize). state must be a 3D tensor with dimensions (state_size x 1 x batchsize). The logpdf of each action is always returned alongside.

source
ReinforcementLearningCore.SoftGaussianNetworkMethod

This function is compatible with a multidimensional action space.

  • rng::AbstractRNG=Random.default_rng()
  • is_sampling::Bool=false, whether to sample from the obtained normal distribution.
  • is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
source
ReinforcementLearningCore.StackFramesType
StackFrames(::Type{T}=Float32, d::Int...)

Use a pre-initialized CircularArrayBuffer to store the latest several states, with the size specified by d. Before processing any observation, the buffer is filled with zero(T) by default.
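
For example, a buffer that stacks the last 4 frames of 84×84 observations (the sizes are illustrative):

sf = StackFrames(Float32, 84, 84, 4)  # zero-filled until observations are processed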

source
ReinforcementLearningCore.StopAfterNoImprovementType

StopAfterNoImprovement()

Stop training when a monitored metric has stopped improving.

Parameters:

  • fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g. () -> reward(env) or () -> total_reward_per_episode.reward.
  • patience: number of epochs with no improvement after which training will be stopped.
  • δ: minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than δ counts as no improvement.

Returns true after the monitored metric has stopped improving.

source
ReinforcementLearningCore.TDLearnerType
TDLearner(;approximator, method, γ=1.0, α=0.01, n=0)

Use temporal-difference method to estimate state value or state-action value.

Fields

  • approximator: a <:TabularApproximator.
  • γ=1.0: the discount rate.
  • α=0.01: the learning rate.
  • method: only :SARS (Q-learning) is supported for the time being.
  • n=0: the number of time steps used minus 1.
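
A hypothetical construction for a tabular Q-learning setup with 10 states and 4 actions, following the TabularApproximator convention documented below:

learner = TDLearner(
    approximator = TabularApproximator(zeros(4, 10)),  # actions × states
    method = :SARS,
    γ = 0.99,
    α = 0.1,
)
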
source
ReinforcementLearningCore.TabularApproximatorMethod
TabularApproximator(table<:AbstractArray)

For a 1-d table, it serves as a state value approximator. For a 2-d table, it serves as a state-action value approximator.

Warning

For a 2-d table, the first dimension is the action and the second dimension is the state.
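
For example (a sketch; the sizes are illustrative):

V = TabularApproximator(zeros(10))     # 1-d table: state values for 10 states
Q = TabularApproximator(zeros(4, 10))  # 2-d table: 4 actions × 10 states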

source
ReinforcementLearningCore.TargetNetworkType
TargetNetwork(network::FluxApproximator; sync_freq::Int = 1, ρ::Float32 = 0f0)

Wraps a FluxApproximator to hold a target network that is updated towards the model of the approximator.

  • sync_freq is the number of updates of network between each update of the target.
  • ρ is "how much of the target is kept when updating it".

The two common usages of TargetNetwork are

  • use ρ = 0 to totally replace target with network every sync_freq updates.
  • use ρ < 1 (but close to one) and sync_freq = 1 to let the target follow network with polyak averaging.

Implements the RLBase.optimise!(::TargetNetwork, ::Gradient) interface to update the model with the gradient and the target with weights replacement or Polyak averaging.

Note to developers: model(::TargetNetwork) returns the trainable Flux model, target(::TargetNetwork) returns the target model, and target(::FluxApproximator) returns the non-trainable Flux model. See the RLCore documentation.
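
A hypothetical sketch of both usages (the wrapped model and optimiser are illustrative):

using Flux

# Hard update: copy the model into the target every 100 optimise! calls.
hard = TargetNetwork(FluxApproximator(model = Dense(4 => 2), optimiser = Adam()); sync_freq = 100)
# Soft update: Polyak averaging with ρ close to one at every call.
soft = TargetNetwork(FluxApproximator(model = Dense(4 => 2), optimiser = Adam()); ρ = 0.99f0)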

source
ReinforcementLearningCore.TargetNetworkMethod
TargetNetwork(network; sync_freq = 1, ρ = 0f0, use_gpu = false)

Constructs a target network for reinforcement learning.

Arguments

  • network: The main network used for training.
  • sync_freq: The frequency (in number of calls to optimise!) at which the target network is synchronized with the main network. Default is 1.
  • ρ: The interpolation factor used for updating the target network. Must be in the range [0, 1]. Default is 0 (the old weights are completely replaced by the new ones).
  • use_gpu: Specifies whether to use GPU for the target network. Default is false.

Returns

A TargetNetwork object.

source
ReinforcementLearningCore.UCBExplorerMethod
UCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)

Arguments

  • na: the number of actions, used to create an internal counter.
  • t: used to store the current time step.
  • c: used to control the degree of exploration.
  • seed: sets the seed of the internal RNG.
source
Base.push!Method

When pushing a StackFrames into a CircularArrayBuffer of the same dimension, only the latest frame is pushed. If the StackFrames is one dimension lower, then it is treated as a general AbstractArray and is pushed in as a frame.

source
Base.runMethod
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function dispatches games using MultiAgentPolicy and MultiAgentHook to the appropriate run function based on the Sequential or Simultaneous trait of the environment.

source
Base.runMethod
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Sequential,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function handles MultiAgent games with the Sequential trait. It iterates over the current_player for each turn in the environment and runs the full run loop, as in the SingleAgent case. If the stop_condition is met, it breaks out of the loop, calls optimise! on the policy one last time, and returns the MultiAgentHook.

source
Base.runMethod
Base.run(
    multiagent_policy::MultiAgentPolicy,
    env::E,
    ::Simultaneous,
    stop_condition,
    hook::MultiAgentHook,
    reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}

This run function handles MultiAgent games with the Simultaneous trait. It iterates over the players in the environment and, for each player, selects the appropriate policy from the MultiAgentPolicy. All agent actions are collected before the environment is updated. After each player has taken an action, it calls optimise! on the policy. If the stop_condition is met, it breaks out of the loop, calls optimise! on the policy one last time, and returns the MultiAgentHook.

source
ReinforcementLearningBase.plan!Method
RLBase.plan!(s::EpsilonGreedyExplorer, values; step)
Note

If multiple values share the maximum value, a random one among them will be returned when is_break_tie==true.

NaN values will be filtered out unless all the values are NaN; in that case, a random action will be returned.

source
ReinforcementLearningBase.probMethod
prob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical

Return the probability of selecting each action given the estimated values of each action.

source
ReinforcementLearningCore.cholesky_matrix_to_vector_indexMethod
cholesky_matrix_to_vector_index(i, j)

Return the position in a cholesky_vec (of length da) of the element of the lower triangular matrix at coordinates (i,j).

For example if cholesky_vec = [1,2,3,4,5,6], the corresponding lower triangular matrix is

L = [1 0 0
     2 4 0
     3 5 6]

and cholesky_matrix_to_vector_index(3, 2) == 5

source
ReinforcementLearningCore.diagnormkldivergenceMethod
diagnormkldivergence(μ1, σ1, μ2, σ2)

GPU differentiable implementation of the kl_divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 and diagonal standard deviations σ1, σ2 respectively. Arguments must be Vectors or arrays of column vectors.

source
ReinforcementLearningCore.diagnormlogpdfMethod
diagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU compatible and automatically differentiable version of the logpdf function for normal distributions with diagonal covariance. An epsilon value is added to guarantee numerical stability if σ is exactly zero (e.g. if relu is used in the output layer). Accepts arguments of the same shape: vectors, matrices, or 3D arrays (with dimension 2 of size 1).

source
ReinforcementLearningCore.discount_rewardsMethod
discount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)

Calculate the gain starting from the current step with discount rate γ. rewards can be a matrix.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
  • init=nothing, init can be used to provide the reward estimation of the last state.

Example
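
A minimal sketch, assuming the standard recursion G_t = r_t + γ * G_{t+1} (with the gain of the last step equal to its reward when terminal and init are not provided):

gains = discount_rewards([1.0, 2.0, 3.0], 0.5)
# expected: [1 + 0.5 * (2 + 0.5 * 3), 2 + 0.5 * 3, 3] == [2.75, 3.5, 3.0]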

source
ReinforcementLearningCore.flatten_batchMethod
flatten_batch(x::AbstractArray)

Merge the last two dimensions.

Example

julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
 1  3
 2  4

[:, :, 2] =
 5  7
 6  8

[:, :, 3] =
  9  11
 10  12

julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
 1  3  5  7   9  11
 2  4  6  8  10  12
source
ReinforcementLearningCore.generalized_advantage_estimationMethod
generalized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)

Calculate the generalized advantage estimate, starting from the current step, with discount rate γ and GAE-Lambda parameter λ. rewards and values can be matrices.

Keyword arguments

  • dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
  • terminal=nothing, specify whether each reward is followed by a terminal state. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.

Example
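
For reference, the estimate follows the usual GAE recursion (a sketch of the definition, not the exact implementation):

δ_t = r_t + γ * V(s_{t+1}) - V(s_t)
A_t = δ_t + γ * λ * A_{t+1}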

source
ReinforcementLearningCore.logdetLorUMethod
logdetLorU(LorU::AbstractMatrix)

Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U. Has a sign uncertainty for non-PSD matrices.

source
ReinforcementLearningCore.mvnormkldivergenceMethod
mvnormkldivergence(μ1, L1, μ2, L2)

GPU differentiable implementation of the kl_divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2 and Cholesky decompositions L1, L2 of their covariance matrices, respectively.

source
ReinforcementLearningCore.mvnormlogpdfMethod
mvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)

GPU compatible and automatically differentiable version of the logpdf function for multivariate normal distributions. Takes as inputs μ the mean vector, L the lower triangular matrix of the Cholesky decomposition of the covariance matrix, and x a matrix of samples where each column is a sample. Returns a Vector containing the logpdf of each column of x for the MvNormal parametrized by μ and Σ = L*L'.

source
ReinforcementLearningCore.mvnormlogpdfMethod
mvnormlogpdf(μ::A, LorU::A, x::A; ϵ = 1f-8) where A <: AbstractArray

Batch version that takes 3D tensors as input, where each slice along the 3rd dimension is a batch sample. μ is an (action_size x 1 x batchsize) array, LorU is (action_size x action_size x batchsize), and x is (action_size x action_samples x batchsize). Returns a 3D array of size (1 x action_samples x batchsize).

source
ReinforcementLearningCore.normkldivergenceMethod
normkldivergence(μ1, σ1, μ2, σ2)

GPU differentiable implementation of the kl_divergence between two univariate Gaussian distributions with means μ1, μ2 and standard deviations σ1, σ2 respectively.

source
ReinforcementLearningCore.normlogpdfMethod
 normlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU compatible and automatically differentiable version of the logpdf function for a univariate normal distribution. An epsilon value is added to guarantee numerical stability if σ is exactly zero (e.g. if relu is used in the output layer).

source

In addition to containing the run loop, RLCore is a collection of pre-implemented components that are frequently used in RL.

QBasedPolicy

QBasedPolicy is an AbstractPolicy that wraps a Q-Value learner (tabular or approximated) and an explorer. Use this wrapper to implement a policy that directly uses a Q-value function to decide its next action. In that case, instead of creating an AbstractPolicy subtype for your algorithm, define an AbstractLearner subtype and specialize RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory). This way you will not have to code the interaction between your policy and the explorer yourself. RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.). You can find many examples of QBasedPolicies in the DQNs section of RLZoo.
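
For example, a tabular Q-learning policy can be assembled from the components documented above (a sketch; the table sizes are illustrative):

policy = QBasedPolicy(
    learner = TDLearner(
        approximator = TabularApproximator(zeros(4, 10)),  # 4 actions × 10 states
        method = :SARS,
    ),
    explorer = EpsilonGreedyExplorer(0.1),
)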

Parametric approximators

FluxApproximator

If your algorithm uses a neural network or a linear approximator, trained with Flux.jl, to approximate a function, use FluxApproximator. It wraps a Flux model and an Optimiser (such as Adam or SGD). Your optimise!(::PolicyOrLearner, batch) function will probably consist of computing a gradient and then calling RLBase.optimise!(app::FluxApproximator, gradient::Flux.Grads).

FluxApproximator implements the model(::FluxApproximator) and target(::FluxApproximator) interface. Both return the underlying Flux model. The advantage of this interface is explained in the TargetNetwork section below.
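
A hedged sketch of such an update step, using the implicit-parameter gradient style that produces a Flux.Grads (the loss and data are illustrative assumptions):

using Flux
using ReinforcementLearningCore
import ReinforcementLearningBase as RLBase

approx = FluxApproximator(model = Dense(4 => 2), optimiser = Adam())
x, y = rand(Float32, 4, 32), rand(Float32, 2, 32)
gs = Flux.gradient(() -> Flux.mse(model(approx)(x), y), Flux.params(model(approx)))
RLBase.optimise!(approx, gs)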

TargetNetwork

The use of a target network is frequent in state- or action-value-based RL. The principle is to hold a main approximator, which is trained using a gradient, and a copy of it that is either only partially updated or updated less frequently. TargetNetwork is constructed by wrapping a FluxApproximator. Set the sync_freq keyword argument to a value greater than one to copy the main model into the target every sync_freq updates, or set the ρ parameter to a value greater than 0 (usually 0.99f0) to let the target be partially updated towards the main model at every update. RLBase.optimise!(tn::TargetNetwork, gradient::Flux.Grads) will take care of updating the target for you.

The other advantage of TargetNetwork is that it uses Julia's multiple dispatch to let your algorithm be agnostic to the presence or absence of a target network. For example, the DQNLearner in RLZoo has an approximator field that accepts either a FluxApproximator or a TargetNetwork. When computing the temporal difference error, the learner calls Q = model(learner.approximator) and Qt = target(learner.approximator). If learner.approximator is a FluxApproximator, then no target network is used because both calls point to the same neural network; if it is a TargetNetwork, then the automatically managed target is returned.
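
In code, the two branches collapse into the same lines (a sketch of the pattern, not the actual DQNLearner implementation):

using ReinforcementLearningCore

# Works whether `approximator` is a FluxApproximator or a TargetNetwork.
function q_networks(approximator)
    Q  = model(approximator)    # trainable network
    Qt = target(approximator)   # same network, or the managed target if present
    return Q, Qt
end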

Architectures

Common model architectures are also provided, such as GaussianNetwork for continuous policies with a diagonal covariance matrix, and CovGaussianNetwork for a full covariance matrix (very slow on GPUs at the moment).