ReinforcementLearningCore.jl
ReinforcementLearningCore.AbstractExplorer
— TypeRLBase.plan!(p::AbstractExplorer, x[, mask])
Define how to select an action based on action values.
ReinforcementLearningCore.AbstractHook
— TypeA hook is called at different stages during a run
to allow users to inject customized runtime logic. By default, an AbstractHook
will do nothing. One can customize the behavior by implementing the following methods:
Base.push!(hook::YourHook, ::PreActStage, agent, env)
Base.push!(hook::YourHook, ::PostActStage, agent, env)
Base.push!(hook::YourHook, ::PreEpisodeStage, agent, env)
Base.push!(hook::YourHook, ::PostEpisodeStage, agent, env)
Base.push!(hook::YourHook, ::PostExperimentStage, agent, env)
By convention, Base.getindex(h::YourHook)
is implemented to extract the metrics we are interested in. Users can compose different AbstractHooks with +.
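For illustration, here is a minimal sketch of a custom hook (the EpisodeCounter name and its field are hypothetical) that counts finished episodes by specializing the PostEpisodeStage method listed above:
mutable struct EpisodeCounter <: AbstractHook
    n::Int
end
EpisodeCounter() = EpisodeCounter(0)
Base.push!(hook::EpisodeCounter, ::PostEpisodeStage, agent, env) = hook.n += 1
Base.getindex(hook::EpisodeCounter) = hook.n  # extract the collected metric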
ReinforcementLearningCore.AbstractLearner
— TypeAbstractLearner
Abstract type for a learner.
ReinforcementLearningCore.ActorCritic
— TypeActorCritic(;actor, critic, optimizer=Adam())
The actor
part must return logits (Do not use softmax in the last layer!), and the critic
part must return a state value.
ReinforcementLearningCore.Agent
— TypeAgent(;policy, trajectory) <: AbstractPolicy
A wrapper of an AbstractPolicy
. Generally speaking, it does nothing but update the trajectory and the policy appropriately at different stages. Agent is callable; its call method accepts varargs and keyword arguments, which are passed on to the policy.
ReinforcementLearningCore.BatchExplorer
— TypeBatchExplorer(explorer::AbstractExplorer)
ReinforcementLearningCore.BatchStepsPerEpisode
— MethodBatchStepsPerEpisode(batchsize::Int; tag = "TRAINING")
Similar to StepsPerEpisode
, but is specific to environments which return a Vector
of rewards (a typical case with MultiThreadEnv
).
ReinforcementLearningCore.CategoricalNetwork
— TypeCategoricalNetwork(model)([rng,] state::AbstractArray [, mask::AbstractArray{Bool}]; is_sampling::Bool=false, is_return_log_prob::Bool = false)
CategoricalNetwork wraps a model (typically a neural network) that takes a state
input and outputs logits for a categorical distribution. The optional argument mask
must be an Array of Bool
with the same size as state
except for the first dimension, which must have the length of the action vector. Actions mapped to false
by mask have a logit equal to -Inf
and/or a zero-probability of being sampled.
rng::AbstractRNG=Random.default_rng()
is_sampling::Bool=false, whether to sample from the obtained categorical distribution (returns a Flux.OneHotArray).
is_return_log_prob::Bool=false, whether to return the logits (i.e. the unnormalized log-probabilities) of getting the sampled actions in the given state. Only applies if is_sampling is true, in which case z, logits is returned.
If is_sampling = false
, returns only the logits obtained by a simple forward pass into model
.
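A minimal usage sketch, assuming Flux is available and with illustrative sizes; it wraps a single Dense layer and samples one action per state in a batch of five:
using Flux, Random
n_states, n_actions = 4, 3
net = CategoricalNetwork(Dense(n_states => n_actions))
states = rand(Float32, n_states, 5)                  # batch of 5 states
logits = net(states)                                 # plain forward pass, returns only the logits
z, logits = net(Random.default_rng(), states; is_sampling = true, is_return_log_prob = true)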
ReinforcementLearningCore.CategoricalNetwork
— Method(model::CategoricalNetwork)([rng::AbstractRNG,] state::AbstractArray{<:Any, 3}, [mask::AbstractArray{Bool},] action_samples::Int)
Sample action_samples
actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize)
. Always returns the logits of each action as well, in a tensor with the same dimensions. The optional argument mask
must be an Array of Bool
with the same size as state
except for the first dimension, which must have the length of the action vector. Actions mapped to false
by mask have a logit equal to -Inf
and/or a zero-probability of being sampled.
ReinforcementLearningCore.CovGaussianNetwork
— TypeCovGaussianNetwork(;pre=identity, μ, Σ)
Returns μ
and Σ
when called where μ is the mean and Σ is a covariance matrix. Unlike GaussianNetwork, the output is 3-dimensional. μ has dimensions (action_size x 1 x batchsize)
and Σ has dimensions (action_size x action_size x batchsize)
. The Σ head of the CovGaussianNetwork
should not directly return a square matrix but a vector of length action_size x (action_size + 1) ÷ 2
. This vector will contain elements of the upper triangular Cholesky decomposition of the covariance matrix, which is then reconstructed from it. Sample from MvNormal.(μ, Σ)
.
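A construction sketch with illustrative sizes, assuming Flux for the layers; for an action dimension da, the Σ head outputs the da * (da + 1) ÷ 2 Cholesky elements described above:
using Flux
ns, da = 10, 3
net = CovGaussianNetwork(
    pre = Dense(ns => 64, relu),
    μ   = Dense(64 => da),
    Σ   = Dense(64 => da * (da + 1) ÷ 2),
)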
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(state::AbstractArray, action::AbstractArray)
Return the logpdf of the model sampling action
when in state
. State must be a 3D tensor with dimensions (state_size x 1 x batchsize)
. Multiple actions may be taken per state, action
must have dimensions (action_size x action_samples_per_state x batchsize)
. Returns a 3D tensor with dimensions (1 x action_samples_per_state x batchsize)
.
ReinforcementLearningCore.CovGaussianNetwork
— MethodIf given 2D matrices as input, will return a 2D matrix of logpdf. States and actions are paired column-wise, one action per state.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples
actions per state in state
and return the actions, logpdf(actions)
. This function is compatible with a multidimensional action space. The outputs are 3D tensors with dimensions (action_size x action_samples x batchsize)
and (1 x action_samples x batchsize)
for actions
and logpdf
respectively.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}; is_sampling::Bool=false, is_return_log_prob::Bool=false)
This function is compatible with a multidimensional action space. To work with covariance matrices, the outputs are 3D tensors. If sampling, return an actions tensor with dimensions (action_size x action_samples x batchsize)
and a logp_π
tensor with dimensions (1 x action_samples x batchsize)
. If not sampling, returns μ
with dimensions (action_size x 1 x batchsize)
and L
, the lower triangular factor of the Cholesky decomposition of the covariance matrix, with dimensions (action_size x action_size x batchsize).
The covariance matrices can be retrieved with Σ = stack(map(l -> l*l', eachslice(L, dims=3)); dims=3).
rng::AbstractRNG=Random.default_rng()
is_sampling::Bool=false, whether to sample from the obtained normal distribution.
is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractMatrix; is_sampling::Bool=false, is_return_log_prob::Bool=false)
Given a Matrix of states, will return actions, μ and logpdf in matrix format. The batch of Σ remains a 3D tensor.
ReinforcementLearningCore.CurrentPlayerIterator
— TypeCurrentPlayerIterator(env::E) where {E<:AbstractEnv}
CurrentPlayerIterator
is an iterator that iterates over the players in the environment, returning the current_player for each iteration. This is only necessary for MultiAgent environments. After each iteration, RLBase.next_player! is called to advance the current_player. As long as RLBase.next_player! is defined for the environment, this iterator will work correctly in the Base.run function.
ReinforcementLearningCore.DoEveryNEpisodes
— TypeDoEveryNEpisodes(f; n=1, t=0)
Execute f(t, agent, env)
every n
episodes. t
is a counter of episodes.
ReinforcementLearningCore.DoEveryNSteps
— TypeDoEveryNSteps(f; n=1, t=0)
Execute f(t, agent, env)
every n
steps. t
is a counter of steps.
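For example, a hedged sketch of a periodic logging hook built with DoEveryNSteps (the log message is illustrative):
log_hook = DoEveryNSteps(; n = 1_000) do t, agent, env
    println("step $t, current reward: $(reward(env))")
end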
ReinforcementLearningCore.DoOnExit
— TypeDoOnExit(f)
Call the lambda function f
at the end of an Experiment
.
ReinforcementLearningCore.DuelingNetwork
— TypeDuelingNetwork(;base, val, adv)
The dueling network automatically produces separate estimates of the state value function and the advantage function. The expected output size of val is 1; the expected output size of adv is the size of the action space.
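A construction sketch with illustrative layer sizes, assuming Flux; val ends in a single output and adv ends in one output per action:
using Flux
ns, na = 8, 4
q_net = DuelingNetwork(
    base = Chain(Dense(ns => 64, relu)),
    val  = Dense(64 => 1),
    adv  = Dense(64 => na),
)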
ReinforcementLearningCore.EmptyHook
— TypeNothing but a placeholder.
ReinforcementLearningCore.EpsilonGreedyExplorer
— TypeEpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)
Epsilon-greedy strategy: The best lever is selected for a proportion
1 - epsilon
of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. - Multi-armed_bandit
Two kinds of epsilon-decreasing strategy are implemented here (linear
and exp
).
Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. - Multi-armed_bandit
Keywords
T::Symbol: defines how to calculate the epsilon in the warmup steps. Supported values are linear and exp.
step::Int = 1: record the current step.
ϵ_init::Float64 = 1.0: initial epsilon.
warmup_steps::Int = 0: the number of steps to use ϵ_init.
decay_steps::Int = 0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
is_break_tie = false: randomly select one of the actions with the maximum value if set to true.
rng = Random.default_rng(): set the internal RNG.
Example
s_lin = EpsilonGreedyExplorer(kind=:linear, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RLCore.get_ϵ(s_lin, i) for i in 1:500], label="linear epsilon")
s_exp = EpsilonGreedyExplorer(kind=:exp, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot!([RLCore.get_ϵ(s_exp, i) for i in 1:500], label="exp epsilon")
ReinforcementLearningCore.Experiment
— TypeExperiment(policy::AbstractPolicy, env::AbstractEnv, stop_condition::AbstractStopCondition, hook::AbstractHook)
A struct to hold the information of an experiment. It is used to run an experiment with the given policy, environment, stop condition and hook.
ReinforcementLearningCore.FluxApproximator
— TypeFluxApproximator(model, optimiser)
Wraps a Flux trainable model and implements the RLBase.optimise!(::FluxApproximator, ::Gradient)
interface. See the RLCore documentation for more information on proper usage.
ReinforcementLearningCore.FluxApproximator
— MethodFluxApproximator(; model, optimiser, usegpu=false)
Constructs a FluxApproximator
object for reinforcement learning.
Arguments
model: the model used for approximation.
optimiser: the optimizer used for updating the model.
usegpu: a boolean indicating whether to use the GPU for computation. Default is false.
Returns
A FluxApproximator
object.
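A construction sketch (the model and sizes are illustrative):
using Flux
approx = FluxApproximator(
    model = Chain(Dense(4 => 32, relu), Dense(32 => 2)),
    optimiser = Adam(1e-3),
    usegpu = false,
)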
ReinforcementLearningCore.GaussianNetwork
— Method(model::GaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples
actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize)
. state
must be a 3D tensor with dimensions (state_size x 1 x batchsize)
. Always returns the logpdf of each action as well.
ReinforcementLearningCore.GaussianNetwork
— MethodThis function is compatible with a multidimensional action space.
rng::AbstractRNG=Random.default_rng()
is_sampling::Bool=false, whether to sample from the obtained normal distribution.
is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.MultiAgentHook
— TypeMultiAgentHook(hooks::NT) where {NT<: NamedTuple}
MultiAgentHook is a hook struct that contains <:AbstractHook
structs indexed by the player's symbol.
ReinforcementLearningCore.MultiAgentPolicy
— TypeMultiAgentPolicy(agents::NT) where {NT<: NamedTuple}
MultiAgentPolicy is a policy struct that contains <:AbstractPolicy
structs indexed by the player's symbol.
ReinforcementLearningCore.OfflineAgent
— TypeOfflineAgent(policy::AbstractPolicy, trajectory::Trajectory, offline_behavior::OfflineBehavior = OfflineBehavior()) <: AbstractAgent
OfflineAgent
is an AbstractAgent
that, unlike the usual online Agent
, does not interact with the environment during training in order to collect data. Just like Agent
, it contains an AbstractPolicy
to be trained and a Trajectory
that contains the training data. The difference is that the trajectory is filled prior to training and is not updated afterwards. An OfflineBehavior
can optionally be provided to supply a second "behavior agent" that will generate the training data at the PreExperimentStage
. It does nothing by default.
ReinforcementLearningCore.OfflineBehavior
— TypeOfflineBehavior(; agent:: Union{<:Agent, Nothing}, steps::Int, reset_condition)
Used to provide an OfflineAgent with a "behavior agent" that will generate the training data at the PreExperimentStage
. If agent
is nothing
(by default), does nothing. The trajectory
of agent should be the same as that of the parent OfflineAgent
. steps
is the number of data elements to generate, defaults to the capacity of the trajectory. reset_condition
is the episode reset condition for the data generation (defaults to ResetIfEnvTerminated()
).
The behavior agent will interact with the main environment of the experiment to generate the data.
ReinforcementLearningCore.PerturbationNetwork
— MethodThis function accepts state
and action
, and then outputs actions after disturbance.
ReinforcementLearningCore.PlayerTuple
— TypePlayerTuple
A NamedTuple that maps players to their respective values.
ReinforcementLearningCore.PostActStage
— TypeStage that is executed after the Agent
acts.
ReinforcementLearningCore.PostEpisodeStage
— TypeStage that is executed after the Episode
is over.
ReinforcementLearningCore.PostExperimentStage
— TypeStage that is executed after the Experiment
is over.
ReinforcementLearningCore.PreActStage
— TypeStage that is executed before the Agent
acts.
ReinforcementLearningCore.PreEpisodeStage
— TypeStage that is executed before the Episode
starts.
ReinforcementLearningCore.PreExperimentStage
— TypeStage that is executed before the Experiment
starts.
ReinforcementLearningCore.QBasedPolicy
— TypeQBasedPolicy(;learner, explorer)
Wraps a learner and an explorer. The learner is a struct that should predict the Q-value of each legal action of an environment at its current state. It is typically a table or a neural network. QBasedPolicy can be queried for an action with RLBase.plan!
; the explorer then affects the action selection accordingly.
ReinforcementLearningCore.RandomPolicy
— TypeRandomPolicy(action_space=nothing; rng=Random.default_rng())
If action_space
is nothing
, then it will use the legal_action_space
at runtime to randomly select an action. Otherwise, a random element within action_space
is selected.
You should always set action_space=nothing
when dealing with environments of FULL_ACTION_SET
.
ReinforcementLearningCore.ResetAfterNSteps
— TypeResetAfterNSteps(n)
A reset condition that resets the environment after n
steps.
ReinforcementLearningCore.ResetIfEnvTerminated
— TypeResetIfEnvTerminated()
A reset condition that resets the environment if is_terminated(env) is true.
ReinforcementLearningCore.RewardsPerEpisode
— TypeRewardsPerEpisode(; rewards = Vector{Vector{Float64}}())
Store each reward of each step in every episode in the field of rewards
.
ReinforcementLearningCore.SoftGaussianNetwork
— TypeSoftGaussianNetwork(;pre=identity, μ, σ, min_σ=0f0, max_σ=Inf32, squash = tanh)
Like GaussianNetwork
but with a differentiable reparameterization trick. Mainly used for SAC. Returns μ
and σ
when called. Create a distribution to sample from using Normal.(μ, σ)
. min_σ
and max_σ
are used to clip the output from σ
. pre
is a shared body before the two heads of the NN. σ should be > 0. You may enforce this using a softplus
output activation. Actions are squashed by a tanh and a correction is applied to the logpdf.
ReinforcementLearningCore.SoftGaussianNetwork
— Method(model::SoftGaussianNetwork)(rng::AbstractRNG, state::AbstractArray{<:Any, 3}, action_samples::Int)
Sample action_samples
actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batchsize)
. state
must be a 3D tensor with dimensions (state_size x 1 x batchsize)
. Always returns the logpdf of each action as well.
ReinforcementLearningCore.SoftGaussianNetwork
— MethodThis function is compatible with a multidimensional action space.
rng::AbstractRNG=Random.default_rng()
is_sampling::Bool=false, whether to sample from the obtained normal distribution.
is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.StackFrames
— TypeStackFrames(::Type{T}=Float32, d::Int...)
Use a pre-initialized CircularArrayBuffer
to store the latest several states specified by d
. Before processing any observation, the buffer is filled with zero(T) by default.
ReinforcementLearningCore.StepsPerEpisode
— TypeStepsPerEpisode(; steps = Int[], count = 0)
Store steps of each episode in the field of steps
.
ReinforcementLearningCore.StopAfterNEpisodes
— TypeStopAfterNEpisodes(episode; cur = 0, is_show_progress = true)
Return true
after being called episode
times. If is_show_progress
is true
, the ProgressMeter
will be used to show progress.
ReinforcementLearningCore.StopAfterNSeconds
— TypeStopAfterNSeconds
Parameter: time budget.
Stop training after N seconds.
ReinforcementLearningCore.StopAfterNSteps
— TypeStopAfterNSteps(step; cur = 1, is_show_progress = true)
Return true
after being called step
times.
ReinforcementLearningCore.StopAfterNoImprovement
— TypeStopAfterNoImprovement()
Stop training when a monitored metric has stopped improving.
Parameters:
fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.
- () -> reward(env)
- () -> total_reward_per_episode.reward
patience: number of epochs with no improvement after which training will be stopped.
δ: minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than δ counts as no improvement.
Return true
after the monitored metric has stopped improving.
ReinforcementLearningCore.StopIfAny
— TypeAnyStopCondition(stop_conditions...)
The result of stop_conditions
is reduced by any
.
ReinforcementLearningCore.StopIfEnvTerminated
— TypeStopIfEnvTerminated()
Return true
if the environment is terminated.
ReinforcementLearningCore.StopSignal
— TypeStopSignal()
Create a stop signal initialized with a value of false
. You can manually set it to true
by s[] = true
to stop the running loop at any time.
ReinforcementLearningCore.TDLearner
— TypeTDLearner(;approximator, method, γ=1.0, α=0.01, n=0)
Use temporal-difference method to estimate state value or state-action value.
Fields
approximator is <:TabularApproximator.
γ=1.0, discount rate.
method: only :SARS (Q-learning) is supported for the time being.
n=0: the number of time steps used minus 1.
ReinforcementLearningCore.TabularApproximator
— MethodTabularApproximator(table<:AbstractArray)
For table
of 1-d, it will serve as a state value approximator. For table
of 2-d, it will serve as a state-action value approximator.
For table
of 2-d, the first dimension is action and the second dimension is state.
ReinforcementLearningCore.TabularQApproximator
— MethodTabularQApproximator(; n_state, n_action, init = 0.0)
Create a TabularQApproximator
with n_state
states and n_action
actions.
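A sketch combining the two constructors above (state and action counts are illustrative):
approx  = TabularQApproximator(; n_state = 16, n_action = 4, init = 0.0)
learner = TDLearner(; approximator = approx, method = :SARS, γ = 0.99, α = 0.1, n = 0)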
ReinforcementLearningCore.TargetNetwork
— TypeTargetNetwork(network::FluxApproximator; sync_freq::Int = 1, ρ::Float32 = 0f0)
Wraps a FluxApproximator to hold a target network that is updated towards the model of the approximator.
sync_freq is the number of updates of network between each update of the target.
ρ (rho) is "how much of the target is kept when updating it".
The two common usages of TargetNetwork are:
- use ρ = 0 to totally replace target with network every sync_freq updates.
- use ρ < 1 (but close to one) and sync_freq = 1 to let the target follow network with Polyak averaging.
Implements the RLBase.optimise!(::TargetNetwork, ::Gradient)
interface to update the model with the gradient and the target with weights replacement or Polyak averaging.
Note to developers: model(::TargetNetwork)
returns the trainable Flux model, target(::TargetNetwork)
returns the non-trainable target model, and target(::FluxApproximator)
returns the approximator's own model, so that code can be written agnostically. See the RLCore documentation.
ReinforcementLearningCore.TargetNetwork
— MethodTargetNetwork(network; sync_freq = 1, ρ = 0f0, use_gpu = false)
Constructs a target network for reinforcement learning.
Arguments
network: the main network used for training.
sync_freq: the frequency (in number of calls to optimise!) at which the target network is synchronized with the main network. Default is 1.
ρ: the interpolation factor used for updating the target network. Must be in the range [0, 1]. Default is 0 (the old weights are completely replaced by the new ones).
use_gpu: specifies whether to use the GPU for the target network. Default is false.
Returns
A TargetNetwork
object.
ReinforcementLearningCore.TimePerStep
— TypeTimePerStep(;max_steps=100)
TimePerStep(times::CircularVectorBuffer{Float64}, t::Float64)
Store time cost in seconds of the latest max_steps
in the times
field.
ReinforcementLearningCore.TotalRewardPerEpisode
— TypeTotalRewardPerEpisode(; is_display_on_exit = true)
Store the total reward of each episode in the field of rewards
. If is_display_on_exit
is set to true
, a unicode plot will be shown at the PostExperimentStage
.
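As noted for AbstractHook, hooks compose with +; a minimal sketch (policy and env stand for any AbstractPolicy and AbstractEnv):
hook = TotalRewardPerEpisode() + StepsPerEpisode() + TimePerStep()
run(policy, env, StopAfterNEpisodes(100), hook)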
ReinforcementLearningCore.UCBExplorer
— MethodUCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)
Arguments
na is the number of actions, used to create an internal counter.
t is used to store the current time step.
c is used to control the degree of exploration.
seed, set the seed of the inner RNG.
ReinforcementLearningCore.VAE
— TypeVAE(;encoder, decoder, latent_dims)
ReinforcementLearningCore.WeightedExplorer
— TypeWeightedExplorer(;is_normalized::Bool, rng=Random.default_rng())
is_normalized
is used to indicate if the fed action values are already normalized to have a sum of 1.0
.
Elements are assumed to be >=0
.
See also: WeightedSoftmaxExplorer
ReinforcementLearningCore.WeightedSoftmaxExplorer
— TypeWeightedSoftmaxExplorer(;rng=Random.default_rng())
See also: WeightedExplorer
Base.push!
— MethodWhen pushing a StackFrames
into a CircularArrayBuffer
of the same dimension, only the latest frame is pushed. If the StackFrames
is one dimension lower, then it is treated as a general AbstractArray
and is pushed in as a frame.
Base.run
— MethodBase.run(
multiagent_policy::MultiAgentPolicy,
env::E,
stop_condition,
hook::MultiAgentHook,
reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function dispatches games using MultiAgentPolicy
and MultiAgentHook
to the appropriate run
function based on the Sequential
or Simultaneous
trait of the environment.
Base.run
— MethodBase.run(
multiagent_policy::MultiAgentPolicy,
env::E,
::Sequential,
stop_condition,
hook::MultiAgentHook,
reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function handles MultiAgent
games with the Sequential
trait. It iterates over the current_player
for each turn in the environment, and runs the full run
loop, like in the SingleAgent
case. If the stop_condition
is met, the function breaks out of the loop and calls optimise!
on the policy again. Finally, it calls optimise!
on the policy one last time and returns the MultiAgentHook
.
Base.run
— MethodBase.run(
multiagent_policy::MultiAgentPolicy,
env::E,
::Simultaneous,
stop_condition,
hook::MultiAgentHook,
reset_condition,
) where {E<:AbstractEnv, H<:AbstractHook}
This run function handles MultiAgent
games with the Simultaneous
trait. It iterates over the players in the environment, and for each player, it selects the appropriate policy from the MultiAgentPolicy
. All agent actions are collected before the environment is updated. After each player has taken an action, it calls optimise!
on the policy. If the stop_condition
is met, the function breaks out of the loop and calls optimise!
on the policy again. Finally, it calls optimise!
on the policy one last time and returns the MultiAgentHook
.
ReinforcementLearningBase.plan!
— MethodRLBase.plan!(x::BatchExplorer, values::AbstractMatrix)
Apply inner explorer to each column of values
.
ReinforcementLearningBase.plan!
— MethodRLBase.plan!(s::EpsilonGreedyExplorer, values; step) where T
If multiple values share the maximum value, a random one of them will be returned when is_break_tie==true.
NaN
will be filtered unless all the values are NaN
. In that case, a random one will be returned.
ReinforcementLearningBase.prob
— Methodprob(p::AbstractExplorer, x, mask)
Similar to prob(p::AbstractExplorer, x)
, but here only the masked elements are considered.
ReinforcementLearningBase.prob
— Methodprob(p::AbstractExplorer, x) -> AbstractDistribution
Get the action distribution given action values.
ReinforcementLearningBase.prob
— Methodprob(s::EpsilonGreedyExplorer, values) -> Categorical
prob(s::EpsilonGreedyExplorer, values, mask) -> Categorical
Return the probability of selecting each action given the estimated values
of each action.
ReinforcementLearningCore._discount_rewards!
— Methodassuming rewards and new_rewards are Vector
ReinforcementLearningCore._generalized_advantage_estimation!
— Methodassuming rewards and advantages are Vector
ReinforcementLearningCore.bellman_update!
— Methodbellman_update!(app::TabularApproximator, s::Int, s_plus_one::Int, a::Int, α::Float64, π_::Float64, γ::Float64)
Update the Q-value of the given state-action pair.
ReinforcementLearningCore.check
— MethodInject some customized checkings here by overwriting this function
ReinforcementLearningCore.cholesky_matrix_to_vector_index
— Methodcholesky_matrix_to_vector_index(i, j)
Return the position in a cholesky_vec (of length da) of the element of the lower triangular matrix at coordinates (i,j).
For example if cholesky_vec = [1,2,3,4,5,6]
, the corresponding lower triangular matrix is
L = [1 0 0
2 4 0
3 5 6]
and cholesky_matrix_to_vector_index(3, 2) == 5
ReinforcementLearningCore.diagnormkldivergence
— Methoddiagnormkldivergence(μ1, σ1, μ2, σ2)
GPU differentiable implementation of the kl_divergence between two MultiVariate Gaussian distributions with mean vectors μ1, μ2
respectively and diagonal standard deviations σ1, σ2
. Arguments must be Vectors or arrays of column vectors.
ReinforcementLearningCore.diagnormlogpdf
— Methoddiagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)
GPU compatible and automatically differentiable version of the logpdf function for normal distributions with diagonal covariance. An epsilon value is added to guarantee numerical stability when sigma is exactly zero (e.g. if relu is used in the output layer). Accepts arguments of the same shape: vectors, matrices, or 3D arrays (with dimension 2 of size 1).
ReinforcementLearningCore.discount_rewards
— Methoddiscount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)
Calculate the gain started from the current step with discount rate of γ
. rewards
can be a matrix.
Keyword arguments
dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
init=nothing, init can be used to provide the reward estimation of the last state.
Example
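A minimal sketch, assuming the standard discounted-return recursion G_t = r_t + γ * G_{t+1} with a zero estimate after the last step:
rewards = [1.0, 1.0, 1.0]
gains = discount_rewards(rewards, 0.5)
# expected: [1.75, 1.5, 1.0]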
ReinforcementLearningCore.flatten_batch
— Methodflatten_batch(x::AbstractArray)
Merge the last two dimensions.
Example
julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
1 3
2 4
[:, :, 2] =
5 7
6 8
[:, :, 3] =
9 11
10 12
julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
1 3 5 7 9 11
2 4 6 8 10 12
ReinforcementLearningCore.generalized_advantage_estimation
— Methodgeneralized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)
Calculate the generalized advantage estimate starting from the current step, with discount rate γ
and GAE-Lambda parameter λ. rewards
and values can be matrices.
Keyword arguments
dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
Example
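A hedged reference sketch of the recursion the function computes; gae_reference is a hypothetical helper written here for illustration, and the convention that values carries one extra bootstrap entry for the final state is an assumption:
function gae_reference(rewards, values, γ, λ)
    advantages = similar(rewards)
    gae = zero(eltype(rewards))
    for t in length(rewards):-1:1
        δ = rewards[t] + γ * values[t+1] - values[t]   # TD error at step t
        gae = δ + γ * λ * gae                          # discounted sum of TD errors
        advantages[t] = gae
    end
    advantages
end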
ReinforcementLearningCore.logdetLorU
— MethodlogdetLorU(LorU::AbstractMatrix)
Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U. Has a sign uncertainty for non-PSD matrices.
ReinforcementLearningCore.mvnormkldivergence
— Methodmvnormkldivergence(μ1, L1, μ2, L2)
GPU differentiable implementation of the kl_divergence between two multivariate Gaussian distributions with mean vectors μ1, μ2
respectively and with Cholesky decompositions L1, L2 of their covariance matrices.
ReinforcementLearningCore.mvnormlogpdf
— Methodmvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)
GPU compatible and automatically differentiable version of the logpdf function for multivariate normal distributions. Takes as inputs μ
the mean vector, L
the lower triangular matrix of the Cholesky decomposition of the covariance matrix, and x
a matrix of samples where each column is a sample. Returns a Vector containing the logpdf of each column of x for the MvNormal
parametrized by μ
and Σ = L*L'
.
ReinforcementLearningCore.mvnormlogpdf
— Methodmvnormlogpdf(μ::A, LorU::A, x::A; ϵ = 1f-8) where A <: AbstractArray
Batch version that takes 3D tensors as input, where each slice along the 3rd dimension is a batch sample. μ
is a (action_size x 1 x batchsize) array, L
is a (action_size x action_size x batchsize) array, and x is a (action_size x action_samples x batchsize) array. Returns a 3D array of size (1 x action_samples x batchsize).
ReinforcementLearningCore.normkldivergence
— Methodnormkldivergence(μ1, σ1, μ2, σ2)
GPU differentiable implementation of the kl_divergence between two univariate Gaussian distributions with means μ1, μ2
and standard deviations σ1, σ2
respectively.
ReinforcementLearningCore.normlogpdf
— Method normlogpdf(μ, σ, x; ϵ = 1.0f-8)
GPU compatible and automatically differentiable version of the logpdf function for a univariate normal distribution. An epsilon value is added to guarantee numerical stability when sigma is exactly zero (e.g. if relu is used in the output layer).
ReinforcementLearningCore.vec_to_tril
— MethodTransform a vector containing the non-zero elements of a lower triangular da x da matrix into that matrix.
In addition to containing the run loop, RLCore is a collection of pre-implemented components that are frequently used in RL.
QBasedPolicy
QBasedPolicy
is an AbstractPolicy
that wraps a Q-Value learner (tabular or approximated) and an explorer. Use this wrapper to implement a policy that directly uses a Q-value function to decide its next action. In that case, instead of creating an AbstractPolicy
subtype for your algorithm, define an AbstractLearner
subtype and specialize RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory)
. This way you will not have to code the interaction between your policy and the explorer yourself. RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.). You can find many examples of QBasedPolicies in the DQNs section of RLZoo.
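A hedged sketch of the pattern described above, wiring a documented learner and explorer together (hyperparameters are illustrative; env stands for any AbstractEnv):
policy = QBasedPolicy(;
    learner  = TDLearner(; approximator = TabularQApproximator(; n_state = 16, n_action = 4), method = :SARS),
    explorer = EpsilonGreedyExplorer(0.1),
)
action = RLBase.plan!(policy, env)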
Parametric approximators
Approximator
If your algorithm uses a neural network or a linear approximator to approximate a function trained with Flux.jl
, use the Approximator
. It wraps a Flux
model and an Optimiser
(such as Adam or SGD). Your optimise!(::PolicyOrLearner, batch)
function will probably consist in computing a gradient and call the RLBase.optimise!(app::Approximator, gradient::Flux.Grads)
after that.
Approximator
implements the model(::Approximator)
and target(::Approximator)
interface. Both return the underlying Flux model. The advantage of this interface is explained in the TargetNetwork
section below.
TargetNetwork
The use of a target network is frequent in state- or action-value-based RL. The principle is to hold the main approximator, which is trained using a gradient, and a copy of it that is either only partially updated, or just less frequently updated. TargetNetwork
is constructed by wrapping an Approximator
. Set the sync_freq
keyword argument to a value greater than one to copy the main model into the target every sync_freq
updates, or set the ρ
parameter to a value greater than 0 (usually 0.99f0) to let the target be partially updated towards the main model at every update. RLBase.optimise!(tn::TargetNetwork, gradient::Flux.Grads)
will take care of updating the target for you.
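Two hedged construction sketches matching the usages above (network sizes are illustrative; the documented constructor wraps a FluxApproximator):
using Flux
approx = FluxApproximator(model = Chain(Dense(4 => 32, relu), Dense(32 => 2)), optimiser = Adam())
tn_hard = TargetNetwork(approx; sync_freq = 100)            # copy the model into the target every 100 updates
tn_soft = TargetNetwork(approx; sync_freq = 1, ρ = 0.99f0)  # Polyak averaging: target ← 0.99 target + 0.01 model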
The other advantage of TargetNetwork
is that it uses Julia's multiple dispatch to let your algorithm be agnostic to the presence or absence of a target network. For example, the DQNLearner
in RLZoo has an approximator
field typed to be a Union{Approximator, TargetNetwork}
. When computing the temporal difference error, the learner calls Q = model(learner.approximator)
and Qt = target(learner.approximator)
. If learner.approximator
is an Approximator
, then no target network is used because both calls point to the same neural network; if it is a TargetNetwork
, then the automatically managed target is returned.
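A hedged sketch of code that stays agnostic to the presence of a target network, following the model/target interface above (td_target is a hypothetical helper that ignores terminal states):
# approximator may be a FluxApproximator or a TargetNetwork; the code is identical
function td_target(approximator, r, γ, s_next)
    Qt = target(approximator)                       # target model, or the model itself if no target network
    r .+ γ .* vec(maximum(Qt(s_next), dims = 1))    # 1-step TD target
end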
Architectures
Common model architectures are also provided, such as the GaussianNetwork
for continuous policies with a diagonal multivariate Gaussian distribution, and the CovGaussianNetwork
for a full covariance matrix (very slow on GPUs at the moment).