ReinforcementLearningCore.jl
ReinforcementLearningCore.RLCore
— ModuleReinforcementLearningCore.jl (RLCore) provides standard, reusable components built on the interfaces defined in RLBase, intended to make it easier to implement and experiment with different kinds of algorithms.
ReinforcementLearningCore.AbstractApproximator
— Type(app::AbstractApproximator)(env)
An approximator is a functional object for value estimation. It serves as a black box that provides an abstraction over different kinds of approximation methods (for example, a DNN provided by Flux or Knet).
ReinforcementLearningCore.AbstractExplorer
— Type(p::AbstractExplorer)(x)
(p::AbstractExplorer)(x, mask)
Define how to select an action based on action values.
ReinforcementLearningCore.AbstractHook
— TypeA hook is called at different stages during a run to allow users to inject customized runtime logic. By default, an AbstractHook
does nothing. One can override the behavior by implementing the following methods (see the sketch after this list):
(hook::YourHook)(::PreActStage, agent, env, action), note that there's an extra argument action
(hook::YourHook)(::PostActStage, agent, env)
(hook::YourHook)(::PreEpisodeStage, agent, env)
(hook::YourHook)(::PostEpisodeStage, agent, env)
(hook::YourHook)(::PostExperimentStage, agent, env)
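For example, a simple counting hook might look like the following minimal sketch (ActionCountHook is a hypothetical name used only for illustration, not part of RLCore):
using ReinforcementLearningCore

# Hypothetical hook: count how many actions are taken during a run.
Base.@kwdef mutable struct ActionCountHook <: AbstractHook
    n::Int = 0
end

# Bump the counter after each action.
(h::ActionCountHook)(::PostActStage, agent, env) = (h.n += 1)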
ReinforcementLearningCore.AbstractLearner
— Type(learner::AbstractLearner)(env)
A learner is usually used to estimate state values, state-action values or distributional values based on experiences.
ReinforcementLearningCore.AbstractTrajectory
— TypeAbstractTrajectory
A trajectory is used to record useful information during the interactions between agents and environments. It behaves similarly to a NamedTuple,
except that we extend it with some optional methods.
Required Methods:
Base.getindex
Base.keys
Optional Methods:
Base.length
Base.isempty
Base.empty!
Base.haskey
Base.push!
Base.pop!
ReinforcementLearningCore.ActorCritic
— TypeActorCritic(;actor, critic, optimizer=ADAM())
The actor
part must return logits (Do not use softmax in the last layer!), and the critic
part must return a state value.
ReinforcementLearningCore.Agent
— TypeAgent(;kwargs...)
A wrapper of an AbstractPolicy
. Generally speaking, it does nothing but update the trajectory and policy appropriately at different stages.
Keywords & Fields
policy::AbstractPolicy: the policy to use.
trajectory::AbstractTrajectory: used to store transitions between an agent and an environment.
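A minimal sketch of wiring the two fields together (the concrete policy and trajectory types here are just one possible choice):
agent = Agent(
    policy = RandomPolicy(),
    trajectory = VectorSARTTrajectory(),
)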
ReinforcementLearningCore.Agent
— MethodHere we extend the definition of (p::AbstractPolicy)(::AbstractEnv)
in RLBase
to accept an AbstractStage
as the first argument. Algorithm designers may customize these behaviors respectively by implementing:
(p::YourPolicy)(::AbstractStage, ::AbstractEnv)
(p::YourPolicy)(::PreActStage, ::AbstractEnv, action)
The default behaviors for Agent are:
Update the inner trajectory given the context of policy, env, and stage. By default we do nothing.
In PreActStage, we push! the current state and the action into the trajectory.
In PostActStage, we query the reward and is_terminated info from env and push them into the trajectory.
In the PostEpisodeStage, we push the state at the end of an episode and a dummy action into the trajectory.
In the PreEpisodeStage, we pop out the latest state and action pair (which are dummy ones) from the trajectory.
Update the inner policy given the context of trajectory, env, and stage. By default, we only update! the policy in the PreActStage, and it's dispatched to update!(policy, trajectory, env, stage).
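As a minimal sketch (the policy type and its behavior are made up for illustration, not part of RLCore), a custom policy that ignores all stage callbacks but still acts could look like:
struct MyPolicy <: AbstractPolicy end

# Do nothing at any stage…
(p::MyPolicy)(::AbstractStage, ::AbstractEnv) = nothing
# …but still return an action when the environment asks for one.
(p::MyPolicy)(env::AbstractEnv) = rand(action_space(env))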
ReinforcementLearningCore.BatchExplorer
— TypeBatchExplorer(explorer::AbstractExplorer)
ReinforcementLearningCore.BatchExplorer
— Method(x::BatchExplorer)(values::AbstractMatrix)
Apply inner explorer to each column of values
.
ReinforcementLearningCore.BatchStepsPerEpisode
— MethodBatchStepsPerEpisode(batch_size::Int; tag = "TRAINING")
Similar to StepsPerEpisode
, but is specific to environments which return a Vector
of rewards (a typical case with MultiThreadEnv
).
ReinforcementLearningCore.CircularArraySARTTrajectory
— MethodCircularArraySARTTrajectory(;capacity::Int, kw...)
A specialized CircularArrayTrajectory
with traces of SART
. Note that the capacity of the :state
and :action
trace is one step longer than the capacity of the :reward
and :terminal
trace, so that we can reuse the same trace to represent the next state and next action in a typical transition in reinforcement learning.
Keyword arguments
capacity::Int, the maximum number of transitions.
state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => ()
action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => ()
reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => ()
terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => ()
Example
julia> t = CircularArraySARTTrajectory(;
capacity = 3,
state = Vector{Int} => (4,),
action = Int => (),
reward = Float32 => (),
terminal = Bool => (),
)
Trajectory of 4 traces:
:state 4×0 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}
julia> for i in 1:4
push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
end
julia> push!(t;state=ones(Int,4) .* 5, action = 5)
julia> t[:state]
4×4 CircularArrayBuffers.CircularArrayBuffer{Int64, 2}:
2 3 4 5
2 3 4 5
2 3 4 5
2 3 4 5
julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
2
3
4
5
julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
1.0
1.5
2.0
julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
1
0
1
ReinforcementLearningCore.CircularArraySLARTTrajectory
— MethodSimilar to CircularArraySARTTrajectory
with an extra legal_actions_mask
trace.
ReinforcementLearningCore.CircularVectorSARTSATrajectory
— MethodSimilar to CircularVectorSARTTrajectory
with two additional traces: (:next_state, :next_action)
ReinforcementLearningCore.CircularVectorSARTTrajectory
— MethodCircularVectorSARTTrajectory(;capacity, kw::DataType...)
A specialized CircularVectorTrajectory
with traces of SART
. Note that the capacity of the :state and :action traces is one step longer than that of the :reward and :terminal traces, so that we can reuse the same underlying storage to represent the next state and next action in a typical transition in reinforcement learning.
Keyword arguments
capacity::Int
state = Int
action = Int
reward = Float32
terminal = Bool
Example
julia> t = CircularVectorSARTTrajectory(;
capacity = 3,
state = Vector{Int},
action = Int,
reward = Float32,
terminal = Bool,
)
Trajectory of 4 traces:
:state 0-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}
:action 0-element CircularArrayBuffers.CircularVectorBuffer{Int64}
:reward 0-element CircularArrayBuffers.CircularVectorBuffer{Float32}
:terminal 0-element CircularArrayBuffers.CircularVectorBuffer{Bool}
julia> for i in 1:4
push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
end
julia> push!(t;state=ones(Int,4) .* 5, action = 5)
julia> t[:state]
4-element CircularArrayBuffers.CircularVectorBuffer{Vector{Int64}}:
[2, 2, 2, 2]
[3, 3, 3, 3]
[4, 4, 4, 4]
[5, 5, 5, 5]
julia> t[:action]
4-element CircularArrayBuffers.CircularVectorBuffer{Int64}:
2
3
4
5
julia> t[:reward]
3-element CircularArrayBuffers.CircularVectorBuffer{Float32}:
1.0
1.5
2.0
julia> t[:terminal]
3-element CircularArrayBuffers.CircularVectorBuffer{Bool}:
1
0
1
ReinforcementLearningCore.ComposedHook
— TypeComposedHook(hooks::AbstractHook...)
Compose different hooks into a single hook.
ReinforcementLearningCore.ComposedStopCondition
— TypeComposedStopCondition(stop_conditions...; reducer = any)
The result of stop_conditions
is reduced by reducer
.
ReinforcementLearningCore.CovGaussianNetwork
— TypeCovGaussianNetwork(;pre=identity, μ, Σ, normalizer = tanh)
Returns μ and Σ when called, where μ is the mean and Σ is a covariance matrix. Unlike GaussianNetwork, the output is 3-dimensional: μ has dimensions (action_size x 1 x batch_size) and Σ has dimensions (action_size x action_size x batch_size). The Σ head of the CovGaussianNetwork
should not directly return a square matrix but a vector of length action_size * (action_size + 1) ÷ 2. This vector contains the elements of the upper triangular Cholesky decomposition of the covariance matrix, from which Σ is reconstructed. Sample from
MvNormal.(μ, Σ). Actions are normalized elementwise according to the specified normalizer function.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(state, action)
Return the logpdf of the model sampling action
when in state
. State must be a 3D tensor with dimensions (state_size x 1 x batch_size). Multiple actions may be taken per state; action
must have dimensions (action_size x action_samples_per_state x batch_size). Returns a 3D tensor with dimensions (1 x action_samples_per_state x batch_size).
ReinforcementLearningCore.CovGaussianNetwork
— MethodIf given 2D matrices as input, will return a 2D matrix of logpdf. States and actions are paired column-wise, one action per state.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state::AbstractMatrix; is_sampling::Bool=false, is_return_log_prob::Bool=false)
Given a Matrix of states, will return actions, μ and logpdf in matrix format. The batch of Σ remains a 3D tensor.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state, action_samples::Int)
Sample action_samples
actions given state
and return the actions, logpdf(actions)
. This function is compatible with a multidimensional action space. When outputting a sampled action, it uses the normalizer
function to normalize it elementwise. The outputs are 3D tensors with dimensions (action_size x action_samples x batch_size) and (1 x action_samples x batch_size) for actions
and logpdf
respectively.
ReinforcementLearningCore.CovGaussianNetwork
— Method(model::CovGaussianNetwork)(rng::AbstractRNG, state; is_sampling::Bool=false, is_return_log_prob::Bool=false)
This function is compatible with a multidimensional action space. When outputting a sampled action, it uses the normalizer
function to normalize it elementwise. To work with covariance matrices, the outputs are 3D tensors. If sampling, returns an actions tensor with dimensions (action_size x action_samples x batch_size) and logp_π with dimensions (1 x action_samples x batch_size). If not sampling, returns μ with dimensions (action_size x 1 x batch_size) and L, the lower triangular factor of the Cholesky decomposition of the covariance matrix, with dimensions (action_size x action_size x batch_size). The covariance matrices can be retrieved with Σ = Flux.stack(map(l -> l*l', eachslice(L, dims=3)), 3).
rng::AbstractRNG=Random.GLOBAL_RNG
is_sampling::Bool=false, whether to sample from the obtained normal distribution.
is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.DoEveryNEpisode
— TypeDoEveryNEpisode(f; n=1, t=0)
Execute f(t, agent, env)
every n
episodes. t
is a counter of episodes.
ReinforcementLearningCore.DoEveryNStep
— TypeDoEveryNStep(f; n=1, t=0)
Execute f(t, agent, env)
every n
steps. t
is a counter of steps.
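For example, a minimal sketch of logging progress every 1000 steps using the do-block syntax:
hook = DoEveryNStep(; n = 1000) do t, agent, env
    @info "training progress" step = t
end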
ReinforcementLearningCore.DoOnExit
— TypeDoOnExit(f)
Call the lambda function f
at the end of an Experiment
.
ReinforcementLearningCore.DuelingNetwork
— TypeDuelingNetwork(;base, val, adv)
A dueling network produces separate estimates of the state value function and the advantage function. The expected output size of val is 1, and the output size of adv is the size of the action space.
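A minimal sketch with Flux (the layer sizes are made up; assume a 4-dimensional state and 2 actions):
using Flux

net = DuelingNetwork(;
    base = Dense(4, 16, relu),   # shared feature layer
    val  = Dense(16, 1),         # state-value head, output size 1
    adv  = Dense(16, 2),         # advantage head, one output per action
)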
ReinforcementLearningCore.ElasticSARTTrajectory
— MethodElasticSARTTrajectory(;kw...)
A specialized ElasticArrayTrajectory
with traces of SART
.
Keyword arguments
state::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => (), by default this means the state is a scalar of Int.
action::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Int => ()
reward::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Float32 => ()
terminal::Pair{<:DataType, <:Tuple{Vararg{Int}}} = Bool => ()
Example
julia> t = ElasticSARTTrajectory(;
state = Vector{Int} => (4,),
action = Int => (),
reward = Float32 => (),
terminal = Bool => (),
)
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}
julia> for i in 1:4
push!(t;state=ones(Int, 4) .* i, action = i, reward=i/2, terminal=iseven(i))
end
julia> push!(t;state=ones(Int,4) .* 5, action = 5)
julia> t
Trajectory of 4 traces:
:state 4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}
julia> t[:state]
4×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
julia> t[:action]
5-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}:
1
2
3
4
5
julia> t[:reward]
4-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}:
0.5
1.0
1.5
2.0
julia> t[:terminal]
4-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}:
0
1
0
1
julia> empty!(t)
julia> t
Trajectory of 4 traces:
:state 4×0 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}
:action 0-element ElasticArrays.ElasticVector{Int64, Vector{Int64}}
:reward 0-element ElasticArrays.ElasticVector{Float32, Vector{Float32}}
:terminal 0-element ElasticArrays.ElasticVector{Bool, Vector{Bool}}
ReinforcementLearningCore.EmptyHook
— TypeA hook that does nothing.
ReinforcementLearningCore.EpsilonGreedyExplorer
— TypeEpsilonGreedyExplorer{T}(;kwargs...)
EpsilonGreedyExplorer(ϵ) -> EpsilonGreedyExplorer{:linear}(; ϵ_stable = ϵ)
Epsilon-greedy strategy: The best lever is selected for a proportion
1 - epsilon
of the trials, and a lever is selected at random (with uniform probability) for a proportion epsilon. — Multi-armed bandit
Two kinds of epsilon-decreasing strategy are implemented here (linear
and exp
).
Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish. — Multi-armed bandit
Keywords
T::Symbol: defines how to calculate the epsilon in the warmup steps. Supported values are linear and exp.
step::Int = 1: record the current step.
ϵ_init::Float64 = 1.0: initial epsilon.
warmup_steps::Int=0: the number of steps to use ϵ_init.
decay_steps::Int=0: the number of steps for epsilon to decay from ϵ_init to ϵ_stable.
ϵ_stable::Float64: the epsilon after warmup_steps + decay_steps.
is_break_tie=false: randomly select one of the actions with the maximum value if set to true.
rng=Random.GLOBAL_RNG: set the internal RNG.
is_training=true: when not in training mode, step will not be updated and ϵ will be set to 0.
Example
s_lin = EpsilonGreedyExplorer(kind=:linear, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot([RLCore.get_ϵ(s_lin, i) for i in 1:500], label="linear epsilon")
s_exp = EpsilonGreedyExplorer(kind=:exp, ϵ_init=0.9, ϵ_stable=0.1, warmup_steps=100, decay_steps=100)
plot!([RLCore.get_ϵ(s_exp, i) for i in 1:500], label="exp epsilon")
ReinforcementLearningCore.EpsilonGreedyExplorer
— Method(s::EpsilonGreedyExplorer)(values; step) where T
If multiple values share the same maximum, a random one among them will be returned.
NaN values will be filtered out unless all the values are NaN; in that case, a random one will be returned.
ReinforcementLearningCore.Experiment
— TypeExperiment(policy, env, stop_condition, hook, description)
These are the essential components of a typical reinforcement learning experiment:
policy, generates an action during the interaction with the env. It may update its strategy along the way.
env, the environment we're going to experiment with.
stop_condition, defines when the experiment terminates.
hook, collects some intermediate data during the experiment.
description, displays some useful information for logging.
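A minimal sketch of wiring these components together (RandomWalk1D is assumed to come from ReinforcementLearningEnvironments and is used only for illustration; any AbstractEnv works here):
using ReinforcementLearningEnvironments  # assumed dependency, provides RandomWalk1D

ex = Experiment(
    RandomPolicy(),
    RandomWalk1D(),
    StopAfterStep(1000),
    TotalRewardPerEpisode(),
    "# random policy on a toy environment",
)
run(ex)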
ReinforcementLearningCore.GaussianNetwork
— TypeGaussianNetwork(;pre=identity, μ, logσ, min_σ=0f0, max_σ=Inf32, normalizer = tanh)
Returns μ
and logσ
when called. Create a distribution to sample from using Normal.(μ, exp.(logσ))
. min_σ
and max_σ
are used to clip the output from logσ
. Actions are normalized according to the specified normalizer function.
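A minimal sketch with Flux (the layer sizes are made up; assume a 3-dimensional state and a 1-dimensional action):
using Flux

gn = GaussianNetwork(;
    pre  = Dense(3, 16, relu),   # shared feature layer
    μ    = Dense(16, 1),         # mean head
    logσ = Dense(16, 1),         # log standard deviation head
)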
ReinforcementLearningCore.GaussianNetwork
— Method(model::GaussianNetwork)(rng::AbstractRNG, state, action_samples::Int)
Sample action_samples
actions from each state. Returns a 3D tensor with dimensions (action_size x action_samples x batch_size). state
must be a 3D tensor with dimensions (state_size x 1 x batch_size). The logpdf of each action is always returned along with the actions.
ReinforcementLearningCore.GaussianNetwork
— MethodThis function is compatible with a multidimensional action space. When outputting an action, it uses the normalizer
function to normalize it elementwise.
rng::AbstractRNG=Random.GLOBAL_RNG
is_sampling::Bool=false, whether to sample from the obtained normal distribution.
is_return_log_prob::Bool=false, whether to calculate the conditional probability of getting actions in the given state.
ReinforcementLearningCore.MultiAgentHook
— TypeMultiAgentHook(player=>hook...)
ReinforcementLearningCore.MultiAgentManager
— MethodMultiAgentManager(player => policy...)
This is the simplest form of a multi-agent system. At each step, each agent observes the environment from its own perspective and is updated independently. For environments of SEQUENTIAL
style, agents which are not the current player will observe a dummy action of NO_OP
in the PreActStage
. For environments of SIMULTANEOUS
style, please wrap it with SequentialEnv
first.
ReinforcementLearningCore.NamedPolicy
— TypeNamedPolicy(name=>policy)
A policy wrapper to provide a name. Mostly used in multi-agent environments.
ReinforcementLearningCore.NeuralNetworkApproximator
— TypeNeuralNetworkApproximator(;kwargs)
Use a DNN model for value estimation.
Keyword arguments
model, a Flux-based DNN model.
optimizer=nothing
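A minimal sketch (the network architecture is made up; assume 4-dimensional states and 2 actions):
using Flux

approximator = NeuralNetworkApproximator(
    model = Chain(Dense(4, 32, relu), Dense(32, 2)),
    optimizer = ADAM(),
)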
ReinforcementLearningCore.NoOp
— TypeRepresent no-operation if it's not the agent's turn.
ReinforcementLearningCore.PerturbationNetwork
— MethodThis function accepts state
and action
, and then outputs the perturbed actions.
ReinforcementLearningCore.QBasedPolicy
— TypeQBasedPolicy(;learner::Q, explorer::S)
Use the learner
to estimate action values. Then the explorer
is applied to those estimates to select an action.
ReinforcementLearningCore.RandomPolicy
— TypeRandomPolicy(action_space=nothing; rng=Random.GLOBAL_RNG)
If action_space
is nothing
, then it will use the legal_action_space
at runtime to randomly select an action. Otherwise, a random element within action_space
is selected.
You should always set action_space=nothing
when dealing with environments of FULL_ACTION_SET
.
ReinforcementLearningCore.RewardsPerEpisode
— TypeRewardsPerEpisode(; rewards = Vector{Vector{Float64}}())
Store each reward of each step in every episode in the field of rewards
.
ReinforcementLearningCore.StackFrames
— TypeStackFrames(::Type{T}=Float32, d::Int...)
Use a pre-initialized CircularArrayBuffer
to store the latest several states specified by d
. Before processing any observation, the buffer is filled with zero(T) by default.
ReinforcementLearningCore.StepsPerEpisode
— TypeStepsPerEpisode(; steps = Int[], count = 0)
Store the number of steps taken in each episode in the field steps
.
ReinforcementLearningCore.StopAfterEpisode
— TypeStopAfterEpisode(episode; cur = 0, is_show_progress = true)
Return true
after being called episode times. If is_show_progress
is true
, the ProgressMeter
will be used to show progress.
ReinforcementLearningCore.StopAfterNSeconds
— TypeStopAfterNSeconds
Parameter: time budget.
Stop training after N seconds.
ReinforcementLearningCore.StopAfterNoImprovement
— TypeStopAfterNoImprovement()
Stop training when a monitored metric has stopped improving.
Parameters:
fn: a closure that returns a scalar value indicating the performance of the policy (the higher the better), e.g.
- () -> reward(env)
- () -> total_reward_per_episode.reward
patience: number of epochs with no improvement after which training will be stopped.
δ: minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than δ counts as no improvement.
Return true
after the monitored metric has stopped improving.
ReinforcementLearningCore.StopAfterStep
— TypeStopAfterStep(step; cur = 1, is_show_progress = true)
Return true
after being called step
times.
ReinforcementLearningCore.StopSignal
— TypeStopSignal()
Create a stop signal initialized with a value of false
. You can manually set it to true
by s[] = true
to stop the running loop at any time.
ReinforcementLearningCore.StopWhenDone
— TypeStopWhenDone()
Return true
if the environment is terminated.
ReinforcementLearningCore.SumTree
— TypeSumTree(capacity::Int)
Efficiently sample and update weights. Here we use a vector to represent the binary tree. Suppose we will have at most capacity
leaves. Every time we push!
a new node into the tree, only the most recent capacity
nodes and their parent sums are updated. The underlying vector is laid out as:
[–––––– Parent nodes ––––––][–––– leaves ––––]
[size: 2^ceil(Int, log2(capacity))-1 ][ size: capacity ]
Example
julia> t = SumTree(8)
0-element SumTree
julia> for i in 1:16
push!(t, i)
end
julia> t
8-element SumTree:
9.0
10.0
11.0
12.0
13.0
14.0
15.0
16.0
julia> sample(t)
(2, 10.0)
julia> sample(t)
(1, 9.0)
julia> inds, ps = sample(t,100000)
([8, 4, 8, 1, 5, 2, 2, 7, 6, 6 … 1, 1, 7, 1, 6, 1, 5, 7, 2, 7], [16.0, 12.0, 16.0, 9.0, 13.0, 10.0, 10.0, 15.0, 14.0, 14.0 … 9.0, 9.0, 15.0, 9.0, 14.0, 9.0, 13.0, 15.0, 10.0, 15.0])
julia> countmap(inds)
Dict{Int64,Int64} with 8 entries:
7 => 14991
4 => 12019
2 => 10003
3 => 11027
5 => 12971
8 => 16052
6 => 13952
1 => 8985
julia> countmap(ps)
Dict{Float64,Int64} with 8 entries:
9.0 => 8985
13.0 => 12971
10.0 => 10003
14.0 => 13952
16.0 => 16052
11.0 => 11027
15.0 => 14991
12.0 => 12019
ReinforcementLearningCore.TabularApproximator
— TypeTabularApproximator(table<:AbstractArray, opt)
For table
of 1-d, it will serve as a state value approximator. See TabularVApproximator
. For table
of 2-d, it will serve as a state-action value approximator. See TabularQApproximator
.
Note that actions and states should be presented to TabularApproximator
as integers starting from 1 to be used as the index of the table. That is, e.g., RLBase.state_space
is expected to return Base.OneTo(n_state)
, where n_state
is the number of states.
For table
of 2-d, the first dimension is action and the second dimension is state.
ReinforcementLearningCore.TabularQApproximator
— MethodTabularQApproximator(; n_state, n_action, init = 0.0, opt = InvDecay(1.0))
A state-action value approximator represented by a 2-d table. init
is the initial value of each state-action pair.
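A minimal sketch (the sizes are arbitrary; the table layout follows the note under TabularApproximator above):
q = TabularQApproximator(; n_state = 10, n_action = 4)
size(q.table)   # (4, 10): the first dimension is action, the second is state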
ReinforcementLearningCore.TabularRandomPolicy
— TypeTabularRandomPolicy(;table=Dict{Int, Float32}(), rng=Random.GLOBAL_RNG)
Use a Dict
to store action distribution.
ReinforcementLearningCore.TabularVApproximator
— MethodTabularVApproximator(; n_state, init = 0.0, opt = InvDecay(1.0))
A state value approximator represented by a 1-d table. init
is the initial value of each state.
ReinforcementLearningCore.TimePerStep
— TypeTimePerStep(;max_steps=100)
TimePerStep(times::CircularArrayBuffer{Float64}, t::UInt64)
Store the time cost of each of the latest max_steps steps
in the times
field.
ReinforcementLearningCore.TotalBatchRewardPerEpisode
— MethodTotalBatchRewardPerEpisode(batch_size::Int; is_display_on_exit=true)
Similar to TotalRewardPerEpisode
, but is specific to environments which return a Vector
of rewards (a typical case with MultiThreadEnv
). If is_display_on_exit
is set to true
, a ribbon plot will be shown to reflect the mean and std of rewards.
ReinforcementLearningCore.TotalRewardPerEpisode
— TypeTotalRewardPerEpisode(; rewards = Float64[], reward = 0.0, is_display_on_exit = true)
Store the total reward of each episode in the field of rewards
. If is_display_on_exit
is set to true
, a unicode plot will be shown at the PostExperimentStage
.
ReinforcementLearningCore.Trajectory
— TypeTrajectory(;[trace_name=trace_container]...)
A simple wrapper of NamedTuple
. Define our own type here to avoid type piracy with NamedTuple
ReinforcementLearningCore.UCBExplorer
— MethodUCBExplorer(na; c=2.0, ϵ=1e-10, step=1, seed=nothing)
Arguments
na is the number of actions, used to create an internal counter.
t is used to store the current time step.
c is used to control the degree of exploration.
seed, set the seed of the inner RNG.
is_training=true, in training mode, the time step and counter will not be updated.
ReinforcementLearningCore.UploadTrajectoryEveryNStep
— TypeUploadTrajectoryEveryNStep(;mailbox, n, sealer=deepcopy)
ReinforcementLearningCore.VAE
— TypeVAE(;encoder, decoder, latent_dims)
ReinforcementLearningCore.VBasedPolicy
— TypeVBasedPolicy(;learner, mapping=default_value_action_mapping)
The learner
must be a value learner. The mapping
is a function which returns an action given env
and the learner
. By default we iterate through all the valid actions and select the one that leads to the maximum state value.
ReinforcementLearningCore.VectorSARTTrajectory
— MethodVectorSARTTrajectory(;kw...)
A specialized VectorTrajectory
with traces of SART
.
Keyword arguments
state::DataType = Int
action::DataType = Int
reward::DataType = Float32
terminal::DataType = Bool
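A minimal sketch using the defaults above (scalar Int states and actions):
t = VectorSARTTrajectory()
push!(t; state = 1, action = 1, reward = 0.5f0, terminal = false)
length(t[:state])   # 1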
ReinforcementLearningCore.VectorSATrajectory
— MethodVectorSATrajectory(;kw...)
A specialized VectorTrajectory
with traces of (:state, :action)
.
Keyword arguments
state::DataType = Int
action::DataType = Int
ReinforcementLearningCore.WeightedExplorer
— TypeWeightedExplorer(;is_normalized::Bool, rng=Random.GLOBAL_RNG)
is_normalized
is used to indicate whether the action values fed to the explorer are already normalized to sum to 1.0
.
Elements are assumed to be >=0
.
See also: WeightedSoftmaxExplorer
ReinforcementLearningCore.WeightedSoftmaxExplorer
— TypeWeightedSoftmaxExplorer(;rng=Random.GLOBAL_RNG)
See also: WeightedExplorer
Base.push!
— MethodWhen pushing a StackFrames
into a CircularArrayBuffer
of the same dimension, only the latest frame is pushed. If the StackFrames
is one dimension lower, then it is treated as a general AbstractArray
and is pushed in as a frame.
CUDA.device
— Methoddevice(model)
Detect the suitable running device for the model
. Return Val(:cpu)
by default.
ReinforcementLearningBase.priority
— Methodpriority(p::AbstractLearner, experience)
ReinforcementLearningBase.prob
— Methodprob(p::AbstractExplorer, x, mask)
Similar to prob(p::AbstractExplorer, x)
, but here only the masked elements are considered.
ReinforcementLearningBase.prob
— Methodprob(p::AbstractExplorer, x) -> AbstractDistribution
Get the action distribution given action values.
ReinforcementLearningBase.prob
— Methodprob(s::EpsilonGreedyExplorer, values) ->Categorical
prob(s::EpsilonGreedyExplorer, values, mask) ->Categorical
Return the probability of selecting each action given the estimated values
of each action.
ReinforcementLearningBase.update!
— Methodupdate!(a::AbstractApproximator, correction)
Usually the correction
is the gradient of inner parameters.
ReinforcementLearningBase.update!
— Methodupdate!(p::TabularRandomPolicy, state => value)
You should manually check that value sums to 1.0.
ReinforcementLearningCore.ApproximatorStyle
— MethodUsed to detect what an AbstractApproximator
is approximating.
ReinforcementLearningCore.CircularArrayTrajectory
— MethodCircularArrayTrajectory(; capacity::Int, kw::Pair{<:DataType, <:Tuple{Vararg{Int}}}...)
A specialized Trajectory
which uses CircularArrayBuffer
as the underlying storage. kw
specifies the name, the element type and the size of each trace. capacity
is used to define the maximum length of the underlying buffer.
See also CircularArraySARTTrajectory
, CircularArraySLARTTrajectory
, CircularArrayPSARTTrajectory
.
ReinforcementLearningCore.CircularVectorTrajectory
— MethodCircularVectorTrajectory(;capacity, kw::DataType)
Similar to CircularArrayTrajectory
, except that the underlying storage is CircularVectorBuffer
.
Note the different type of the kw
between CircularVectorTrajectory
and CircularArrayTrajectory
. With CircularVectorBuffer
as the underlying storage, we don't need the size info.
See also CircularVectorSARTTrajectory
, CircularVectorSARTSATrajectory
.
ReinforcementLearningCore.ElasticArrayTrajectory
— MethodElasticArrayTrajectory(;[trace_name::Pair{<:DataType, <:Tuple{Vararg{Int}}}]...)
A specialized Trajectory
which uses ElasticArray
as the underlying storage. See also ElasticSARTTrajectory
.
ReinforcementLearningCore.VectorTrajectory
— MethodVectorTrajectory(;[trace_name::DataType]...)
A Trajectory
with each trace using a Vector
as the storage.
ReinforcementLearningCore._discount_rewards!
— MethodAssuming rewards and new_rewards are Vectors.
ReinforcementLearningCore._generalized_advantage_estimation!
— MethodAssuming rewards and advantages are Vectors.
ReinforcementLearningCore.check
— MethodInject customized checks here by overriding this function.
ReinforcementLearningCore.consecutive_view
— Methodconsecutive_view(x::AbstractArray, inds; n_stack = nothing, n_horizon = nothing)
By default, it behaves the same as select_last_dim(x, inds)
. If n_stack
is set to an int, then for each frame specified by inds
, the previous n_stack
frames (including the current one) are concatenated as a new dimension. If n_horizon
is set to an int, then for each frame specified by inds
, the next n_horizon
frames (including the current one) are concatenated as a new dimension.
Example
julia> x = collect(1:5)
5-element Array{Int64,1}:
1
2
3
4
5
julia> consecutive_view(x, [2,4]) # just the same with `select_last_dim(x, [2,4])`
2-element view(::Array{Int64,1}, [2, 4]) with eltype Int64:
2
4
julia> consecutive_view(x, [2,4];n_stack = 2)
2×2 view(::Array{Int64,1}, [1 3; 2 4]) with eltype Int64:
1 3
2 4
julia> consecutive_view(x, [2,4];n_horizon = 2)
2×2 view(::Array{Int64,1}, [2 4; 3 5]) with eltype Int64:
2 4
3 5
julia> consecutive_view(x, [2,4];n_horizon = 2, n_stack=2) # note the order here, first we stack, then we apply the horizon
2×2×2 view(::Array{Int64,1}, [1 2; 2 3]
[3 4; 4 5]) with eltype Int64:
[:, :, 1] =
1 2
2 3
[:, :, 2] =
3 4
4 5
See also Frame Skipping and Preprocessing for Deep Q networks to gain a better understanding of state stacking and n-step learning.
ReinforcementLearningCore.discount_rewards
— Methoddiscount_rewards(rewards::VectorOrMatrix, γ::Number;kwargs...)
Calculate the discounted return starting from the current step, with discount rate γ
. rewards
can be a matrix.
Keyword arguments
dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
init=nothing, init can be used to provide the reward estimation of the last state.
Example
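As an illustration of the recurrence G_t = r_t + γ * G_{t+1} (a hand-worked sketch; the exact element type of the result may differ):
# With rewards [1, 1, 1] and γ = 0.5:
#   G_3 = 1
#   G_2 = 1 + 0.5 * 1.0 = 1.5
#   G_1 = 1 + 0.5 * 1.5 = 1.75
discount_rewards([1.0, 1.0, 1.0], 0.5)   # expected ≈ [1.75, 1.5, 1.0]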
ReinforcementLearningCore.flatten_batch
— Methodflatten_batch(x::AbstractArray)
Merge the last two dimensions.
Example
julia> x = reshape(1:12, 2, 2, 3)
2×2×3 reshape(::UnitRange{Int64}, 2, 2, 3) with eltype Int64:
[:, :, 1] =
1 3
2 4
[:, :, 2] =
5 7
6 8
[:, :, 3] =
9 11
10 12
julia> flatten_batch(x)
2×6 reshape(::UnitRange{Int64}, 2, 6) with eltype Int64:
1 3 5 7 9 11
2 4 6 8 10 12
ReinforcementLearningCore.generalized_advantage_estimation
— Methodgeneralized_advantage_estimation(rewards::VectorOrMatrix, values::VectorOrMatrix, γ::Number, λ::Number;kwargs...)
Calculate the generalized advantage estimate starting from the current step, with discount rate γ
and GAE-λ parameter λ. rewards
and values can be matrices.
Keyword arguments
dims=:, if rewards is a Matrix, then dims can only be 1 or 2.
terminal=nothing, specify whether each reward is followed by a terminal. nothing means the game is not terminated yet. If terminal is provided, its size must be the same as rewards.
Example
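As an illustration, here is a reference implementation of the GAE recurrence this function is based on (not the library's implementation; it assumes values carries one extra bootstrap value for the state after the last reward, and the library's exact argument conventions may differ):
# δ_t = r_t + γ * V(s_{t+1}) - V(s_t)
# A_t = δ_t + γ * λ * A_{t+1}
function gae_reference(rewards, values, γ, λ)
    advantages = similar(rewards)
    gae = zero(eltype(rewards))
    for t in length(rewards):-1:1
        δ = rewards[t] + γ * values[t+1] - values[t]
        gae = δ + γ * λ * gae
        advantages[t] = gae
    end
    return advantages
end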
ReinforcementLearningCore.logdetLorU
— MethodlogdetLorU(LorU::AbstractMatrix)
Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U. Has a sign ambiguity for non-PSD matrices.
ReinforcementLearningCore.mvnormlogpdf
— Methodmvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)
GPU-compatible, automatically differentiable version of the logpdf function of multivariate normal distributions. Takes as inputs μ,
the mean vector, L,
the lower triangular matrix of the Cholesky decomposition of the covariance matrix, and x,
a matrix of samples where each column is a sample. Return a Vector containing the logpdf of each column of x for the MvNormal
parametrized by μ
and Σ = L*L'
.
ReinforcementLearningCore.mvnormlogpdf
— Methodmvnormlogpdf(μ::A, LorU::A, x::A; ϵ = 1f-8) where A <: AbstractArray
Batch version that takes 3D tensors as input where each slice along the 3rd dimension is a batch sample. μ
is an (action_size x 1 x batch_size) array, L
is an (action_size x action_size x batch_size) array, and x is an (action_size x action_samples x batch_size) array. Return a 3D array of size (1 x action_samples x batch_size).
ReinforcementLearningCore.normlogpdf
— Method normlogpdf(μ, σ, x; ϵ = 1.0f-8)
GPU-compatible, automatically differentiable version of the logpdf function of normal distributions. An epsilon value is added to guarantee numerical stability if σ is exactly zero (e.g. if relu is used in the output layer).
ReinforcementLearningCore.vec_to_tril
— MethodTransform a vector containing the non-zero elements of a lower triangular da x da matrix into that matrix.
StatsBase.sample
— Methodsample([rng=Random.GLOBAL_RNG], trajectory, sampler, [traces=Val(keys(trajectory))])
Here we return a copy instead of a view:
- Each sample is independent of the original trajectory, so that the trajectory can be updated asynchronously.
- Copying is not always that costly.