# ReinforcementLearningZoo.jl

`ReinforcementLearningZoo.A2CGAELearner` — Type

`A2CGAELearner(;kwargs...)`

**Keyword arguments**

- `approximator`, an `ActorCritic`-based `NeuralNetworkApproximator`.
- `γ::Float32`, reward discount rate.
- `λ::Float32`, lambda for GAE-lambda.
- `actor_loss_weight::Float32`
- `critic_loss_weight::Float32`
- `entropy_loss_weight::Float32`

`ReinforcementLearningZoo.A2CLearner` — Type

`A2CLearner(;kwargs...)`

**Keyword arguments**

- `approximator`::`ActorCritic`
- `γ::Float32`, reward discount rate.
- `actor_loss_weight::Float32`
- `critic_loss_weight::Float32`
- `entropy_loss_weight::Float32`
- `update_freq::Int`, usually set to the same value as the length of the trajectory.
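A minimal sketch of constructing an `A2CLearner`; the state/action sizes `ns`/`na` and all hyperparameter values are illustrative assumptions, not library defaults:

```julia
using Flux
using ReinforcementLearning

ns, na = 4, 2  # hypothetical state and action dimensions

learner = A2CLearner(
    # the ActorCritic bundles the policy head (actor) and the value head (critic)
    approximator = ActorCritic(
        actor = Chain(Dense(ns, 64, relu), Dense(64, na)),
        critic = Chain(Dense(ns, 64, relu), Dense(64, 1)),
        optimizer = ADAM(1e-3),
    ),
    γ = 0.99f0,
    actor_loss_weight = 1.0f0,
    critic_loss_weight = 0.5f0,
    entropy_loss_weight = 0.01f0,
    update_freq = 16,  # match the trajectory length, as noted above
)
```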

`ReinforcementLearningZoo.BCQDLearner` — Type

`BCQDLearner(;kwargs...)`

See paper: Benchmarking Batch Deep Reinforcement Learning Algorithms.

**Keyword arguments**

- `approximator`::`ActorCritic`: used to get Q-values (Critic) and logits (Actor) of a state.
- `target_approximator`::`ActorCritic`: similar to `approximator`, but used to estimate the target.
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `θ::Float32 = 0.99f0`, regularization coefficient.
- `threshold::Float32 = 0.3f0`, determines whether an action can be used to calculate the Q value.
- `batch_size::Int = 32`
- `update_freq::Int`: the frequency of updating the `approximator`.
- `update_step::Int = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.BCQLearner` — Method

`BCQLearner(;kwargs...)`

See paper: Off-Policy Deep Reinforcement Learning without Exploration.

**Keyword arguments**

- `policy`, used to get the action with perturbation. This can be implemented using a `PerturbationNetwork` in a `NeuralNetworkApproximator`.
- `target_policy`, similar to `policy`, but used to estimate the target. This can be implemented using a `PerturbationNetwork` in a `NeuralNetworkApproximator`.
- `qnetwork1`, used to get Q-values.
- `qnetwork2`, used to get Q-values.
- `target_qnetwork1`, used to estimate the target Q-values.
- `target_qnetwork2`, used to estimate the target Q-values.
- `vae`, used for sampling actions. This can be implemented using a `VAE` in a `NeuralNetworkApproximator`.
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `λ::Float32 = 0.75f0`, used for Clipped Double Q-learning.
- `p::Int = 10`, the number of state-action pairs used when calculating the Q value.
- `batch_size::Int = 32`
- `start_step::Int = 1000`
- `update_freq::Int = 50`, the frequency of updating the `approximator`.
- `update_step::Int = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.BEARLearner` — Method

`BEARLearner(;kwargs...)`

See paper: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. This implementation follows the official Python code.

**Keyword arguments**

- `policy`, used to get the latent action.
- `target_policy`, similar to `policy`, but used to estimate the target.
- `qnetwork1`, used to get Q-values.
- `qnetwork2`, used to get Q-values.
- `target_qnetwork1`, used to estimate the target Q-values.
- `target_qnetwork2`, used to estimate the target Q-values.
- `vae`, used for sampling actions to calculate the MMD loss. This can be implemented using a `VAE` in a `NeuralNetworkApproximator`.
- `log_α`, Lagrange multiplier implemented by a `NeuralNetworkApproximator`.
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `λ::Float32 = 0.75f0`, used for Clipped Double Q-learning.
- `ε::Float32 = 0.05f0`, threshold of the MMD loss.
- `p::Int = 10`, the number of state-action pairs used when calculating the Q value.
- `max_log_α::Float32 = 10.0f0`, maximum value of `log_α`.
- `min_log_α::Float32 = 10.0f0`, minimum value of `log_α`.
- `sample_num::Int = 10`, the number of sampled actions used to calculate the MMD loss.
- `kernel_type::Symbol = :laplacian`, the kernel used for calculating the MMD loss. Possible values: `:laplacian`/`:gaussian`.
- `mmd_σ::Float32 = 10.0f0`, the kernel parameter used for calculating the MMD loss.
- `batch_size::Int = 32`
- `update_freq::Int = 50`, the frequency of updating the `approximator`.
- `update_step::Int = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.BasicDQNLearner` — Type

`BasicDQNLearner(;kwargs...)`

See paper: Playing Atari with Deep Reinforcement Learning.

This is the most basic implementation of DQN. Compared to traditional Q-learning, the only difference is that, in the update step, it uses a batch of transitions sampled from an experience buffer instead of the current transition alone. The `approximator` is usually a `NeuralNetworkApproximator`. You can start from this implementation to understand how everything is organized and how to write your own customized algorithm.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the Q-values of a state.
- `loss_func`: the loss function to use.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `rng=Random.GLOBAL_RNG`
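As a sketch of how the pieces fit together, here is a hypothetical `BasicDQNLearner` wrapped in a `QBasedPolicy`; the network sizes and hyperparameter values are illustrative assumptions:

```julia
using Flux
using ReinforcementLearning

ns, na = 4, 2  # hypothetical state and action dimensions

policy = QBasedPolicy(
    learner = BasicDQNLearner(
        approximator = NeuralNetworkApproximator(
            model = Chain(Dense(ns, 128, relu), Dense(128, na)),
            optimizer = ADAM(),
        ),
        loss_func = Flux.Losses.huber_loss,
        γ = 0.99f0,
        batch_size = 32,
        min_replay_history = 100,
    ),
    explorer = EpsilonGreedyExplorer(ϵ_stable = 0.01),
)
```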

`ReinforcementLearningZoo.BehaviorCloningPolicy` — Method

`BehaviorCloningPolicy(;kw...)`

**Keyword Arguments**

- `approximator`: calculates the logits of possible actions directly.
- `explorer=GreedyExplorer()`
- `batch_size::Int = 32`
- `min_reservoir_history::Int = 100`, number of transitions that should be experienced before updating the `approximator`.
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.BestResponsePolicy` — Method

`BestResponsePolicy(policy, env, best_responder)`

- `policy`, the original policy to be wrapped in the best-response policy.
- `env`, the environment to handle.
- `best_responder`, the player who chooses the best-response action.

`ReinforcementLearningZoo.CRRLearner` — Type

`CRRLearner(;kwargs...)`

See paper: Critic Regularized Regression.

**Keyword arguments**

- `approximator`::`ActorCritic`: used to get Q-values (Critic) and logits (Actor) of a state.
- `target_approximator`::`ActorCritic`: similar to `approximator`, but used to estimate the target.
- `γ::Float32`, reward discount rate.
- `batch_size::Int=32`
- `policy_improvement_mode::Symbol=:exp`, type of the weight function f. Possible values: `:binary`/`:exp`.
- `ratio_upper_bound::Float32`, when `policy_improvement_mode` is `:exp`, the value of the exp function is upper-bounded by this parameter.
- `β::Float32`, when `policy_improvement_mode` is `:exp`, this is the denominator of the exp function.
- `advantage_estimator::Symbol=:mean`, type of the advantage estimate \hat{A}. Possible values: `:mean`/`:max`.
- `m::Int=4`, when `continuous=true`, sample `m` actions to estimate \hat{A}.
- `update_freq::Int`: the frequency of updating the `approximator`.
- `update_step::Int=0`
- `target_update_freq::Int`: the frequency of syncing `target_approximator`.
- `continuous::Bool`: type of the action space.
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.DDPGPolicy` — Method

`DDPGPolicy(;kwargs...)`

**Keyword arguments**

- `behavior_actor`
- `behavior_critic`
- `target_actor`
- `target_critic`
- `start_policy`
- `γ = 0.99f0`
- `ρ = 0.995f0`
- `batch_size = 32`
- `start_steps = 10000`
- `update_after = 1000`
- `update_freq = 50`
- `act_limit = 1.0`
- `act_noise = 0.1`
- `update_step = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.DQNLearner` — Method

`DQNLearner(;kwargs...)`

See paper: Human-level control through deep reinforcement learning.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the Q-values of a state.
- `target_approximator`::`AbstractApproximator`: similar to `approximator`, but used to estimate the target (the next state).
- `loss_func`: the loss function.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `update_horizon::Int=1`: length of update ('n' in n-step update).
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `update_freq::Int=4`: the frequency of updating the `approximator`.
- `target_update_freq::Int=100`: the frequency of syncing `target_approximator`.
- `stack_size::Union{Int, Nothing}=4`: use the recent `stack_size` frames to form a stacked state.
- `traces = SARTS`: set to `SLARTSL` if you apply it to an environment of `FULL_ACTION_SET`.
- `rng = Random.GLOBAL_RNG`
- `is_enable_double_DQN::Bool = true`: enable double DQN; enabled by default.
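A minimal `DQNLearner` sketch with a separate target network; `ns`, `na`, and every hyperparameter below are illustrative assumptions rather than recommended values:

```julia
using Flux
using ReinforcementLearning

ns, na = 4, 2  # hypothetical state and action dimensions
build_model() = Chain(Dense(ns, 128, relu), Dense(128, na))

learner = DQNLearner(
    approximator = NeuralNetworkApproximator(model = build_model(), optimizer = ADAM()),
    # same architecture as `approximator`; only changed when synced
    target_approximator = NeuralNetworkApproximator(model = build_model()),
    loss_func = Flux.Losses.huber_loss,
    γ = 0.99f0,
    update_freq = 4,
    target_update_freq = 100,
    stack_size = nothing,  # no frame stacking for a vector observation
)
```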

`ReinforcementLearningZoo.DeepCFR` — Type

`DeepCFR(;kwargs...)`

Symbols used here follow the paper: Deep Counterfactual Regret Minimization.

**Keyword arguments**

- `K`, number of traversals.
- `t`, number of iterations.
- `Π`, the policy network.
- `V`, a dictionary of each player's advantage network.
- `MΠ`, a strategy memory.
- `MV`, a dictionary of each player's advantage memory.
- `reinitialize_freq=1`, the frequency of re-initializing the value networks.

`ReinforcementLearningZoo.DoubleLearner` — Type

`DoubleLearner(;L1, L2, rng=Random.GLOBAL_RNG)`

This is a meta-learner: it randomly selects one learner and updates the other. The estimate of an observation is the sum of the results from the two learners.

`ReinforcementLearningZoo.EDManager` — Type

`EDManager(agents::Dict{Any, EDPolicy})`

A special MultiAgentManager in which all agents use the Exploitability Descent (ED) algorithm to play the game.

`ReinforcementLearningZoo.EDPolicy` — Type

`EDPolicy(opponent, learner, explorer)`

The Exploitability Descent (ED) algorithm directly updates the player's policy against a worst-case opponent (the best response) in a two-player zero-sum game. On each iteration, the ED algorithm performs the following update for each player:

- Construct a (deterministic) best-response policy of the opponent to the current policy.
- Compute the value of every action in every state when playing the current policy against the best response.
- Update the player's current policy to do better against the opponent's best response by performing a policy-gradient update.

**Keyword Arguments**

- `opponent::Any`, the opponent's name.
- `learner::NeuralNetworkApproximator`, used to get the value of each action.
- `explorer::AbstractExplorer`

**Ref**

Computing Approximate Equilibria in Sequential Adversarial Games by Exploitability Descent

`ReinforcementLearningZoo.EnrichedAction` — Type

`EnrichedAction(action;kwargs...)`

Inject some runtime info into the action.

`ReinforcementLearningZoo.ExperienceBasedSamplingModel` — Type

`ExperienceBasedSamplingModel`

Randomly generate a transition of (s, a, r, t, s′) based on previous experiences in each sampling.

`ReinforcementLearningZoo.ExternalSamplingMCCFRPolicy` — Type

`ExternalSamplingMCCFRPolicy`

This implementation uses stochastically-weighted averaging.

Ref:

`ReinforcementLearningZoo.FQE` — Method

`FQE(;kwargs...)`

See paper: Hyperparameter Selection for Offline Reinforcement Learning.

**Keyword arguments**

- `policy`, the policy for which FQE should be performed.
- `q_network`, critic used to evaluate the Q value of a (`state`, `action`) pair.
- `target_q_network`, target critic used for evaluating target Q values.
- `n_evals::Int`, number of evaluations to perform to return the performance of the policy.
- `γ::Float32 = 0.99f0`, discount factor.
- `batch_size::Int = 32`
- `update_freq::Int = 50`, frequency of updating the `target_q_network`.
- `update_step::Int = 0`
- `tar_update_freq::Int = 50`
- `rng::AbstractRNG = Random.GLOBAL_RNG`

`policy` is expected to be a pre-trained `GaussianNetwork` with a particular choice of hyperparameters, preferably trained using the same `dataset`.

`ReinforcementLearningZoo.FisherBRCLearner` — Method

`FisherBRCLearner(;kwargs...)`

See paper: Offline Reinforcement Learning with Fisher Divergence Critic Regularization.

**Keyword arguments**

- `policy`, used to get the action.
- `behavior_policy::EntropyBC`, used to estimate log μ(a|s).
- `qnetwork1`, used to get Q-values.
- `qnetwork2`, used to get Q-values.
- `target_qnetwork1`, used to estimate the target Q-values.
- `target_qnetwork2`, used to estimate the target Q-values.
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `α::Float32 = 0.0f0`, entropy term.
- `f_reg::Float32 = 1.0f0`, the weight of the gradient penalty regularizer.
- `reward_bonus::Float32 = 5.0f0`, extra value added to the reward.
- `batch_size::Int = 32`
- `pretrain_step::Int = 1000`, the number of pre-training rounds.
- `update_freq::Int = 50`, the frequency of updating the `approximator`.
- `lr_alpha::Float32 = 0.003f0`, learning rate for tuning the entropy term.
- `action_dims::Int = 0`, the dimensionality of the action.
- `update_step::Int = 0`
- `rng = Random.GLOBAL_RNG`

`policy` is expected to output a tuple `(μ, logσ)` of means and log standard deviations for the desired action distributions. This can be implemented using a `GaussianNetwork` in a `NeuralNetworkApproximator`.

`ReinforcementLearningZoo.IQNLearner` — Type

`IQNLearner(;kwargs)`

See paper: Implicit Quantile Networks for Distributional Reinforcement Learning.

**Keyword arguments**

- `approximator`, an `ImplicitQuantileNet`.
- `target_approximator`, an `ImplicitQuantileNet`; must have the same structure as `approximator`.
- `κ = 1.0f0`
- `N = 32`
- `N′ = 32`
- `Nₑₘ = 64`
- `K = 32`
- `γ = 0.99f0`
- `stack_size = 4`
- `batch_size = 32`
- `update_horizon = 1`
- `min_replay_history = 20000`
- `update_freq = 4`
- `target_update_freq = 8000`
- `update_step = 0`
- `default_priority = 1.0f2`
- `β_priority = 0.5f0`
- `rng = Random.GLOBAL_RNG`
- `device_seed = nothing`

`ReinforcementLearningZoo.ImplicitQuantileNet` — Type

`ImplicitQuantileNet(;ψ, ϕ, header)`

```
    quantiles (n_action, n_quantiles, batch_size)
                   ↑
                header
                   ↑
    feature ↱      ⨀      ↰ transformed embedding
            ψ             ϕ
            ↑             ↑
            s             τ
```
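A sketch of how the three components might be wired up, assuming hypothetical sizes (`ns` state features, `na` actions, a quantile embedding of size `Nₑₘ`): `ψ` embeds the state `s`, `ϕ` transforms the embedding of `τ`, and `header` maps their elementwise product to quantile values.

```julia
using Flux
using ReinforcementLearning

ns, na, Nₑₘ = 4, 2, 64  # hypothetical dimensions

net = ImplicitQuantileNet(
    ψ = Chain(Dense(ns, 64, relu)),    # state feature extractor
    ϕ = Chain(Dense(Nₑₘ, 64, relu)),   # quantile-embedding transformer
    header = Chain(Dense(64, na)),     # maps feature ⨀ embedding to quantiles
)
```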

`ReinforcementLearningZoo.MADDPGManager` — Type

`MADDPGManager(; agents::Dict{<:Any, <:Agent}, args...)`

Multi-agent Deep Deterministic Policy Gradient (MADDPG) implemented in Julia. By default, `MADDPGManager` is used for simultaneous environments with a continuous action space. See the paper https://arxiv.org/abs/1706.02275 for more details.

**Keyword arguments**

- `agents::Dict{<:Any, <:Agent}`, here each agent collects its own information. While updating the policy, each **critic** assembles all agents' trajectories to update its own network. **Note that** the policy of each `Agent` should be a `DDPGPolicy` wrapped by a `NamedPolicy`; see the related experiments (`MADDPG_KuhnPoker` or `MADDPG_SpeakerListener`) for reference.
- `traces`, set to `SARTS` if you apply it to an environment of `MINIMAL_ACTION_SET`, or `SLARTSL` if you apply it to an environment of `FULL_ACTION_SET`.
- `batch_size::Int`
- `update_freq::Int`
- `update_step::Int`, counts the steps.
- `rng::AbstractRNG`

`ReinforcementLearningZoo.MinimaxPolicy` — Type

`MinimaxPolicy(;value_function, depth::Int)`

The minimax algorithm with alpha-beta pruning.

**Keyword Arguments**

- `maximum_depth::Int=30`, the maximum search depth.
- `value_function=nothing`, estimates the value of `env` with the signature `value_function(env) -> Number`. It is only called after searching to `maximum_depth` while `env` is not yet terminated.
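A minimal sketch, assuming the `maximum_depth` keyword shown above, a two-player board-game `env`, and a hypothetical heuristic that scores a non-terminal position:

```julia
using ReinforcementLearning

# hypothetical heuristic: a real one would inspect the board state of `env`
heuristic(env) = 0.0

policy = MinimaxPolicy(
    maximum_depth = 4,          # cut the search off after 4 plies
    value_function = heuristic, # used only at the depth cut-off
)
```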

`ReinforcementLearningZoo.MonteCarloLearner` — Type

`MonteCarloLearner(;kwargs...)`

Use the Monte Carlo method to estimate state values or state-action values.

**Fields**

- `approximator`::`TabularApproximator`, can be either a `TabularVApproximator` or a `TabularQApproximator`.
- `γ=1.0`, discount rate.
- `kind=FIRST_VISIT`. Optional values are `FIRST_VISIT` or `EVERY_VISIT`.
- `sampling=NO_SAMPLING`. Optional values are `NO_SAMPLING`, `WEIGHTED_IMPORTANCE_SAMPLING` or `ORDINARY_IMPORTANCE_SAMPLING`.
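A minimal sketch of a first-visit Monte Carlo learner over a small tabular state space, assuming the keyword constructor `TabularVApproximator(;n_state, opt)`; the state count and optimizer are illustrative:

```julia
using Flux  # for InvDecay, a decaying tabular "learning rate"
using ReinforcementLearning

learner = MonteCarloLearner(
    # running average of returns over 16 hypothetical states
    approximator = TabularVApproximator(n_state = 16, opt = InvDecay(1.0)),
    γ = 1.0,
    kind = FIRST_VISIT,
)
```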

`ReinforcementLearningZoo.MultiThreadEnv` — Type

`MultiThreadEnv(envs::Vector{<:AbstractEnv})`

Wrap multiple instances of the same environment type into one environment. The environments run in parallel by leveraging `Threads.@spawn`, so remember to set the environment variable `JULIA_NUM_THREADS`!

`ReinforcementLearningZoo.MultiThreadEnv` — Method

`MultiThreadEnv(f, n::Int)`

`f` is a lambda function which creates an `AbstractEnv` when called as `f()`.
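For example, a sketch wrapping 4 copies of `CartPoleEnv` (from ReinforcementLearningEnvironments):

```julia
using ReinforcementLearning

# each call to the lambda builds a fresh, independent environment
env = MultiThreadEnv(() -> CartPoleEnv(), 4)
```

Start Julia with e.g. `JULIA_NUM_THREADS=4` so the copies can actually step in parallel.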

`ReinforcementLearningZoo.NFSPAgent` — Type

`NFSPAgent(; rl_agent::Agent, sl_agent::Agent, args...)`

Neural Fictitious Self-Play (NFSP) agent implemented in Julia. See the paper https://arxiv.org/abs/1603.01121 for more details.

**Keyword arguments**

- `rl_agent::Agent`, the Reinforcement Learning (RL) agent (by default a `QBasedPolicy` here, e.g. `DQN`), which works to search for the best response during the self-play process.
- `sl_agent::Agent`, the Supervised Learning (SL) agent (e.g. a `BehaviorCloningPolicy`), which works to learn the best response from the rl_agent's policy.
- `η`, anticipatory parameter, the probability of using the `ϵ-greedy(Q)` policy when training the agent.
- `rng=Random.GLOBAL_RNG`
- `update_freq::Int`: the frequency of updating the agents' `approximator`.
- `update_step::Int`, counts the steps.
- `mode::Bool`, used when learning; `true` means BestResponse (the rl_agent's output), `false` means AveragePolicy (the sl_agent's output).

`ReinforcementLearningZoo.NFSPAgentManager` — Type

`NFSPAgentManager(; agents::Dict{Any, NFSPAgent})`

A special MultiAgentManager in which all agents use the NFSP policy to play the game.

`ReinforcementLearningZoo.OutcomeSamplingMCCFRPolicy` — Type

`OutcomeSamplingMCCFRPolicy`

This implementation uses stochastically-weighted averaging.

Ref:

`ReinforcementLearningZoo.PLASLearner` — Method

`PLASLearner(;kwargs...)`

See paper: Latent Action Space for Offline Reinforcement Learning.

**Keyword arguments**

- `policy`, used to get the latent action.
- `target_policy`, similar to `policy`, but used to estimate the target.
- `qnetwork1`, used to get Q-values.
- `qnetwork2`, used to get Q-values.
- `target_qnetwork1`, used to estimate the target Q-values.
- `target_qnetwork2`, used to estimate the target Q-values.
- `vae`, used for mapping latent actions to actions. This can be implemented using a `VAE` in a `NeuralNetworkApproximator`.
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `λ::Float32 = 0.75f0`, used for Clipped Double Q-learning.
- `batch_size::Int = 32`
- `pretrain_step::Int = 1000`, the number of pre-training rounds.
- `update_freq::Int = 50`, the frequency of updating the `approximator`.
- `update_step::Int = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.PPOPolicy` — Type

`PPOPolicy(;kwargs)`

**Keyword arguments**

- `approximator`
- `γ = 0.99f0`
- `λ = 0.95f0`
- `clip_range = 0.2f0`
- `max_grad_norm = 0.5f0`
- `n_microbatches = 4`
- `n_epochs = 4`
- `actor_loss_weight = 1.0f0`
- `critic_loss_weight = 0.5f0`
- `entropy_loss_weight = 0.01f0`
- `dist = Categorical`
- `rng = Random.GLOBAL_RNG`

By default, `dist` is set to `Categorical`, which means the policy only works on environments with discrete actions. To work with continuous-action environments, set `dist` to `Normal` and make the `actor` in the `approximator` a `GaussianNetwork`. Used with a `GaussianNetwork`, it supports multi-dimensional action spaces, but only under the assumption that the dimensions are independent, since the `GaussianNetwork` outputs a single `μ` and `σ` per dimension to simplify the calculations.
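A sketch of a continuous-action PPO setup under these assumptions; `Normal` comes from Distributions, and `ns`/`na` plus all hyperparameters are illustrative:

```julia
using Flux
using Distributions: Normal
using ReinforcementLearning

ns, na = 3, 1  # hypothetical state and action dimensions

policy = PPOPolicy(
    approximator = ActorCritic(
        actor = GaussianNetwork(
            pre = Chain(Dense(ns, 64, relu)),
            μ = Chain(Dense(64, na)),
            logσ = Chain(Dense(64, na)),
        ),
        critic = Chain(Dense(ns, 64, relu), Dense(64, 1)),
        optimizer = ADAM(3e-4),
    ),
    dist = Normal,  # switch from the default Categorical
    γ = 0.99f0,
)
```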

`ReinforcementLearningZoo.PrioritizedDQNLearner` — Type

`PrioritizedDQNLearner(;kwargs...)`

See paper: Prioritized Experience Replay. See also https://danieltakeshi.github.io/2019/07/14/per/.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the Q-values of a state.
- `target_approximator`::`AbstractApproximator`: similar to `approximator`, but used to estimate the target (the next state).
- `loss_func`: the loss function.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `update_horizon::Int=1`: length of update ('n' in n-step update).
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `update_freq::Int=4`: the frequency of updating the `approximator`.
- `target_update_freq::Int=100`: the frequency of syncing `target_approximator`.
- `stack_size::Union{Int, Nothing}=4`: use the recent `stack_size` frames to form a stacked state.
- `default_priority::Float64=100.`: the default priority for newly added transitions.
- `rng = Random.GLOBAL_RNG`

Our implementation is slightly different from the original paper, but it should be aligned with the version in Dopamine.

`ReinforcementLearningZoo.PrioritizedDQNLearner` — Method

The state of the observation is assumed to have been stacked if `!isnothing(stack_size)`.

`ReinforcementLearningZoo.PrioritizedSweepingSamplingModel` — Type

`PrioritizedSweepingSamplingModel(θ::Float64=1e-4)`

See more details in Section 8.4, Page 168 of the book *Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.*

`ReinforcementLearningZoo.QRDQNLearner` — Method

`QRDQNLearner(;kwargs...)`

See paper: Distributional Reinforcement Learning with Quantile Regression.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the quantile values of a batch of states. The output should be of size `(n_quantile, n_action)`.
- `target_approximator`::`AbstractApproximator`: similar to `approximator`, but used to estimate the quantile values of the next state batch.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `update_horizon::Int=1`: length of update ('n' in n-step update).
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `update_freq::Int=1`: the frequency of updating the `approximator`.
- `n_quantile::Int=1`: the number of quantiles.
- `target_update_freq::Int=100`: the frequency of syncing `target_approximator`.
- `stack_size::Union{Int, Nothing}=4`: use the recent `stack_size` frames to form a stacked state.
- `traces = SARTS`, set to `SLARTSL` if you apply it to an environment of `FULL_ACTION_SET`.
- `loss_func = quantile_huber_loss`

`ReinforcementLearningZoo.REMDQNLearner` — Method

`REMDQNLearner(;kwargs...)`

See paper: An Optimistic Perspective on Offline Reinforcement Learning.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the Q-values of a state.
- `target_approximator`::`AbstractApproximator`: similar to `approximator`, but used to estimate the target (the next state).
- `loss_func`: the loss function.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `update_horizon::Int=1`: length of update ('n' in n-step update).
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `update_freq::Int=4`: the frequency of updating the `approximator`.
- `ensemble_num::Int=1`: the number of ensemble approximators.
- `ensemble_method::Symbol=:rand`: the method of combining Q values. `:rand` represents a random ensemble mixture, and `:mean` is the average.
- `target_update_freq::Int=100`: the frequency of syncing `target_approximator`.
- `stack_size::Union{Int, Nothing}=4`: use the recent `stack_size` frames to form a stacked state.
- `traces = SARTS`, set to `SLARTSL` if you apply it to an environment of `FULL_ACTION_SET`.
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.RainbowLearner` — Type

`RainbowLearner(;kwargs...)`

See paper: Rainbow: Combining Improvements in Deep Reinforcement Learning.

**Keywords**

- `approximator`::`AbstractApproximator`: used to get the Q-values of a state.
- `target_approximator`::`AbstractApproximator`: similar to `approximator`, but used to estimate the target (the next state).
- `loss_func`: the loss function. It is recommended to use `Flux.Losses.logitcrossentropy`; `Flux.Losses.crossentropy` can run into problems with negative numbers.
- `Vₘₐₓ::Float32`: the maximum value of the distribution.
- `Vₘᵢₙ::Float32`: the minimum value of the distribution.
- `n_actions::Int`: number of possible actions.
- `γ::Float32=0.99f0`: discount rate.
- `batch_size::Int=32`
- `update_horizon::Int=1`: length of update ('n' in n-step update).
- `min_replay_history::Int=32`: number of transitions that should be experienced before updating the `approximator`.
- `update_freq::Int=4`: the frequency of updating the `approximator`.
- `target_update_freq::Int=500`: the frequency of syncing `target_approximator`.
- `stack_size::Union{Int, Nothing}=4`: use the recent `stack_size` frames to form a stacked state.
- `default_priority::Float32=1.0f2`: the default priority for newly added transitions. It must be `>= 1`.
- `n_atoms::Int=51`: the number of buckets of the value function distribution.
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.SACPolicy` — Method

`SACPolicy(;kwargs...)`

**Keyword arguments**

- `policy`, used to get the action.
- `qnetwork1`, used to get Q-values.
- `qnetwork2`, used to get Q-values.
- `target_qnetwork1`, used to estimate the target Q-values.
- `target_qnetwork2`, used to estimate the target Q-values.
- `start_policy`
- `γ::Float32 = 0.99f0`, reward discount rate.
- `τ::Float32 = 0.005f0`, the speed at which the target network is updated.
- `α::Float32 = 0.2f0`, entropy term.
- `batch_size = 32`
- `start_steps = 10000`
- `update_after = 1000`
- `update_freq = 50`
- `automatic_entropy_tuning::Bool = false`, whether to automatically tune the entropy.
- `lr_alpha::Float32 = 0.003f0`, learning rate for tuning the entropy.
- `action_dims = 0`, the dimensionality of the action; required when `automatic_entropy_tuning = true`.
- `update_step = 0`
- `rng = Random.GLOBAL_RNG`

`policy` is expected to output a tuple `(μ, logσ)` of means and log standard deviations for the desired action distributions. This can be implemented using a `GaussianNetwork` in a `NeuralNetworkApproximator`.

Implemented based on http://arxiv.org/abs/1812.05905
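A minimal sketch of such a `policy`, assuming hypothetical dimensions `ns`/`na`:

```julia
using Flux
using ReinforcementLearning

ns, na = 3, 1  # hypothetical state and action dimensions

# outputs (μ, logσ) as required above
policy = NeuralNetworkApproximator(
    model = GaussianNetwork(
        pre = Chain(Dense(ns, 64, relu)),
        μ = Chain(Dense(64, na)),
        logσ = Chain(Dense(64, na)),
    ),
    optimizer = ADAM(3e-4),
)
```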

`ReinforcementLearningZoo.TD3Policy` — Method

`TD3Policy(;kwargs...)`

**Keyword arguments**

- `behavior_actor`
- `behavior_critic`
- `target_actor`
- `target_critic`
- `start_policy`
- `γ = 0.99f0`
- `ρ = 0.995f0`
- `batch_size = 32`
- `start_steps = 10000`
- `update_after = 1000`
- `update_freq = 50`
- `policy_freq = 2`, the frequency at which the actor performs a gradient update and the critic target is updated.
- `target_act_limit = 1.0`, clipping limit for the noisy target action.
- `target_act_noise = 0.1`, noise added to the actor target.
- `act_limit = 1.0`, clipping limit applied when outputting an action.
- `act_noise = 0.1`, noise added when outputting an action.
- `update_step = 0`
- `rng = Random.GLOBAL_RNG`

`ReinforcementLearningZoo.TabularCFRPolicy` — Method

`TabularCFRPolicy(;kwargs...)`

Some useful papers while implementing this algorithm:

- An Introduction to Counterfactual Regret Minimization
- Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games
- Solving Large Imperfect Information Games Using CFR⁺
- Revisiting CFR⁺ and Alternating Updates
- Solving Imperfect-Information Games via Discounted Regret Minimization

**Keyword Arguments**

- `is_alternating_update=true`: if `true`, the players are updated alternately.
- `is_reset_neg_regrets=true`: whether to use **regret matching⁺**.
- `is_linear_averaging=true`
- `weighted_averaging_delay=0`: the averaging delay in number of iterations. Only valid when `is_linear_averaging` is set to `true`.
- `state_type=String`: the data type of the information set.
- `rng=Random.GLOBAL_RNG`

`ReinforcementLearningZoo.TabularPolicy` — Type

`TabularPolicy(table=Dict{Int,Int}(), n_action=nothing)`

A `Dict` is used internally to store the mapping from state to action. `n_action` is required if you want to calculate the action probabilities of the `TabularPolicy` given a state (`prob(p::TabularPolicy, s)`).
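A sketch assuming the keyword constructor shown above, with a hypothetical two-state, two-action table:

```julia
using ReinforcementLearning

p = TabularPolicy(table = Dict(1 => 2, 2 => 1), n_action = 2)

prob(p, 1)  # action distribution at state 1: all mass on action 2
```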

`ReinforcementLearningZoo.TimeBasedSamplingModel` — Type

`TimeBasedSamplingModel(n_actions::Int, κ::Float64 = 1e-4)`

`ReinforcementLearningZoo.VMPOPolicy` — Type

`VMPOPolicy(;kwargs)`

V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function.

**Keyword arguments**

- `approximator`: an `ActorCritic` based on `NeuralNetworkApproximator`.
- `update_freq`: update the policy every n time steps.
- `γ = 0.99f0`: discount factor.
- `ϵ_η = 0.02f0`: temperature η hyperparameter.
- `ϵ_α = 0.1f0`: Lagrange multiplier α (discrete) hyperparameter.
- `ϵ_αμ = 0.005f0`: Lagrange multiplier α_μ (continuous) hyperparameter.
- `ϵ_ασ = 0.00005f0`: Lagrange multiplier α_σ (continuous) hyperparameter.
- `n_epochs = 8`: update the policy for n epochs.
- `dist = Categorical`: `Categorical` for discrete actions, `Normal` for continuous ones.
- `rng = Random.GLOBAL_RNG`

By default, `dist` is set to `Categorical`, which means the policy only works on environments with discrete actions. To work with continuous-action environments, set `dist` to `Normal` and make the `actor` in the `approximator` a `GaussianNetwork`. This algorithm only supports one-dimensional action spaces for now.

**Ref paper**

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

`ReinforcementLearningZoo.VPGPolicy` — Type

Vanilla Policy Gradient

`VPGPolicy(;kwargs)`

**Keyword arguments**

- `approximator`
- `baseline`
- `dist`, the distribution function of the action.
- `γ`, discount factor.
- `α_θ`, step size of the policy parameters.
- `α_w`, step size of the baseline parameters.
- `batch_size`
- `rng`
- `loss`
- `baseline_loss`

If the action space is continuous, the env should transform the action value (for example with tanh) to ensure low ≤ value ≤ high, as in the sketch below.
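A minimal sketch of such a transformation, with hypothetical bounds `low`/`high`:

```julia
# Squash a raw action `a` with tanh into (-1, 1), then rescale to (low, high).
rescale_action(a, low, high) = low + (tanh(a) + 1) * (high - low) / 2

rescale_action(0.0, -2.0, 2.0)  # == 0.0, the midpoint of the range
```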

`ReinforcementLearningBase.update!` — Method

Empty the trajectory at the end of an episode.

`ReinforcementLearningBase.update!` — Method

Run one iteration.

`ReinforcementLearningBase.update!` — Method

Update Π (the policy network).

`ReinforcementLearningBase.update!` — Method

Run one iteration.

`ReinforcementLearningBase.update!` — Method

Run one iteration.

`ReinforcementLearningBase.update!` — Method

Run one iteration.

`ReinforcementLearningBase.update!` — Method

Update the `behavior_policy`.

`ReinforcementLearningBase.update!` — Method

Only update at the end of an episode.

`ReinforcementLearningCore._run` — Function

Many policy-gradient-based algorithms require that the `env` be a `MultiThreadEnv` to increase diversity during training, so the training pipeline differs from the default one in `RLCore`.

`ReinforcementLearningZoo.calculate_CQL_loss` — Method

`calculate_CQL_loss(q_value, action; method)`

See paper: Conservative Q-Learning for Offline Reinforcement Learning.

`ReinforcementLearningZoo.cfr!` — Function

Symbol meanings:

- `π`: reach probability
- `π′`: new reach probability
- `π₋ᵢ`: the opponents' reach probability
- `p`: the player to update; `nothing` means a simultaneous update
- `w`: weight
- `v`: counterfactual value **before being weighted by the opponent's reach probability**
- `V`: a vector containing the `v` after taking each action at the current information set; used to calculate the **regret value**

`ReinforcementLearningZoo.external_sampling!` — Method

CFR traversal with external sampling.

`ReinforcementLearningZoo.gen_JuliaRL_dataset` — Method

`gen_JuliaRL_dataset(alg::Symbol, env::Symbol, type::AbstractString; dataset_size)`

Generate a dataset from the trajectory obtained by running the experiment (`alg` + `env`). `type` represents the method of collecting the data; possible values: random/medium/expert. `dataset_size` is the size of the generated dataset.
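For example, a hypothetical call (the exact algorithm/environment symbols supported depend on the registered experiments):

```julia
# 10_000 transitions collected by a medium-quality policy; the
# (:SAC, :Pendulum) pair is an assumed, illustrative combination
dataset = gen_JuliaRL_dataset(:SAC, :Pendulum, "medium"; dataset_size = 10_000)
```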

`ReinforcementLearningZoo.masked_regret_matching` — Method

This is the specific regret-matching method used in DeepCFR.

`ReinforcementLearningZoo.policy_evaluation!` — Method

`policy_evaluation!(;V, π, model, γ, θ)`

**Keyword arguments**

- `V`, an `AbstractApproximator`.
- `π`, an `AbstractPolicy`.
- `model`, a distribution-based environment model (given a state-action pair, it returns all possible rewards, next states, termination info and the corresponding probabilities).
- `γ::Float64`, discount rate.
- `θ::Float64`, threshold to stop evaluation.

`ReinforcementLearningZoo.update_advantage_networks` — Method

Update the advantage networks.