ReinforcementLearningZoo.jl

ReinforcementLearningZoo.A2CLearnerType
A2CLearner(;kwargs...)

Keyword arguments

  • approximator::ActorCritic
  • γ::Float32, reward discount rate.
  • actor_loss_weight::Float32
  • critic_loss_weight::Float32
  • entropy_loss_weight::Float32
  • update_freq::Int, usually set to the same value as the length of the trajectory.
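
A minimal construction sketch is shown below. The network sizes, optimizer, and the ActorCritic wiring are illustrative assumptions, not part of this docstring.

using ReinforcementLearning   # assumed to re-export A2CLearner and ActorCritic
using Flux

ns, na = 4, 2   # hypothetical state and action dimensions

learner = A2CLearner(
    approximator = ActorCritic(
        actor = Chain(Dense(ns, 64, relu), Dense(64, na)),
        critic = Chain(Dense(ns, 64, relu), Dense(64, 1)),
        optimizer = ADAM(1e-3),
    ),
    γ = 0.99f0,
    actor_loss_weight = 1.0f0,
    critic_loss_weight = 0.5f0,
    entropy_loss_weight = 0.01f0,
    update_freq = 10,
)
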
source
ReinforcementLearningZoo.BCQDLearnerType
BCQDLearner(;kwargs)

See paper: Benchmarking Batch Deep Reinforcement Learning Algorithms.

Keyword arguments

  • approximator::ActorCritic: used to get Q-values (Critic) and logits (Actor) of a state.
  • target_approximator::ActorCritic: similar to approximator, but used to estimate the target.
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • θ::Float32 = 0.99f0, regularization coefficient.
  • threshold::Float32 = 0.3f0, determines whether an action can be used to calculate the Q value.
  • batch_size::Int=32
  • update_freq::Int: the frequency of updating the approximator.
  • update_step::Int = 0
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.BCQLearnerMethod
BCQLearner(;kwargs...)

See Off-Policy Deep Reinforcement Learning without Exploration

Keyword arguments

  • policy, used to get action with perturbation. This can be implemented using a PerturbationNetwork in a NeuralNetworkApproximator.
  • target_policy, similar to policy, but used to estimate the target. This can be implemented using a PerturbationNetwork in a NeuralNetworkApproximator.
  • qnetwork1, used to get Q-values.
  • qnetwork2, used to get Q-values.
  • target_qnetwork1, used to estimate the target Q-values.
  • target_qnetwork2, used to estimate the target Q-values.
  • vae, used for sampling actions. This can be implemented using a VAE in a NeuralNetworkApproximator.
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • λ::Float32 = 0.75f0, used for Clipped Double Q-learning.
  • p::Int = 10, the number of state-action pairs used when calculating the Q value.
  • batch_size::Int = 32
  • start_step::Int = 1000
  • update_freq::Int = 50, the frequency of updating the approximator.
  • update_step::Int = 0
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.BEARLearnerMethod
BEARLearner(;kwargs...)

See Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. This implementation refers to the official Python code.

Keyword arguments

  • policy, used to get latent action.
  • target_policy, similar to policy, but used to estimate the target.
  • qnetwork1, used to get Q-values.
  • qnetwork2, used to get Q-values.
  • target_qnetwork1, used to estimate the target Q-values.
  • target_qnetwork2, used to estimate the target Q-values.
  • vae, used for sampling actions to calculate the MMD loss. This can be implemented using a VAE in a NeuralNetworkApproximator.
  • log_α, lagrange multiplier implemented by a NeuralNetworkApproximator.
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • λ::Float32 = 0.75f0, used for Clipped Double Q-learning.
  • ε::Float32 = 0.05f0, threshold of MMD loss.
  • p::Int = 10, the number of state-action pairs used when calculating the Q value.
  • max_log_α::Float32 = 10.0f0, maximum value of log_α.
  • sample_num::Int = 10, the number of sampled actions used to calculate the MMD loss.
  • kernel_type::Symbol = :laplacian, the method of calculating MMD loss. Possible values: :laplacian/:gaussian.
  • mmd_σ::Float32 = 10.0f0, the parameter used for calculating MMD loss.
  • batch_size::Int = 32
  • update_freq::Int = 50, the frequency of updating the approximator.
  • update_step::Int = 0
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.BasicDQNLearnerType
BasicDQNLearner(;kwargs...)

See paper: Playing Atari with Deep Reinforcement Learning

This is the most basic implementation of DQN. Compared to traditional Q-learning, the only difference is that in the update step it uses a batch of transitions sampled from an experience buffer instead of the current transition. The approximator is usually a NeuralNetworkApproximator. You can start from this implementation to understand how everything is organized and how to write your own customized algorithm.

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • loss_func: the loss function to use.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • rng=Random.GLOBAL_RNG
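
As a sketch of how the pieces fit together, the learner is typically wrapped in a QBasedPolicy with an explorer; the network shape, optimizer, and explorer settings below are assumptions.

using ReinforcementLearning   # assumed to re-export the types used below
using Flux

ns, na = 4, 2   # hypothetical state and action dimensions

policy = QBasedPolicy(
    learner = BasicDQNLearner(
        approximator = NeuralNetworkApproximator(
            model = Chain(Dense(ns, 128, relu), Dense(128, na)),
            optimizer = ADAM(),
        ),
        loss_func = Flux.Losses.huber_loss,
        γ = 0.99f0,
        batch_size = 32,
        min_replay_history = 100,
    ),
    explorer = EpsilonGreedyExplorer(ϵ_stable = 0.01),
)
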
source
ReinforcementLearningZoo.BehaviorCloningPolicyMethod
BehaviorCloningPolicy(;kw...)

Keyword Arguments

  • approximator: calculates the logits of possible actions directly.
  • explorer=GreedyExplorer()
  • batch_size::Int = 32
  • min_reservoir_history::Int = 100, number of transitions that should be experienced before updating the approximator.
  • rng = Random.GLOBAL_RNG
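
For illustration only (the network and its sizes are assumptions), a policy whose approximator outputs action logits could be constructed as:

using ReinforcementLearning   # assumed to re-export the types used below
using Flux

ns, na = 4, 2   # hypothetical state and action dimensions

bc = BehaviorCloningPolicy(
    approximator = NeuralNetworkApproximator(
        model = Chain(Dense(ns, 64, relu), Dense(64, na)),   # outputs action logits
        optimizer = ADAM(),
    ),
    explorer = GreedyExplorer(),
    batch_size = 32,
    min_reservoir_history = 100,
)
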
source
ReinforcementLearningZoo.BestResponsePolicyMethod
BestResponsePolicy(policy, env, best_responder)
  • policy, the original policy to be wrapped in the best response policy.
  • env, the environment to handle.
  • best_responder, the player who takes the best-response action.
source
ReinforcementLearningZoo.CRRLearnerType
CRRLearner(;kwargs)

See paper: Critic Regularized Regression.

Keyword arguments

  • approximator::ActorCritic: used to get Q-values (Critic) and logits (Actor) of a state.
  • target_approximator::ActorCritic: similar to approximator, but used to estimate the target.
  • γ::Float32, reward discount rate.
  • batch_size::Int=32
  • policy_improvement_mode::Symbol=:exp, type of the weight function f. Possible values: :binary/:exp.
  • ratio_upper_bound::Float32, when policy_improvement_mode is ":exp", the value of the exp function is upper-bounded by this parameter.
  • beta::Float32, when policy_improvement_mode is ":exp", this is the denominator of the exp function.
  • advantage_estimator::Symbol=:mean, type of the advantage estimate \hat{A}. Possible values: :mean/:max.
  • m::Int=4, when continuous=true, sample m actions to estimate \hat{A}.
  • update_freq::Int: the frequency of updating the approximator.
  • update_step::Int=0
  • target_update_freq::Int: the frequency of syncing target_approximator.
  • continuous::Bool: type of action space.
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.DDPGPolicyMethod
DDPGPolicy(;kwargs...)

Keyword arguments

  • behavior_actor,
  • behavior_critic,
  • target_actor,
  • target_critic,
  • start_policy,
  • γ = 0.99f0,
  • ρ = 0.995f0,
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_freq = 50,
  • act_limit = 1.0,
  • act_noise = 0.1,
  • update_step = 0,
  • rng = Random.GLOBAL_RNG,
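
A construction sketch under assumed network shapes; the NeuralNetworkApproximator models, the RandomPolicy start policy, and all sizes are illustrative assumptions.

using ReinforcementLearning   # assumed to re-export the types used below
using Flux
using IntervalSets            # assumed, for the `..` action-range syntax below

ns, na = 3, 1   # hypothetical state and action dimensions
create_actor() = Chain(Dense(ns, 64, relu), Dense(64, na, tanh))
create_critic() = Chain(Dense(ns + na, 64, relu), Dense(64, 1))

p = DDPGPolicy(
    behavior_actor = NeuralNetworkApproximator(model = create_actor(), optimizer = ADAM()),
    behavior_critic = NeuralNetworkApproximator(model = create_critic(), optimizer = ADAM()),
    target_actor = NeuralNetworkApproximator(model = create_actor()),
    target_critic = NeuralNetworkApproximator(model = create_critic()),
    start_policy = RandomPolicy(-1.0..1.0),   # used for the first start_steps steps
    γ = 0.99f0,
    ρ = 0.995f0,
    batch_size = 64,
    start_steps = 1000,
    update_after = 1000,
    update_freq = 1,
    act_limit = 1.0,
    act_noise = 0.1,
)
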
source
ReinforcementLearningZoo.DQNLearnerMethod
DQNLearner(;kwargs...)

See paper: Human-level control through deep reinforcement learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS: set to SLARTSL if you are to apply to an environment of FULL_ACTION_SET.
  • rng = Random.GLOBAL_RNG
  • is_enable_double_DQN::Bool=true: enable Double DQN; enabled by default.
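
A minimal sketch, assuming Flux dense networks, no frame stacking, and placeholder sizes:

using ReinforcementLearning   # assumed to re-export the types used below
using Flux

ns, na = 4, 2   # hypothetical state and action dimensions
create_model() = Chain(Dense(ns, 128, relu), Dense(128, na))

learner = DQNLearner(
    approximator = NeuralNetworkApproximator(model = create_model(), optimizer = ADAM()),
    target_approximator = NeuralNetworkApproximator(model = create_model()),
    loss_func = Flux.Losses.huber_loss,
    γ = 0.99f0,
    batch_size = 32,
    update_horizon = 1,
    min_replay_history = 100,
    update_freq = 4,
    target_update_freq = 100,
    stack_size = nothing,   # no frame stacking for a vector-state environment
)
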
source
ReinforcementLearningZoo.DeepCFRType
DeepCFR(;kwargs...)

Symbols used here follow the paper: Deep Counterfactual Regret Minimization

Keyword arguments

  • K, the number of traversals.
  • t, the number of iterations.
  • Π, the policy network.
  • V, a dictionary of each player's advantage network.
  • MΠ, a strategy memory.
  • MV, a dictionary of each player's advantage memory.
  • reinitialize_freq=1, the frequency of re-initializing the value networks.
source
ReinforcementLearningZoo.DoubleLearnerType
DoubleLearner(;L1, L2, rng=Random.GLOBAL_RNG)

This is a meta-learner; it randomly selects one learner and updates the other. The estimate of an observation is the sum of the results from the two learners.

source
ReinforcementLearningZoo.EDPolicyType
EDPolicy

The Exploitability Descent (ED) algorithm directly updates the player's policy against a worst-case opponent (best response) in a two-player zero-sum game. On each iteration, the ED algorithm performs the following update for each player:

  1. Construct a (deterministic) best response policy of the opponent to the current policy;
  2. Compute the value of every action in every state when playing the current policy against the best response;
  3. Update the player's current policy to perform better against the opponent's best response via a policy-gradient update.

Keyword Arguments

  • opponent::Any, the opponent's name.
  • learner::NeuralNetworkApproximator, used to get the value of each action.
  • explorer::AbstractExplorer

Ref

Computing Approximate Equilibria in Sequential Adversarial Games by Exploitability Descent

source
ReinforcementLearningZoo.FisherBRCLearnerMethod
FisherBRCLearner(;kwargs...)

See paper: Offline reinforcement learning with fisher divergence critic regularization.

Keyword arguments

  • policy, used to get action.
  • behavior_policy::EntropyBC, used to estimate log μ(a|s).
  • qnetwork1, used to get Q-values.
  • qnetwork2, used to get Q-values.
  • target_qnetwork1, used to estimate the target Q-values.
  • target_qnetwork2, used to estimate the target Q-values.
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • α::Float32 = 0.0f0, entropy term.
  • f_reg::Float32 = 1.0f0, the weight of gradient penalty regularizer.
  • reward_bonus::Float32 = 5.0f0, add extra value to the reward.
  • batch_size::Int = 32
  • pretrain_step::Int = 1000, the number of pre-training rounds.
  • update_freq::Int = 50, the frequency of updating the approximator.
  • lr_alpha::Float32 = 0.003f0, learning rate of tuning entropy.
  • action_dims::Int = 0, the dimensionality of the action.
  • update_step::Int = 0
  • rng = Random.GLOBAL_RNG

policy is expected to output a tuple (μ, logσ) of means and log standard deviations for the desired action distributions. This can be implemented using a GaussianNetwork in a NeuralNetworkApproximator.

source
ReinforcementLearningZoo.IQNLearnerType
IQNLearner(;kwargs)

See paper: Implicit Quantile Networks for Distributional Reinforcement Learning

Keyword arguments

  • approximator, an ImplicitQuantileNet
  • target_approximator, an ImplicitQuantileNet; must have the same structure as approximator
  • κ = 1.0f0,
  • N = 32,
  • N′ = 32,
  • Nₑₘ = 64,
  • K = 32,
  • γ = 0.99f0,
  • stack_size = 4,
  • batch_size = 32,
  • update_horizon = 1,
  • min_replay_history = 20000,
  • update_freq = 4,
  • target_update_freq = 8000,
  • update_step = 0,
  • default_priority = 1.0f2,
  • β_priority = 0.5f0,
  • rng = Random.GLOBAL_RNG,
  • device_seed = nothing,
source
ReinforcementLearningZoo.MADDPGManagerType
MADDPGManager(; agents::Dict{<:Any, <:Agent}, args...)

Multi-agent Deep Deterministic Policy Gradient (MADDPG) implemented in Julia. By default, MADDPGManager is intended for simultaneous environments with continuous action spaces. See the paper https://arxiv.org/abs/1706.02275 for more details.

Keyword arguments

  • agents::Dict{<:Any, <:Agent}, here each agent collects its own information. When updating the policy, each critic assembles all agents' trajectories to update its own network. Note that the policy of each Agent should be a DDPGPolicy wrapped in a NamedPolicy; see the related experiments (MADDPG_KuhnPoker or MADDPG_SpeakerListener) for reference.
  • traces, set to SARTS if you apply it to an environment of MINIMAL_ACTION_SET, or SLARTSL if you apply it to an environment of FULL_ACTION_SET.
  • batch_size::Int
  • update_freq::Int
  • update_step::Int, the step counter.
  • rng::AbstractRNG.
source
ReinforcementLearningZoo.MinimaxPolicyType
MinimaxPolicy(;value_function, depth::Int)

The minimax algorithm with Alpha-beta pruning

Keyword Arguments

  • maximum_depth::Int=30, the maximum depth of search.
  • value_function=nothing, estimates the value of env via value_function(env) -> Number. It is only called after searching to maximum_depth while the env is not yet terminated.
source
ReinforcementLearningZoo.MonteCarloLearnerType
MonteCarloLearner(;kwargs...)

Use the Monte Carlo method to estimate the state value or state-action value.

Fields

  • approximator::TabularApproximator, can be either TabularVApproximator or TabularQApproximator.
  • γ=1.0, discount rate.
  • kind=FIRST_VISIT. Optional values are FIRST_VISIT or EVERY_VISIT.
  • sampling=NO_SAMPLING. Optional values are NO_SAMPLING, WEIGHTED_IMPORTANCE_SAMPLING or ORDINARY_IMPORTANCE_SAMPLING.
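
A small tabular sketch; the TabularVApproximator keyword and the state count are illustrative assumptions.

using ReinforcementLearning   # assumed to re-export the types used below

# Estimate state values of a hypothetical environment with 10 states.
learner = MonteCarloLearner(
    approximator = TabularVApproximator(n_state = 10),   # assumed RLCore constructor
    γ = 1.0,
    kind = FIRST_VISIT,
    sampling = NO_SAMPLING,
)
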
source
ReinforcementLearningZoo.MultiThreadEnvType
MultiThreadEnv(envs::Vector{<:AbstractEnv})

Wrap multiple instances of the same environment type into one environment. Each environment will run in parallel by leveraging Threads.@spawn. So remember to set the environment variable JULIA_NUM_THREADS!
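
For example (CartPoleEnv from ReinforcementLearningEnvironments is used here purely for illustration):

using ReinforcementLearning             # assumed to re-export MultiThreadEnv
using ReinforcementLearningEnvironments # assumed source of CartPoleEnv

# Start Julia with e.g. JULIA_NUM_THREADS=4 so the wrapped environments
# can actually step in parallel.
env = MultiThreadEnv([CartPoleEnv() for _ in 1:8])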

source
ReinforcementLearningZoo.NFSPAgentType
NFSPAgent(; rl_agent::Agent, sl_agent::Agent, args...)

Neural Fictitious Self-Play (NFSP) agent implemented in Julia. See the paper https://arxiv.org/abs/1603.01121 for more details.

Keyword arguments

  • rl_agent::Agent, the Reinforcement Learning (RL) agent (by default a QBasedPolicy, e.g. with a DQN learner), which works to search the best response through the self-play process.
  • sl_agent::Agent, the Supervised Learning (SL) agent (e.g. a BehaviorCloningPolicy), which works to learn the best response from the rl_agent's policy.
  • η, the anticipatory parameter, i.e. the probability of using the ϵ-greedy (Q) policy when training the agent.
  • rng=Random.GLOBAL_RNG.
  • update_freq::Int: the frequency of updating the agents' approximators.
  • update_step::Int, the step counter.
  • mode::Bool, used when learning; true means BestResponse (the rl_agent's output), false means AveragePolicy (the sl_agent's output).
source
ReinforcementLearningZoo.PLASLearnerMethod
PLASLearner(;kwargs...)

See Latent Action Space for Offline Reinforcement Learning

Keyword arguments

  • policy, used to get latent action.
  • target_policy, similar to policy, but used to estimate the target.
  • qnetwork1, used to get Q-values.
  • qnetwork2, used to get Q-values.
  • target_qnetwork1, used to estimate the target Q-values.
  • target_qnetwork2, used to estimate the target Q-values.
  • vae, used for mapping latent actions to actions. This can be implemented using a VAE in a NeuralNetworkApproximator.
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • λ::Float32 = 0.75f0, used for Clipped Double Q-learning.
  • batch_size::Int = 32
  • pretrain_step::Int = 1000, the number of pre-training rounds.
  • update_freq::Int = 50, the frequency of updating the approximator.
  • update_step::Int = 0
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.PPOPolicyType
PPOPolicy(;kwargs)

Keyword arguments

  • approximator,
  • γ = 0.99f0,
  • λ = 0.95f0,
  • clip_range = 0.2f0,
  • max_grad_norm = 0.5f0,
  • n_microbatches = 4,
  • n_epochs = 4,
  • actor_loss_weight = 1.0f0,
  • critic_loss_weight = 0.5f0,
  • entropy_loss_weight = 0.01f0,
  • dist = Categorical,
  • rng = Random.GLOBAL_RNG,

By default, dist is set to Categorical, which means it only works on environments with discrete actions. To work with environments with continuous actions, dist should be set to Normal and the actor in the approximator should be a GaussianNetwork. Using it with a GaussianNetwork supports multi-dimensional action spaces, but only under the assumption that the dimensions are independent, since the GaussianNetwork outputs a single μ and σ per dimension to simplify the calculations.
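
A sketch of the continuous-action setup described above; the GaussianNetwork layout, network sizes, and optimizer are assumptions.

using ReinforcementLearning   # assumed to re-export the types used below
using Flux
using Distributions: Normal   # assumed source of the Normal distribution type

ns, na = 3, 1   # hypothetical state and action dimensions

policy = PPOPolicy(
    approximator = ActorCritic(
        actor = GaussianNetwork(
            pre = Dense(ns, 64, relu),
            μ = Dense(64, na),
            logσ = Dense(64, na),
        ),
        critic = Chain(Dense(ns, 64, relu), Dense(64, 1)),
        optimizer = ADAM(3e-4),
    ),
    γ = 0.99f0,
    λ = 0.95f0,
    clip_range = 0.2f0,
    n_epochs = 4,
    dist = Normal,   # continuous actions; the default Categorical is for discrete actions
)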

source
ReinforcementLearningZoo.PrioritizedDQNLearnerType
PrioritizedDQNLearner(;kwargs...)

See paper: Prioritized Experience Replay, and also https://danieltakeshi.github.io/2019/07/14/per/

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • default_priority::Float64=100.: the default priority for newly added transitions.
  • rng = Random.GLOBAL_RNG
Note

Our implementation is slightly different from the original paper, but it should be aligned with the version in Dopamine.

source
ReinforcementLearningZoo.QRDQNLearnerMethod
QRDQNLearner(;kwargs...)

See paper: Distributional Reinforcement Learning with Quantile Regression

Keywords

  • approximator::AbstractApproximator: used to get quantile-values of a batch of states. The output should be of size (n_quantile, n_action).
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the quantile values of the next state batch.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=1: the frequency of updating the approximator.
  • n_quantile::Int=1: the number of quantiles.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS, set to SLARTSL if you are to apply to an environment of FULL_ACTION_SET.
  • loss_func=quantile_huber_loss.
source
ReinforcementLearningZoo.REMDQNLearnerMethod
REMDQNLearner(;kwargs...)

See paper: An Optimistic Perspective on Offline Reinforcement Learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • ensemble_num::Int=1: the number of ensemble approximators.
  • ensemble_method::Symbol=:rand: the method of combining Q values. ':rand' represents random ensemble mixture, and ':mean' is the average.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS, set to SLARTSL if you are to apply to an environment of FULL_ACTION_SET.
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.RainbowLearnerType
RainbowLearner(;kwargs...)

See paper: Rainbow: Combining Improvements in Deep Reinforcement Learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function. Flux.Losses.logitcrossentropy is recommended; Flux.Losses.crossentropy can run into problems with negative numbers.
  • Vₘₐₓ::Float32: the maximum value of distribution.
  • Vₘᵢₙ::Float32: the minimum value of distribution.
  • n_actions::Int: number of possible actions.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=500: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • default_priority::Float32=1.0f2: the default priority for newly added transitions. It must be >= 1.
  • n_atoms::Int=51: the number of buckets of the value function distribution.
  • rng = Random.GLOBAL_RNG
source
ReinforcementLearningZoo.SACPolicyMethod
SACPolicy(;kwargs...)

Keyword arguments

  • policy, used to get action.
  • qnetwork1, used to get Q-values.
  • qnetwork2, used to get Q-values.
  • target_qnetwork1, used to estimate the target Q-values.
  • target_qnetwork2, used to estimate the target Q-values.
  • start_policy,
  • γ::Float32 = 0.99f0, reward discount rate.
  • τ::Float32 = 0.005f0, the speed at which the target network is updated.
  • α::Float32 = 0.2f0, entropy term.
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_freq = 50,
  • automatic_entropy_tuning::Bool = false, whether to automatically tune the entropy.
  • lr_alpha::Float32 = 0.003f0, learning rate of tuning entropy.
  • action_dims = 0, the dimensionality of the action; required when automatic_entropy_tuning = true.
  • update_step = 0,
  • rng = Random.GLOBAL_RNG,

policy is expected to output a tuple (μ, logσ) of means and log standard deviations for the desired action distributions. This can be implemented using a GaussianNetwork in a NeuralNetworkApproximator.

Implemented based on http://arxiv.org/abs/1812.05905
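
A construction sketch under assumed network shapes; the GaussianNetwork actor follows the (μ, logσ) convention above, and all sizes, optimizers, and the start policy are illustrative assumptions.

using ReinforcementLearning   # assumed to re-export the types used below
using Flux
using IntervalSets            # assumed, for the `..` action-range syntax below

ns, na = 3, 1   # hypothetical state and action dimensions
create_q() = NeuralNetworkApproximator(
    model = Chain(Dense(ns + na, 64, relu), Dense(64, 1)),
    optimizer = ADAM(),
)

p = SACPolicy(
    policy = NeuralNetworkApproximator(
        model = GaussianNetwork(
            pre = Dense(ns, 64, relu),
            μ = Dense(64, na),
            logσ = Dense(64, na),
        ),
        optimizer = ADAM(),
    ),
    qnetwork1 = create_q(),
    qnetwork2 = create_q(),
    target_qnetwork1 = create_q(),
    target_qnetwork2 = create_q(),
    start_policy = RandomPolicy(-1.0..1.0),
    automatic_entropy_tuning = true,
    action_dims = na,   # required when automatic_entropy_tuning = true
)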

source
ReinforcementLearningZoo.TD3PolicyMethod
TD3Policy(;kwargs...)

Keyword arguments

  • behavior_actor,
  • behavior_critic,
  • target_actor,
  • target_critic,
  • start_policy,
  • γ = 0.99f0,
  • ρ = 0.995f0,
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_freq = 50,
  • policy_freq = 2, # the frequency at which the actor performs a gradient update and the critic target is updated
  • target_act_limit = 1.0, # the limit on the noise added to the target action
  • target_act_noise = 0.1, # the standard deviation of the noise added to the target action
  • act_limit = 1.0, # the action limit applied when outputting an action
  • act_noise = 0.1, # the noise added when outputting an action
  • update_step = 0,
  • rng = Random.GLOBAL_RNG,
source
ReinforcementLearningZoo.TabularCFRPolicyMethod
TabularCFRPolicy(;kwargs...)

Several useful papers were referenced while implementing this algorithm.

Keyword Arguments

  • is_alternating_update=true: If true, we update the players alternatively.
  • is_reset_neg_regrets=true: Whether to use regret matching⁺.
  • is_linear_averaging=true
  • weighted_averaging_delay=0. The averaging delay in number of iterations. Only valid when is_linear_averaging is set to true.
  • state_type=String, the data type of information set.
  • rng=Random.GLOBAL_RNG
source
ReinforcementLearningZoo.TabularPolicyType
TabularPolicy(table=Dict{Int,Int}(),n_action=nothing)

A Dict is used internally to store the mapping from state to action. n_action is required if you want to calculate the probability of the TabularPolicy given a state (prob(p::TabularPolicy, s)).
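
For example (assuming keyword construction as suggested by the signature above):

using ReinforcementLearning   # assumed to re-export TabularPolicy and prob

p = TabularPolicy(table = Dict(1 => 2, 2 => 1), n_action = 3)
prob(p, 1)   # probability of each of the 3 actions in state 1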

source
ReinforcementLearningZoo.VMPOPolicyType
VMPOPolicy(;kwargs)

V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function.

Keyword arguments

  • approximator: an ActorCritic based on NeuralNetworkApproximator
  • update_freq: update policy every n timesteps
  • γ = 0.99f0: discount factor
  • ϵ_η = 0.02f0: temperature η hyperparameter
  • ϵ_α = 0.1f0: Lagrange multiplier α (discrete) hyperparameter
  • ϵ_αμ = 0.005f0: Lagrange multiplier α_mu (continuous) hyperparameter
  • ϵ_ασ = 0.00005f0: Lagrange multiplier α_σ (continuous) hyperparameter
  • n_epochs = 8: update policy for n epochs
  • dist = Categorical: Categorical - discrete, Normal - continuous
  • rng = Random.GLOBAL_RNG

By default, dist is set to Categorical, which means it only works on environments with discrete actions. To work with environments with continuous actions, dist should be set to Normal and the actor in the approximator should be a GaussianNetwork. This algorithm only supports one-dimensional action spaces for now.

Ref paper

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

source
ReinforcementLearningZoo.VPGPolicyType

Vanilla Policy Gradient

VPGPolicy(;kwargs)

Keyword arguments

  • approximator,
  • baseline,
  • dist, distribution function of the action
  • γ, discount factor
  • α_θ, step size of policy parameter
  • α_w, step size of baseline parameter
  • batch_size,
  • rng,
  • loss,
  • baseline_loss,

If the action space is continuous, the env should transform the action value (e.g. using tanh) to ensure low ≤ value ≤ high.

source
ReinforcementLearningCore._runFunction

Many policy-gradient-based algorithms require the env to be a MultiThreadEnv to increase diversity during training, so the training pipeline is different from the default one in RLCore.

source
ReinforcementLearningZoo.cfr!Function

Symbol meanings:

  • π: reach probability
  • π′: new reach probability
  • π₋ᵢ: the opponents' reach probability
  • p: the player to update; nothing means a simultaneous update
  • w: weight
  • v: the counterfactual value before being weighted by the opponent's reach probability
  • V: a vector containing v after taking each action at the current information set; used to calculate the regret value

source
ReinforcementLearningZoo.policy_evaluation!Method
policy_evaluation!(;V, π, model, γ, θ)

Keyword arguments

  • V, an AbstractApproximator.
  • π, an AbstractPolicy.
  • model, a distribution-based environment model (given a state-action pair, it returns all possible rewards, next states, termination info, and the corresponding probabilities).
  • γ::Float64, discount rate.
  • θ::Float64, threshold to stop evaluation.
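
A usage sketch; my_distribution_model is a hypothetical model satisfying the contract above, and the TabularVApproximator/TabularRandomPolicy constructors are assumptions borrowed from RLCore.

using ReinforcementLearning   # assumed to re-export the names used below

# `my_distribution_model` is a hypothetical model that, given a state-action
# pair, returns all possible (reward, termination, next state) outcomes
# together with their probabilities.
V = TabularVApproximator(n_state = 10)   # assumed constructor
π = TabularRandomPolicy()                # assumed constructor; any AbstractPolicy works
policy_evaluation!(V = V, π = π, model = my_distribution_model, γ = 0.9, θ = 1e-6)
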
source