ReinforcementLearningZoo.jl

ReinforcementLearningZoo.A2CLearner (Type)
A2CLearner(;kwargs...)

Keyword arguments

  • approximator::ActorCritic
  • γ::Float32, reward discount rate.
  • actor_loss_weight::Float32
  • critic_loss_weight::Float32
  • entropy_loss_weight::Float32
  • update_freq::Int, usually set to the same value as the length of the trajectory.
ReinforcementLearningZoo.BasicDQNLearner (Type)
BasicDQNLearner(;kwargs...)

See paper: Playing Atari with Deep Reinforcement Learning

This is the very basic implementation of DQN. Compared to traditional Q-learning, the only difference is that, in the update step, it uses a batch of transitions sampled from an experience buffer instead of the current transition alone, and the approximator is usually a NeuralNetworkApproximator. You can start from this implementation to understand how everything is organized and how to write your own customized algorithm. A hedged construction sketch follows the keyword list below.

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • loss_func: the loss function to use.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • rng=Random.GLOBAL_RNG
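
A minimal construction sketch (not part of the original docstring), assuming a small fully connected Q-network built with Flux; ns and na are placeholder state and action sizes, and the NeuralNetworkApproximator keywords (model, optimizer) are taken from ReinforcementLearningCore.

  using ReinforcementLearning, Flux

  ns, na = 4, 2  # placeholder state size and number of actions

  learner = BasicDQNLearner(
      approximator = NeuralNetworkApproximator(
          model = Chain(Dense(ns, 64, relu), Dense(64, na)),
          optimizer = ADAM(),
      ),
      loss_func = Flux.Losses.huber_loss,
      γ = 0.99f0,
      batch_size = 32,
      min_replay_history = 100,
  )
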
ReinforcementLearningZoo.BestResponsePolicy (Method)
BestResponsePolicy(policy, env, best_responder)
  • policy, the original policy to be wrapped in the best response policy.
  • env, the environment to handle.
  • best_responder, the player for whom the best response action is chosen.
ReinforcementLearningZoo.DDPGPolicy (Method)
DDPGPolicy(;kwargs...)

Keyword arguments

  • behavior_actor,
  • behavior_critic,
  • target_actor,
  • target_critic,
  • start_policy,
  • γ = 0.99f0,
  • ρ = 0.995f0,
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_every = 50,
  • act_limit = 1.0,
  • act_noise = 0.1,
  • step = 0,
  • rng = Random.GLOBAL_RNG,
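
A hedged construction sketch for a one-dimensional continuous action, assuming Flux networks wrapped in NeuralNetworkApproximators, a critic that takes the concatenated state-action vector, and a RandomPolicy over an IntervalSets range as the start_policy; ns, na and the layer sizes are placeholders.

  using ReinforcementLearning, Flux, IntervalSets

  ns, na = 3, 1

  create_actor() = NeuralNetworkApproximator(
      model = Chain(Dense(ns, 64, relu), Dense(64, na, tanh)),
      optimizer = ADAM(),
  )
  create_critic() = NeuralNetworkApproximator(
      model = Chain(Dense(ns + na, 64, relu), Dense(64, 1)),  # assumes vcat(state, action) input
      optimizer = ADAM(),
  )

  policy = DDPGPolicy(
      behavior_actor = create_actor(),
      behavior_critic = create_critic(),
      target_actor = create_actor(),
      target_critic = create_critic(),
      start_policy = RandomPolicy(-1.0..1.0),  # assumed action range for initial exploration
      act_limit = 1.0,
      act_noise = 0.1,
  )
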
ReinforcementLearningZoo.DQNLearner (Method)
DQNLearner(;kwargs...)

See paper: Human-level control through deep reinforcement learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS: set it to SLARTSL if you apply it to an environment of FULL_ACTION_SET.
  • rng = Random.GLOBAL_RNG
  • is_enable_double_DQN::Bool=true: whether to enable double DQN. Enabled by default.
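
A minimal construction sketch, assuming the same Flux Chain is used to build both the online and the target approximator; ns, na and the hyperparameter values are placeholders.

  using ReinforcementLearning, Flux

  ns, na = 4, 2
  build_model() = Chain(Dense(ns, 128, relu), Dense(128, na))

  learner = DQNLearner(
      approximator = NeuralNetworkApproximator(model = build_model(), optimizer = ADAM()),
      target_approximator = NeuralNetworkApproximator(model = build_model(), optimizer = ADAM()),
      loss_func = Flux.Losses.huber_loss,
      stack_size = nothing,       # no frame stacking for a vector-state environment
      min_replay_history = 100,
      update_freq = 1,
      target_update_freq = 100,
  )
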
ReinforcementLearningZoo.DeepCFR (Type)
DeepCFR(;kwargs...)

Symbols used here follow the paper: Deep Counterfactual Regret Minimization

Keyword arguments

  • K, the number of traversals.
  • t, the number of iterations.
  • Π, the policy network.
  • V, a dictionary of each player's advantage network.
  • MΠ, a strategy memory.
  • MV, a dictionary of each player's advantage memory.
  • reinitialize_freq=1, the frequency of reinitializing the value networks.
ReinforcementLearningZoo.DoubleLearner (Type)
DoubleLearner(;L1, L2, rng=Random.GLOBAL_RNG)

This is a meta-learner: at each update it randomly selects one learner and updates the other. The estimate of an observation is the sum of the results from the two learners.

ReinforcementLearningZoo.DuelingNetwork (Type)
DuelingNetwork(;base, val, adv)

The dueling network produces separate estimates of the state value function and the advantage function. The expected output size of val is 1, and that of adv is the size of the action space.
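
A sketch for a vector state of size ns with na actions (both placeholders); the resulting network can then serve as the model of a NeuralNetworkApproximator in, for example, a DQNLearner.

  using ReinforcementLearning, Flux

  ns, na = 4, 2
  model = DuelingNetwork(
      base = Chain(Dense(ns, 128, relu)),  # shared trunk
      val  = Dense(128, 1),                # state-value head, output size 1
      adv  = Dense(128, na),               # advantage head, one output per action
  )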

ReinforcementLearningZoo.IQNLearner (Type)
IQNLearner(;kwargs...)

See paper: Implicit Quantile Networks for Distributional Reinforcement Learning

Keyword arguments

  • approximator, an ImplicitQuantileNet
  • target_approximator, an ImplicitQuantileNet, must have the same structure as approximator
  • κ = 1.0f0,
  • N = 32,
  • N′ = 32,
  • Nₑₘ = 64,
  • K = 32,
  • γ = 0.99f0,
  • stack_size = 4,
  • batch_size = 32,
  • update_horizon = 1,
  • min_replay_history = 20000,
  • update_freq = 4,
  • target_update_freq = 8000,
  • update_step = 0,
  • default_priority = 1.0f2,
  • β_priority = 0.5f0,
  • rng = Random.GLOBAL_RNG,
  • device_seed = nothing,
ReinforcementLearningZoo.MinimaxPolicy (Type)
MinimaxPolicy(;value_function, maximum_depth::Int)

The minimax algorithm with Alpha-beta pruning

Keyword Arguments

  • maximum_depth::Int=30, the maximum depth of search.
  • value_function=nothing, estimates the value of env (value_function(env) -> Number). It is only called once the search reaches maximum_depth and the env has not terminated yet.
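
A hedged usage sketch; my_heuristic is a hypothetical user-supplied evaluation function that returns a Number for a non-terminal env once maximum_depth is reached.

  policy = MinimaxPolicy(
      maximum_depth = 10,
      value_function = env -> my_heuristic(env),  # hypothetical heuristic evaluation
  )
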
ReinforcementLearningZoo.MonteCarloLearner (Type)
MonteCarloLearner(;kwargs...)

Use the Monte Carlo method to estimate the state value or state-action value.

Fields

  • approximator::TabularApproximator, can be either TabularVApproximator or TabularQApproximator.
  • γ=1.0, discount rate.
  • kind=FIRST_VISIT. Optional values are FIRST_VISIT or EVERY_VISIT.
  • sampling=NO_SAMPLING. Optional values are NO_SAMPLING, WEIGHTED_IMPORTANCE_SAMPLING or ORDINARY_IMPORTANCE_SAMPLING.
ReinforcementLearningZoo.MultiThreadEnv (Type)
MultiThreadEnv(envs::Vector{<:AbstractEnv})

Wrap multiple instances of the same environment type into one environment. Each environment runs in parallel by leveraging Threads.@spawn, so remember to set the environment variable JULIA_NUM_THREADS!
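
A usage sketch, assuming CartPoleEnv from ReinforcementLearningEnvironments is available; start Julia with JULIA_NUM_THREADS set so the copies actually run on separate threads.

  using ReinforcementLearning

  # e.g. launched with: JULIA_NUM_THREADS=4 julia
  env = MultiThreadEnv([CartPoleEnv() for _ in 1:4])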

ReinforcementLearningZoo.PPOPolicy (Type)
PPOPolicy(;kwargs...)

Keyword arguments

  • approximator,
  • γ = 0.99f0,
  • λ = 0.95f0,
  • clip_range = 0.2f0,
  • max_grad_norm = 0.5f0,
  • n_microbatches = 4,
  • n_epochs = 4,
  • actor_loss_weight = 1.0f0,
  • critic_loss_weight = 0.5f0,
  • entropy_loss_weight = 0.01f0,
  • dist = Categorical,
  • rng = Random.GLOBAL_RNG,

By default, dist is set to Categorical, which means the policy only works on environments with discrete actions. To work with environments with continuous actions, dist should be set to Normal and the actor in the approximator should be a GaussianNetwork. Using a GaussianNetwork supports multi-dimensional action spaces, but only under the assumption that the dimensions are independent, since the GaussianNetwork outputs a single μ and σ for each dimension in order to simplify the calculations.
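
A hedged sketch of the continuous-action setup, assuming PPOPolicy takes an ActorCritic approximator (as A2CLearner above does) and that GaussianNetwork accepts pre, μ and logσ heads; check the constructors in your version. ns, na and the layer sizes are placeholders.

  using ReinforcementLearning, Flux, Distributions

  ns, na = 3, 1
  approximator = ActorCritic(
      actor = GaussianNetwork(
          pre = Chain(Dense(ns, 64, relu)),  # shared feature extractor (assumed keyword)
          μ = Dense(64, na),                 # mean head
          logσ = Dense(64, na),              # log standard deviation head
      ),
      critic = Chain(Dense(ns, 64, relu), Dense(64, 1)),
      optimizer = ADAM(),
  )

  policy = PPOPolicy(
      approximator = approximator,
      dist = Normal,   # Categorical (the default) is for discrete actions
  )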

ReinforcementLearningZoo.PrioritizedDQNLearner (Type)
PrioritizedDQNLearner(;kwargs...)

See paper: Prioritized Experience Replay. See also https://danieltakeshi.github.io/2019/07/14/per/

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • default_priority::Float64=100.0: the default priority for newly added transitions.
  • rng = Random.GLOBAL_RNG
ReinforcementLearningZoo.QRDQNLearner (Method)
QRDQNLearner(;kwargs...)

See paper: Distributional Reinforcement Learning with Quantile Regression

Keywords

  • approximator::AbstractApproximator: used to get quantile-values of a batch of states. The output should be of size (n_quantile, n_action).
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the quantile values of the next state batch.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=1: the frequency of updating the approximator.
  • n_quantile::Int=1: the number of quantiles.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS, set it to SLARTSL if you apply it to an environment of FULL_ACTION_SET.
  • loss_func=quantile_huber_loss.
ReinforcementLearningZoo.REMDQNLearner (Method)
REMDQNLearner(;kwargs...)

See paper: An Optimistic Perspective on Offline Reinforcement Learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • ensemble_num::Int=1: the number of ensemble approximators.
  • ensemble_method::Symbol=:rand: the method of combining Q values. ':rand' represents random ensemble mixture, and ':mean' is the average.
  • target_update_freq::Int=100: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • traces = SARTS, set it to SLARTSL if you apply it to an environment of FULL_ACTION_SET.
  • rng = Random.GLOBAL_RNG
ReinforcementLearningZoo.RainbowLearner (Type)
RainbowLearner(;kwargs...)

See paper: Rainbow: Combining Improvements in Deep Reinforcement Learning

Keywords

  • approximator::AbstractApproximator: used to get Q-values of a state.
  • target_approximator::AbstractApproximator: similar to approximator, but used to estimate the target (the next state).
  • loss_func: the loss function. It is recommended to use Flux.Losses.logitcrossentropy; Flux.Losses.crossentropy can run into problems with negative numbers.
  • Vₘₐₓ::Float32: the maximum value of distribution.
  • Vₘᵢₙ::Float32: the minimum value of distribution.
  • n_actions::Int: number of possible actions.
  • γ::Float32=0.99f0: discount rate.
  • batch_size::Int=32
  • update_horizon::Int=1: length of update ('n' in n-step update).
  • min_replay_history::Int=32: number of transitions that should be experienced before updating the approximator.
  • update_freq::Int=4: the frequency of updating the approximator.
  • target_update_freq::Int=500: the frequency of syncing target_approximator.
  • stack_size::Union{Int, Nothing}=4: use the recent stack_size frames to form a stacked state.
  • default_priority::Float32=1.0f2: the default priority for newly added transitions. It must be >= 1.
  • n_atoms::Int=51: the number of buckets of the value function distribution.
  • rng = Random.GLOBAL_RNG
ReinforcementLearningZoo.SACPolicy (Method)
SACPolicy(;kwargs...)

Keyword arguments

  • policy,
  • qnetwork1,
  • qnetwork2,
  • target_qnetwork1,
  • target_qnetwork2,
  • start_policy,
  • γ = 0.99f0,
  • ρ = 0.995f0,
  • α = 0.2f0,
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_every = 50,
  • step = 0,
  • rng = Random.GLOBAL_RNG,

policy is expected to output a tuple (μ, logσ) of means and log standard deviations for the desired action distributions. This can be implemented using a GaussianNetwork in a NeuralNetworkApproximator.

Implemented based on http://arxiv.org/abs/1812.05905
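
A hedged sketch of a policy approximator producing (μ, logσ), assuming GaussianNetwork accepts pre, μ and logσ heads (check the constructor in your version); ns, na and the layer sizes are placeholders.

  using ReinforcementLearning, Flux

  ns, na = 3, 1
  policy_net = NeuralNetworkApproximator(
      model = GaussianNetwork(
          pre = Chain(Dense(ns, 64, relu)),  # shared feature extractor (assumed keyword)
          μ = Dense(64, na),                 # mean head
          logσ = Dense(64, na),              # log standard deviation head
      ),
      optimizer = ADAM(),
  )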

ReinforcementLearningZoo.TD3Policy (Method)
TD3Policy(;kwargs...)

Keyword arguments

  • behavior_actor,
  • behavior_critic,
  • target_actor,
  • target_critic,
  • start_policy,
  • γ = 0.99f0,
  • ρ = 0.995f0,
  • batch_size = 32,
  • start_steps = 10000,
  • update_after = 1000,
  • update_every = 50,
  • policy_freq = 2, # the frequency with which the actor performs a gradient step and the critic target is updated
  • target_act_limit = 1.0, # clipping limit for the noise added to the target action
  • target_act_noise = 0.1, # standard deviation of the noise added to the target action
  • act_limit = 1.0, # the limit applied to the output action
  • act_noise = 0.1, # standard deviation of the exploration noise added when outputting an action
  • step = 0,
  • rng = Random.GLOBAL_RNG,
ReinforcementLearningZoo.TabularCFRPolicy (Method)
TabularCFRPolicy(;kwargs...)

Some useful papers while implementing this algorithm:

Keyword Arguments

  • is_alternating_update=true: If true, the players are updated alternately.
  • is_reset_neg_regrets=true: Whether to use regret matching⁺.
  • is_linear_averaging=true
  • weighted_averaging_delay=0. The averaging delay in number of iterations. Only valid when is_linear_averaging is set to true.
  • state_type=String, the data type of information set.
  • rng=Random.GLOBAL_RNG
ReinforcementLearningZoo.VPGPolicy (Type)

Vanilla Policy Gradient

VPGPolicy(;kwargs...)

Keyword arguments

  • approximator,
  • baseline,
  • dist, distribution function of the action
  • γ, discount factor
  • α_θ, step size of policy parameter
  • α_w, step size of baseline parameter
  • batch_size,
  • rng,
  • loss,
  • baseline_loss,

If the action space is continuous, the env should transform the action value (for example with tanh) to make sure low ≤ value ≤ high.
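
A sketch of the kind of transformation an env can apply to a raw, unbounded action value to keep it inside [low, high]; the function name squash is illustrative only.

  # map an unbounded action a into [low, high] via tanh
  squash(a, low, high) = low + (tanh(a) + 1) / 2 * (high - low)

  squash(10.0, -2.0, 2.0)  # ≈ 2.0 (upper bound)
  squash(0.0, -2.0, 2.0)   # 0.0 (midpoint)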

ReinforcementLearningCore._run (Function)

Many policy-gradient-based algorithms require the env to be a MultiThreadEnv to increase diversity during training, so the training pipeline differs from the default one in RLCore.

ReinforcementLearningZoo.cfr! (Function)

Symbol meanings:

  • π: the reach prob
  • π′: the new reach prob
  • π₋ᵢ: the opponents' reach prob
  • p: the player to update. nothing means simultaneous update.
  • w: weight
  • v: the counterfactual value before being weighted by the opponents' reach prob
  • V: a vector containing the v after taking each action at the current information set; used to calculate the regret value

ReinforcementLearningZoo.policy_evaluation! (Method)
policy_evaluation!(;V, π, model, γ, θ)

Keyword arguments

  • V, an AbstractApproximator.
  • π, an AbstractPolicy.
  • model, a distribution-based environment model (given a state-action pair, it returns all possible rewards, next states, termination info and the corresponding probabilities).
  • γ::Float64, discount rate.
  • θ::Float64, threshold to stop evaluation.
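
An illustrative, self-contained version of iterative policy evaluation on a tabular problem. It mirrors the update described above but uses plain arrays and functions instead of the package types: model(s, a) is assumed to return an iterable of (probability, reward, next state, terminal) tuples and π(s) a vector of action probabilities.

  # sweep over all states until the largest value change is below θ
  function tabular_policy_evaluation!(V, π, model, n_states, n_actions; γ = 0.9, θ = 1e-6)
      while true
          Δ = 0.0
          for s in 1:n_states
              v = 0.0
              for a in 1:n_actions, (p, r, s′, terminal) in model(s, a)
                  v += π(s)[a] * p * (r + (terminal ? 0.0 : γ * V[s′]))
              end
              Δ = max(Δ, abs(v - V[s]))
              V[s] = v
          end
          Δ < θ && return V
      end
  end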