Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl

This is the phase 2 technical report of the summer OSPP project Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl. This report will be continuously updated during the project. Currently, this report contains specific weekly plans and completed work. After all the work is completed, this report will be organized into a complete version.

Table of Contents

  1. Technical Report
    1. Weekly Plans
        1. Week 7
        2. Week 8
    2. Completed Work
        1. Offline RL algorithms
          1. FisherBRC

Technical Report

This is the phase-2 technical report of the OSPP project "Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl". It currently includes two components: weekly plans and completed work.

Weekly Plans

Week 7

This week, we will prepare to implement the FisherBRC algorithm. FisherBRC is an improved variant of SAC and BRAC.

Besides, we need some additional components for tasks with continuous action spaces. For example, we have only implemented GaussianNetwork so far, but we need a Gaussian Mixture Model (GMM) to handle more complex tasks.
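The GMM component is still being designed, so the snippet below is only a sketch of the core computation such a policy head must provide: the log-density of a diagonal Gaussian mixture, evaluated with a numerically stable log-sum-exp. All names are illustrative, not the final API.

```julia
# Sketch: log-density of a diagonal Gaussian mixture, the quantity a
# GMM policy head needs for log-likelihood training.

# log N(x; μ, σ) of a diagonal Gaussian, summed over action dimensions
gaussian_logpdf(x, μ, σ) = sum(@. -0.5 * ((x - μ) / σ)^2 - log(σ) - 0.5 * log(2π))

# Numerically stable log-sum-exp
logsumexp(v) = (m = maximum(v); m + log(sum(exp.(v .- m))))

# log p(a) = logsumexp_k( log w_k + log N(a; μ_k, σ_k) )
gmm_logpdf(a, w, μs, σs) =
    logsumexp([log(wk) + gaussian_logpdf(a, μk, σk) for (wk, μk, σk) in zip(w, μs, σs)])

# Two-component mixture over a 1-D action
lp = gmm_logpdf([0.0], [0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]])
```

A single component with weight 1 reduces to the plain Gaussian log-density, which is a convenient sanity check for the implementation.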

Week 8

Last week we finished the FisherBRC algorithm.

This week we will read the BEAR and UWAC papers (UWAC is an improvement of BEAR), along with their Python implementations.

We will also continue to design and implement the GMM (Gaussian Mixture Model).

Completed Work

Offline RL algorithms

FisherBRC

The pseudocode of FisherBRC can be found in the original paper.

Firstly, it needs to pre-train a behavior policy μ by behavior cloning. The official Python implementation adds an entropy term to the negative log-likelihood of actions given a state. Mathematical formulation:

\mathcal{L}(\mu) = \mathbb{E}[-\log \mu(a|s) + \alpha \mathcal{H}(\mu)]
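For a diagonal Gaussian behavior policy, both terms of this objective have closed forms, so the loss can be sketched directly. The snippet follows the sign convention of the formula above; names and the single-sample form are illustrative.

```julia
# Sketch of the pre-training objective L(μ) = E[-log μ(a|s) + α·H(μ)]
# for a diagonal Gaussian policy with mean μ and standard deviation σ.

gaussian_logpdf(a, μ, σ) = sum(@. -0.5 * ((a - μ) / σ)^2 - log(σ) - 0.5 * log(2π))

# Closed-form entropy of a diagonal Gaussian: Σᵢ 0.5·log(2πe·σᵢ²)
gaussian_entropy(σ) = sum(@. 0.5 * log(2π * ℯ * σ^2))

# Behavior-cloning loss on a single (s, a) pair; α is the entropy coefficient
bc_loss(a, μ, σ, α) = -gaussian_logpdf(a, μ, σ) + α * gaussian_entropy(σ)
```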

Besides, it automatically adjusts the entropy term like SAC:

J(\alpha) = -\alpha \mathbb{E}_{a_t \sim \mu}[\log \mu(a_t|s_t) + \bar{\mathcal{H}}]

where \bar{\mathcal{H}} is the target entropy. However, in ReinforcementLearningZoo.jl, BehaviorCloningPolicy contains no entropy term and does not support continuous action spaces. So, we define EntropyBC:

mutable struct EntropyBC{A<:NeuralNetworkApproximator}
    policy::A
    α::Float32
    lr_alpha::Float32
    target_entropy::Float32
    # Logging
    loss::Float32
end

Users only need to set the parameters policy and lr_alpha. policy is usually a GaussianNetwork, and lr_alpha is the learning rate of the entropy coefficient α. target_entropy is set to -\dim(\mathcal{A}), where \mathcal{A} is the action space.
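The automatic adjustment of α can be sketched as plain gradient descent on J(α). The numbers below, and the choice to optimize log α so that α stays positive, are illustrative and not the library's exact code.

```julia
# Toy sketch of SAC-style entropy tuning: minimize J(α) = -α·E[log μ(a|s) + H̄]
# by gradient descent on log α, which keeps α positive.

target_entropy = -1.0   # H̄, typically -dim(A); here dim(A) = 1
lr_alpha = 1e-2
log_α = 0.0

# Stand-in log-probabilities of actions sampled from the current policy
logps = [-0.2, -1.5, -0.7]

for _ in 1:100
    α = exp(log_α)
    # ∂J/∂α = -E[log μ(a|s) + H̄]; the chain rule adds a factor of α for log α
    grad = -(sum(logps) / length(logps) + target_entropy)
    global log_α -= lr_alpha * α * grad
end
α = exp(log_α)
# Here the policy's entropy (0.8) exceeds the target (-1.0), so α decreases.
```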

Afterwards, the FisherBRC learner is updated. When updating the actor, it adds an entropy term to the Q-value loss and automatically adjusts the entropy coefficient. The critic is updated with the following objective:

\min_\theta J(O_\theta + \log \mu(a|s)) + \lambda \mathbb{E}_{s \sim D,\, a \sim \pi_\phi(\cdot|s)}[\|\nabla_a O_\theta(s,a)\|^2]

There are a few key concepts that need to be introduced. J is the standard Q-value loss function, and O_\theta(s,a) is the offset network:

Q_\theta(s,a) = O_\theta(s,a) + \log \mu(a|s)
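This composition can be expressed as a one-line function. O and log_μ below are trivial stand-ins for the offset network and the pre-trained behavior policy's log-density, just to make the parameterization concrete.

```julia
# The critic parameterizes O_θ; Q is recovered by adding log μ(a|s).
O(s, a) = sum(s .* a)             # stand-in offset network
log_μ(a, s) = -0.5 * sum(a .^ 2)  # stand-in behavior log-density (up to a constant)

Q(s, a) = O(s, a) + log_μ(a, s)
```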

Parameterizing O_\theta(s,a) instead of Q_\theta(s,a) provides a richer representation of Q-values. However, this parameterization can potentially put us back in the fully-parameterized Q_\theta regime of vanilla actor-critic, so a gradient penalty regularizer of the form \|\nabla_a O_\theta(s,a)\| is used. The implementation is as follows:

a_policy = l.policy(l.rng, s; is_sampling=true)
q_grad_1 = gradient(Flux.params(l.qnetwork1)) do
    q1 = l.qnetwork1(q_input) |> vec  # q_input = vcat(s, a)
    q1_grad_norm = gradient(Flux.params([a_policy])) do
        q1_reg = mean(l.qnetwork1(vcat(s, a_policy)))
    end
    reg = mean(q1_grad_norm[a_policy] .^ 2)
    loss = mse(q1 .+ log_μ, y) + l.f_reg * reg  # y is the target value
end
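The penalty term can also be illustrated numerically. Below, a finite-difference gradient stands in for the reverse-mode gradient that Zygote computes in the learner, and O is a toy offset function; both are illustrative only.

```julia
# Toy illustration of the regularizer λ·E[‖∇_a O(s,a)‖²] with a
# finite-difference gradient in place of automatic differentiation.
O(s, a) = sum(s .* a) + sum(a .^ 2)  # toy offset network: ∇_a O = s + 2a

function grad_penalty(O, s, a; λ = 0.1, ϵ = 1e-5)
    g = zero(a)
    for i in eachindex(a)
        hi = copy(a); hi[i] += ϵ
        lo = copy(a); lo[i] -= ϵ
        g[i] = (O(s, hi) - O(s, lo)) / (2ϵ)  # central difference, exact for quadratics
    end
    λ * sum(g .^ 2)
end
```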

Please refer to this link for the full implementation (link). The main fields of the learner are as follows:

mutable struct FisherBRCLearner{BA1, BC1, BC2, R} <: AbstractLearner
    policy::BA1
    behavior_policy::EntropyBC
    f_reg::Float32
    reward_bonus::Float32
    pretrain_step::Int
    α::Float32
    lr_alpha::Float32
    target_entropy::Float32
    ### Omit other parameters
end

f_reg is the regularization weight λ of the gradient penalty \|\nabla_a O_\theta(s,a)\|^2. reward_bonus is generally set to 5 and is added to the reward to improve performance. pretrain_step is the number of steps used to pre-train the behavior_policy. α, lr_alpha, and target_entropy are the parameters used to add the entropy term and adjust it automatically.

Performance curve of the FisherBRC algorithm on Pendulum (pretrain_step=100):

On this task, FisherBRC performs better than online SAC.


If you see mistakes or want to suggest changes, please create an issue in the source repository.