# Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl

This is the phase 2 technical report of the summer OSPP project "Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl". The report is updated continuously during the project and currently contains the weekly plans and the completed work. Once all the work is finished, it will be reorganized into a complete version.

### Table of contents

1. Technical Report

# Technical Report

This is the second-phase technical report of the OSPP project "Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl". It currently has two parts: weekly plans and completed work.

## Weekly Plans

#### Week 7

This week, we will prepare to implement the FisherBRC algorithm. FisherBRC is an improved version of SAC and BRAC.

• In the first step, it trains a behavior policy with an entropy term (following the official code). We need to implement a simple component for this, because the existing BehaviorCloningPolicy neither contains an entropy term nor supports continuous action spaces.

Besides, we need some additional components for tasks with continuous action spaces. For example, we have only implemented GaussianNetwork so far, but a Gaussian Mixture Model is needed to handle many complex tasks.

#### Week 8

Last week we finished the FisherBRC algorithm.

This week we will read the BEAR and UWAC (an improvement on BEAR) papers and their Python implementations.

We will also continue to design and implement the GMM (Gaussian Mixture Model).

## Completed Work

#### Offline RL algorithms

##### FisherBRC

The pseudocode of FisherBRC:

First, a behavior policy $\mu$ is pre-trained by behavior cloning. The official Python implementation adds an entropy term to the negative log-likelihood of the actions in a given state:

$\mathcal{L}(\mu) = \mathbb{E}[-\log \mu(a|s) + \alpha \mathcal{H}(\mu)]$

Besides, it automatically adjusts the entropy coefficient $\alpha$, as in SAC:

$J(\alpha) = -\alpha \mathbb{E}_{a_t\sim \mu_t}[\log\mu(a_t|s_t) + \bar{\mathcal{H}}]$

where $\bar{\mathcal{H}}$ is the target entropy. However, the BehaviorCloningPolicy in ReinforcementLearningZoo.jl does not contain an entropy term and does not support continuous action spaces, so we define EntropyBC:

```julia
mutable struct EntropyBC{A<:NeuralNetworkApproximator}
    policy::A
    α::Float32
    lr_alpha::Float32
    target_entropy::Float32
    # Logging
    policy_loss::Float32
end
```

Users only need to set the policy and lr_alpha parameters. policy usually uses a GaussianNetwork, and lr_alpha is the learning rate of the entropy coefficient α. target_entropy is set to $-\dim(\mathcal{A})$, where $\mathcal{A}$ is the action space.
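The automatic adjustment of α can be illustrated in plain Julia. This is a minimal sketch, not the code in ReinforcementLearningZoo.jl: the helper `update_alpha` and the sample log-probabilities are hypothetical, and the step uses the analytic gradient of the $J(\alpha)$ objective above, $\partial J / \partial\alpha = -\mathbb{E}[\log\mu(a|s) + \bar{\mathcal{H}}]$.

```julia
using Statistics: mean

# Hypothetical helper: one gradient-descent step on
# J(α) = -α * E[log μ(a|s) + target_entropy].
function update_alpha(α::Float32, log_μ::Vector{Float32};
                      lr_alpha::Float32, target_entropy::Float32)
    grad = -mean(log_μ .+ target_entropy)   # ∂J/∂α
    return α - lr_alpha * grad
end

action_dim = 1                               # e.g. a task with a 1-D action
target_entropy = -Float32(action_dim)        # -dim(A), as described above
log_μ = Float32[-0.5, -1.5, -1.0]            # made-up log-probabilities of sampled actions

α_new = update_alpha(0.2f0, log_μ; lr_alpha=0.01f0, target_entropy=target_entropy)
```

In this example the sampled entropy estimate (the mean of `-log_μ`, here 1.0) is above the target of -1.0, so the step lowers α from 0.2 to about 0.18.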

Afterwards, the FisherBRC learner is updated. When updating the actor, it adds an entropy term to the Q-value loss and adjusts the entropy coefficient automatically. The critic is updated according to:

$\min_\theta J(O_\theta + \log\mu(a|s)) + \lambda \mathbb{E}_{s\sim D, a\sim \pi_\phi(\cdot|s)}[\|\nabla_a O_\theta(s,a)\|^2]$

A few key concepts need to be introduced here. $J$ is the standard Q-value loss function, and $O_\theta(s,a)$ is the offset network, defined by:

$Q_\theta(s,a) = O_\theta(s,a) + \log\mu(a|s)$
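A short derivation shows why the action-gradient of the offset is the natural quantity to penalize. It is a sketch under one assumption, taken from the FisherBRC paper rather than from this report: the critic induces the Boltzmann policy $p(a|s) \propto \exp(Q_\theta(s,a))$, whose normalizer does not depend on $a$. Then:

$\nabla_a \log p(a|s) = \nabla_a Q_\theta(s,a) = \nabla_a O_\theta(s,a) + \nabla_a \log\mu(a|s)$

$\nabla_a \log p(a|s) - \nabla_a \log\mu(a|s) = \nabla_a O_\theta(s,a)$

So the squared norm of $\nabla_a O_\theta(s,a)$ measures how far the score of the critic-induced policy is from the score of the behavior policy.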

Learning the offset $O_\theta(s,a)$ rather than $Q_\theta(s,a)$ directly still provides a rich representation of Q-values. However, without further constraints, this parameterization can put us back in the fully parameterized $Q_\theta$ regime of vanilla actor-critic. FisherBRC therefore uses a gradient penalty regularizer of the form $\|\nabla_a O_\theta(s,a)\|^2$. The implementation is as follows:

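```julia
a_policy = l.policy(l.rng, s; is_sampling=true)
q_grad_1 = gradient(Flux.params(l.qnetwork1)) do
    # q_input, log_μ and y are defined elsewhere in the update
    q1 = l.qnetwork1(q_input) |> vec
    # Gradient of the offset network with respect to the policy actions
    q1_grad_norm = gradient(Flux.params([a_policy])) do
        q1_reg = mean(l.qnetwork1(vcat(s, a_policy)))
    end
    reg = mean(q1_grad_norm[a_policy] .^ 2)
    loss = mse(q1 .+ log_μ, y) + l.f_reg * reg  # y is target value
end
```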
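To see what the regularizer computes in isolation, the following sketch evaluates the penalty $\mathbb{E}[\|\nabla_a O(s,a)\|^2]$ on a toy offset function. Everything in it is an assumption made for illustration: the function `offset`, the helpers `grad_a` and `fisher_penalty`, and the sample data are hypothetical, and the action gradient is estimated by central finite differences instead of the automatic differentiation used in the learner.

```julia
using Statistics: mean

# Toy offset function, chosen only because its action gradient is known
# in closed form: ∇ₐ O(s, a) = 2a.
offset(s, a) = sum(0.5 .* s) + sum(a .^ 2)

# Central finite-difference estimate of ∇ₐ O(s, a) for one (s, a) pair.
function grad_a(O, s, a; ϵ=1e-5)
    g = similar(a)
    for i in eachindex(a)
        e = zeros(length(a))
        e[i] = ϵ
        g[i] = (O(s, a .+ e) - O(s, a .- e)) / (2ϵ)
    end
    return g
end

# Mean over the batch of the squared gradient norm ‖∇ₐ O(s, a)‖².
fisher_penalty(O, S, A) = mean(sum(abs2, grad_a(O, s, a)) for (s, a) in zip(S, A))

S = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5]]   # two made-up states
A = [[1.0], [-0.5]]                        # made-up actions standing in for policy samples
penalty = fisher_penalty(offset, S, A)     # ≈ mean(4.0, 1.0) = 2.5
```

In the critic loss this quantity would be multiplied by the regularization weight (f_reg in the learner) and added to the Q-value loss.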

```julia
mutable struct FisherBRCLearner{BA1, BC1, BC2, R} <: AbstractLearner
    ### Omit other parameters
    policy::BA1
    behavior_policy::EntropyBC
    qnetwork1::BC1
    qnetwork2::BC2
    target_qnetwork1::BC1
    target_qnetwork2::BC2
    α::Float32
    f_reg::Float32
    reward_bonus::Float32
    pretrain_step::Int
    lr_alpha::Float32
    target_entropy::Float32
end
```

f_reg is the weight of the gradient penalty $\|\nabla_a O_\theta(s,a)\|^2$. reward_bonus, generally set to 5, is added to the reward to improve performance. pretrain_step sets how many steps are used to pre-train behavior_policy. α, lr_alpha and target_entropy are used to add the entropy term and adjust it automatically.
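As an illustration of where reward_bonus enters, the sketch below builds a critic target with the bonus added to the reward. It is not taken from the learner's source: the helper `td_target`, the discount `γ`, and the use of the minimum of two target critic values (clipped double-Q, as in SAC) are assumptions, and the numbers are made up.

```julia
# Hypothetical helper: TD target with the reward bonus applied.
# Assumes clipped double-Q (minimum of the two target critics) and discount γ.
function td_target(r, done, q1_next, q2_next; reward_bonus=5f0, γ=0.99f0)
    return (r .+ reward_bonus) .+ γ .* (1 .- done) .* min.(q1_next, q2_next)
end

r       = Float32[-1.0, -2.0]    # rewards from the offline dataset (made up)
done    = Float32[0, 1]          # 1 marks a terminal transition
q1_next = Float32[10.0, 3.0]     # target critic 1 at the next state-action pair
q2_next = Float32[8.0, 4.0]      # target critic 2 at the next state-action pair

y = td_target(r, done, q1_next, q2_next)   # ≈ Float32[11.92, 3.0]
```

The second transition is terminal, so its target is just the shifted reward, -2 + 5 = 3.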

Performance curve of the FisherBRC algorithm in Pendulum (pretrain_step=100):