An Introduction to ReinforcementLearning.jl: Design, Implementations and Thoughts

Reinforcement learning has led to many breakthroughs during the last several years. A number of packages mainly written in Python exist that provide stable and feature rich implementations of deep reinforcement learning algorithms. However, writing a reusable package that is both instructional and efficient is still challenging. To address this, we introduce ReinforcementLearning.jl, a package written in Julia which provides a collection of tools for reinforcement learning research. This blog aims to gain you a better understanding of the design decisions and implementation details behind this package. Hopefully, it will make your life easier to learn and apply different kinds of reinforcement learning algorithms.🤗

Table of content

Introduction

1.1 The Scope of ReinforcementLearning.jl

The goal of ReinforcementLearning.jl is to provide a collection of tools for learning and implementing reinforcement learning algorithms. This means that, unlike many other existing packages focusing on deep reinforcement learning only, we aim to build a rich ecosystem to solve different kinds of tasks in reinforcement learning, including but not limited to the following fields:

Tabular reinforcement learning
Deep reinforcement learning
Offline reinforcement learning
Multi-agent reinforcement learning
Model-based reinforcement learning

1.2 Why Julia?

Although most existing reinforcement learning related packages are written in Python with PyTorch or Tensorflow as the backend, we choose Julia because of the following advantages:

Multiple dispatch in Julia makes the code very easy to read, understand and extend. Without this feature, it's hard to imagine how to organize such a package targeting so many different tasks in an elegant way.
Composable Multi-threading boost the speed of interactions between agents and environments.
Julia is fast and easy to interact with other programming languages like Python, C, C++, fortran so that we can easily leverage many existing packages.

1.3 Related Work

Many existing packages inspired the development of ReinforcementLearning.jl a lot. Following are some important ones.

Dopamine provides a clear implementation of the Rainbow algorithm. The gin config file template and the concise workflow is the origin of the Experiment in ReinforcementLearning.jl.
OpenSpiel provides a lot of useful functions to describe many different kinds of games. They are turned into traits in our package.
Ray/rllib has many nice abstraction layers in the policy part. We also borrowed the definition of environments here. This is explained with details in section 2.
rlpyt has a nice code structure and we borrowed some implementations of policy gradient algorithms from it.
Acme offers a framework for distributed reinforcement learning.

Design

2.1 The General Workflow

The standard setting for reinforcement learning contains two parts: Agent and Environment. To make things general enough, we defined two abstract types: AbstractPolicy and AbstractEnv correspondingly.

A general workflow between policy and environment.

Figure 1: The general flow between a policy and an environment.

A policy simply takes a look at the environment and yields an action. And an environment will modify its internal state once received an action. So for continuing tasks, we can describe the workflow with the following code:

function Base.run(policy, env)
    while true
        env |> policy |> env
    end
end

For episodic tasks, two extra methods must be implemented for the env:

is_terminated(env::AbstractEnv) detects if the environment is already terminated or not.
reset!(env::AbstractEnv) resets the env to an initial state.

Then we have the following code:

function Base.run(policy, env)
    while true
        reset!(env)
        while !is_terminated(env)
            env |> policy |> env
        end
    end
end

In real world tasks, we never expect the workflow to run infinitely. Usually we will stop early after a number of steps/episodes or when the policy or environment meet some specific condition. Now let's introduce a stop_condition which examines policy and env after each step to control the workflow.

function Base.run(policy, env, stop_condition)
    while true
        reset!(env)
        while !is_terminated(env)
            env |> policy |> env
            stop_condition(policy, env) && return
        end
    end
end

Until now, a policy is still very general. We don't know it's actually updating or exploiting. However, in most cases, if a policy needs to be updated during interactions with the environment, it needs a buffer to collect some necessary information and then use it to update its strategy at some time. So a special policy named Agent is provided in our package to help update a policy.

An agent is simply a combination of any policy to be updated and a corresponding experience replay buffer (we call it AbstractTrajectory ). And we split the above workflow into the following stages to allow injecting different updating logics:

PreExperimentStage
PreEpisodeStage
PreActStage
PostActStage
PostEpisodeStage
PostExperimentStage

function Base.run(policy, env, stop_condition)
    while true
        reset!(env)
        policy(PRE_EPISODE_STAGE, env)
        while !is_terminated(env)
            action = policy(env)
            policy(PRE_ACT_STAGE, env, action)
            env(action)
            policy(POST_ACT_STAGE, env)
            stop_condition(policy, env) && return
        end
        policy(POST_EPISODE_STAGE, env)
    end
end

Note that by default, update!(policy::AbstractPolicy, ::AbstractStage, env) will do nothing and the workflow is simplified to the earlier version. For the specific policy Agent, it will update its internal policy and trajectory at each stage:

update!(policy, trajectory, env, stage) (Read as update! the inner policy given a trajectory, env and the current stage)
update!(trajectory, policy, env, stage) (Read as update! the inner trajectory given a policy, env and the current stage)
update!(trajectory, policy, env, ::PreActStage, action) (The action here is the one selected by the policy to be executed on the env)

By default, update!(policy, trajectory, env, stage) will do nothing. The default implementation of update!(trajectory, policy, env, stage) is explained in the 2.2 Trajectory section. But developers can customize these behaviors easily thanks to multiple dispatch.

Now the policy can play happily with the environment. We would also want to inject some meaningful actions in the meantime. For example, collecting the total reward of each episode, logging the loss of each updating, saving the policy periodically, modifying some hyperparameters on the fly and so on. So a general callback is introduced.

function Base.run(policy, env, stop_condition, hook)
    hook(PRE_EXPERIMENT_STAGE, env)
    while true
        reset!(env)
        policy(PRE_EPISODE_STAGE, env)
        hook(PRE_EPISODE_STAGE, env)
        while !is_terminated(env)
            action = policy(env)
            policy(PRE_ACT_STAGE, env, action)
            hook(PRE_ACT_STAGE, env, action)
            env(action)
            policy(POST_ACT_STAGE, env)
            hook(POST_ACT_STAGE, env)
            stop_condition(policy, env) && return
        end
        policy(POST_EPISODE_STAGE, env)
        hook(POST_EPISODE_STAGE, env)
    end
end

Except for some corner cases, the code above is very close to our implementation.

2.2 Trajectory

We use the AbstractTrajectory to describe the container used to collect intermediate data during the interactions between policy and environment. Among all the different implementations, the simplest one is Trajectory. Essentially, it's just a wrapper of a NamedTuple with some customized behaviors to avoid type piracy.

The default implementation of `VectorSARTTrajectory`.

The above figure illustrates the simplest trajectory: VectorSARTTrajectory. It contains four traces: state, action, reward and terminal. Each trace simply use a Vector as the container. By default, the trajectory is assumed to be of the SARTSA format (State, Action, Reward, Termination, next-State, next-Action). For different policies, developers may replace the inner container, add/remove traces or customize the inserting/removing/sampling strategy. All the available built-in trajectories can be found here.

2.3 Approximator

For environments with a small state space, we can use a table to record the state value, state-action value (or any other values we are interested in). While in environments with a large state space, we often use a neural network to estimate the values nowadays. To generalize these two cases, an AbstractApproximator is introduced. Basically it's just parameters (either tabular or neural network based) plus an optimizer. Another benefit of introducing such a layer is to hide the underlying DNN framework from developers so that we can easily switch from Flux.jl to Knet.jl or something else.

Algorithm Implementations

In this section, we'll give a brief introduction to different types of reinforcement learning algorithms implemented in our package. After that we'll relate these algorithms with the underlying components provided in our package. Developers should feel confident to implement new algorithms after reading this section. Note that all the code snippets here are for illustration only, they may differ from the one provided in the package.

Following the standard textbook definitions, the interactions between an agent and environment can be formalized as a fully-observable or partially-observable Markov decision process ((PO)MDP). Here we focus on the fully-observable MDP. The MDP is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, d_0, r, \gamma)$ where

$\mathcal{S}$ is a set of states $s \in \mathcal{S}$
$\mathcal{A}$ is a set of actions
$\mathcal{T}$ is a conditional probability distribution $\mathcal{T}(s_{t+1}|s_t, a_t)$
$d_0$ is the initial state distribution of $s_0$
$r$ is a reward function: $\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$
$\gamma \in (0, 1]$ is a scalar discount factor

At each time $t$ , the policy $\pi$ takes a look at the current state of the environment $s_t$ and produces an action $a_t = \pi(s_t)$ . After taking this action, we can fetch reward $r_t = r(s_t, a_t)$ , next state $s_{t+1}$ and the episode termination signal $e_t$ from the environment. The basic transition in each step is $\tau = (s_t, a_t, r_t, e_0, s_{t+1}, a_{t+1})$ , which forms the trajectory $\Tau = (s_t, a_t, r_t, e_0, ..., s_H, a_H)$ of VectorSARTTrajectory we mentioned above. The trajectory distribution $p_\pi$ for a given MDP $\mathcal{M}$ and policy $\pi$ is given by

p_\pi(\Tau) = d_0(s_0)\prod_{t=0}^H \pi(a_t|s_t) \mathcal{T}(s_{t+1}|s_t, a_t)

The only goal in reinforcement learning is to maximize some form of the expected return. And we focus on the discounted rewards below

R_t = \sum_{i \ge 0} \gamma^ir_t = r_t + \gamma R_{t+1}

3.1 State Value Function and State-Action Value function

Note that if we can estimate the state value or state-action value accurately, it will be easy to get a near-optimal policy. The state value function $V^\pi(s_t)$ estimates the return $R_t$ by following the policy $\pi$ starting from state $s_t$ :

V^\pi(s_t) = \mathbb{E}_{\Tau \sim p_\pi(\Tau | s_t)} R_t

Similarly, for state action value function $Q^\pi(s_t, a_t)$ , it estimates $R_t$ by following the policy $\pi$ starting from a state-action pair $(s_t, a_t)$ :

Q^\pi(s_t, a_t) = \mathbb{E}_{\Tau \sim p_\pi(\Tau | s_t, a_t)} R_t

We can derive the recursive definitions of these two functions:

V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t|s_t)} [Q^\pi(s_t, a_t)]

Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1}|s_t, a_t)}[V^\pi(s_{t+1})]

And combining the above two equations, we can describe $Q^\pi(s_t, a_t)$ in terms of $Q^\pi(s_{t+1}, a_{t+1})$ :

Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1}|s_t, a_t), a_{t+1} \sim \pi (a_{t+1} | s_{t+1})}[Q^\pi(s_{t+1}, a_{t+1})]

With the policy of Q-function, $\pi(a_t|s_t) = \underset{a_t}{\argmax} \ Q(s_t, a_t)$ , we have the following condition on the optimal Q-function:

Q^\star (s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1}|s_t, a_t)} \left[ \underset{a_{t+1}}{\max}\ Q^\star(s_{t+1}, a_{t+1}) \right]

3.1.1 VBasedPolicy

To estimate the state value of some specific policy $V^\pi$ , a VBasedPolicy is provided in our package:

struct VBasedPolicy{L,M} <: AbstractPolicy
    learner::L
    mapper::M
end

RLBase.plan!(p::VBasedPolicy, env::AbstractEnv) = p.mapper(env, p.learner)

Basically it contains two parts, a value learner $V$ and a mapper policy $\pi$ . And all the updating to this policy will be forwarded to the inner learner.

Monte Carlo Prediction for Estimating $V^\pi$

Now let's start from a very basic value learner, the Monte Carlo based value learner. The idea is to apply the VBasedPolicy until the end of an episode. And update the value estimation of each state with the accumulated averaging $R(s)$ . There're two steps involved in implementing such a learner:

Updating the trajectory
Updating the learner

By default, the VectorSARTTrajectory we've seen above will add all the transitions into it. But this is not what we want in the Monte Carlo based value learner here. We only need one episode actually. So we can empty the trajectory at the beginning of each episode by implementing:

RLBase.update!(
    t::AbstractTrajectory,
    ::VBasedPolicy{MCLearner},
    ::AbstractEnv,
    ::PreEpisodeStage
)
    empty!(t)
end

Next, we update the learner only at the end of an episode. So we have the following method implemented:

RLBase.update!(
    ::MCLearner,
    ::AbstractTrajectory,
    ::AbstractEnv,
    ::PostEpisodeStage
)
    # update inner tabular approximator
end

Tabular TD(0) for estimating $V^\pi$

Another approach to estimate the value function is to use the zero step temporal-difference learning. The main difference compared to the MC method above is that, now we update the learner and trajectory at the PostActStage and PostEpisodeStage.

3.1.2 QBasedPolicy

In our package, the QBasedPolicy is provided for all the general Q function based policies. It consists of a Q value learner and an explorer to select action based on the estimated Q value.

struct QBasedPolicy{L, E} <: AbstractPolicy
    learner::L
    explorer::E
end

RLBase.plan!((p::QBasedPolicy, env::AbstractEnv) = env |> p.learner |> p.explorer

Similar to the VBasedPolicy, when wrapped in an Agent, all updating to QBasedPolicy will be forwarded to the inner value learner.

TabularQLearning

For tabular Q-learning, we use a TabularQApproximator to estimate the $Q(s_t, a_t)$ . At each PostActStage, we calculate the TD error:

\delta = r_t + \gamma \max_{a_{t+1} \sim \pi(a_{t+1}|s_{t+1})} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)

And use it to update the inner TabularApproximator. From the API implementation level, if the QBasedPolicy{<:TabularQLearner} is used together with a VectorSARTTrajectory like we mentioned above, then we should remember to empty the trajectory at the end of each episode.

BasicDQN

To tackle problems with a large state space, a neural network $Q_\phi$ is used to estimate the $Q^\pi$ where $\phi$ is the parameters. To optimize the parameters, we can optimize the loss function of $L_i(\phi_i)$ at each PreActStage $i$ :

L_i(\phi_i) = \mathbb{E}_{s, a \sim \Tau} \left[ (y_i(s, a) - Q_{\phi_i}(s, a))^2 \right]

where $y_i$ is a bootstrap target:

y_i(s, a) = \mathbb{E}_{s\prime \sim \mathcal{T}(s, a)}[ r + \gamma \max_{a\prime \sim \pi(s\prime)} Q_{\phi_{i-1}}(s\prime, a\prime)]

Compared to TabularQLearning, we need the following modifications:

A CircularArraySARTTrajectory is used instead of VectorSARTTrajectory
The TabularApproximator is replaced with NeuralNetworkApproximator

Since DQN and its variants usually need to sample minibatches from a trajectory, inspired by adders and inserters in Acme, we also exported some AbstractSampler . Users may refer the source code for some extra enhancements like double-dqn and n-step dqn.

Prioritized DQN

The biggest challenge to implement Prioritized DQN is to create a new Trajectory. Fortunately, adding a new trace is pretty simple given that a Trajectory is merely a wrapper of a NamedTuple. To make sure each transition is inserted correctly, developers should carefully examine all the default implementations of update!(trajectory, policy, env, stage) and write your more specific ones when needed. For Prioritized DQN, we need to insert a default priority at the PostActStage. When updating the inner learner, we also need to update the priority trace.

IQN

For distributional reinforcement learning algorithms like C51 and IQN, the estimation is not the Q-values directly. But we can still fit them into the QBasedPolicy. The main change is to override the default implementation of RLBase.plan!((p::QBasedPolicy, env::AbstractEnv) to use the Q value distribution.

3.2 Policy Gradient

This article already gives a detailed explanation to a lot of policy gradient algorithms. Here we only describe some notable points to implement them. Some of the implemented policies (for example PPO) are very interesting and worth explaining them in detail in another blog later.

MultiThreadEnv

Except for some Monte Carlo based policy gradient algorithms (like Vanilla Policy Gradient), we usually roll out several simulations simultaneously and collect the transitions every $n$ steps to train the policies. So this special environment wrapper is introduced to run several copies of environments by leveraging multi-threading. This environment breaks the workflow we defined before because is_terminated(::MultiThreadEnv) doesn't return a bool value any more. Also, the PreEpisodeStage and PostEpisodeStage are not valid. So a more specific implementation is defined.

function RLCore._run(
    policy::AbstractPolicy,
    env::MultiThreadEnv,
    stop_condition,
    hook::AbstractHook,
)

    while true
        action = policy(env)
        policy(PRE_ACT_STAGE, env, action)
        hook(PRE_ACT_STAGE, policy, env, action)

        env(action)

        policy(POST_ACT_STAGE, env)
        hook(POST_ACT_STAGE, policy, env)

        if stop_condition(policy, env)
            break
        end
    end
end

And the corresponding updating to the trajectory is also provided by default.

3.3 Offline Reinforcement Learning

In offline reinforcement learning, we often assume the experience is prepared ahead. To adapt some of the above algorithms in the offline setting, we need to provide a more specific implementation of update!(policy, batch) and reuse it in update!(policy, trajectory, env stage) . For new offline algorithms, we only need to implement the following two methods:

RLBase.plan!((p::YourOfflinePolicy, env::AbstreactEnv)
update!(p::YourOfflinePolicy, batch)

3.4 Multi-Agent Reinforcement Learning

In our initial workflow, there's only one agent interacting with the environment. To expand it to the multi-agent setting, a policy wrapper of MultiAgentPolicy and MultiAgentHook is added. At each stage, it fetches necessary information and forward the env to its children. Then based on the current player of the env, it selects the right child and generate an action properly.

There are two MultiAgent cases, Sequential and Simultaneous. For Sequential environments, RLBase.next_player! and current_player must be implemented so that the Base.run loop knows the order of play. For Simultaneous, the working assumption is that all players provided in MultiAgentPolicy play every turn. Two basic examples are provided, TicTacToeEnv and RockPaperScissorsEnv.

Keep interfaces simple and minimal

Adding new APIs is very cheap, but soon you will be the only one who knows how to use them. Keeping APIs simple and minimal will force you rethink your existing design and come up with a more natural one. Actually, the multi-dispatch in Julia encourages you to generalize the interfaces as much as possible.
Design early and refactor frequently

Once started coding, things will evolve very quickly. Focusing on new features only will introduce a lot of ad-hoc workarounds on the old design. To keep the system flexible, always remember to reserve some time for refactoring.
”Premature optimization is the root of all evil!“

In most cases, the overall simplicity is more important than speed. With multiple dispatch, it's always relatively easy to come up with a more efficient implementation without loss of generality.
Maintain a set of reproducible experiments

For experiments that can be finished in minutes, we should add them into tests and make sure they produce the same result after each PR. For large experiments, it's a good practice to run them after each patch release.

5.2 Looking Forward

We've witnessed the fast development of the reinforcement learning field in the last several years. A lot of interesting algorithms are not included yet so contributions are always welcomed.

Corrections

If you see mistakes or want to suggest changes, please create an issue in the source repository.