General Pipeline for Offline Reinforcement Learning Evaluation Report

This is the technical report of the Summer OSPP project "Establish a General Pipeline for Offline Reinforcement Learning Evaluation", used for the final term evaluation. It provides an overview of the work done during the mid-term and final evaluation phases.


Table of contents

  1. Introduction
    1. Project Name
    2. Background
    3. Project Overview
      1. Objectives
    4. Time Planning
    5. Datasets
      1. Documentation
    6. Installation details
    7. D4RL
    8. d4rl-pybullet
    9. Google Research Atari DQN Replay Datasets
    10. RL Unplugged Atari Dataset
    11. Relevant commits, discussions and PRs
    12. Implementation Details and Challenges Faced
      1. Implementation details
        1. Directory Structure
    13. D4RL Datasets implementation
    14. RL Unplugged Atari
  2. Technical report (final term evaluation)
    1. Summary
    2. Completed Work
      1. Bsuite Datasets
        1. Ring Buffer
    3. Implementation
    4. Working
    5. DM Datasets
      1. Example of type handling used.
    6. Usage
    7. Deep OPE
      1. Implementation
    8. Working
    9. FQE
      1. Implementation
    10. Results
      1. Parameter Values
    11. Evaluation Results
    12. Actual Values
    13. Relevant Commits and PRs
    14. Conclusion
      1. Implications
    15. Future Scope

Introduction

Project Name

Establish a General Pipeline for Offline Reinforcement Learning Evaluation

Background

In recent years, there have been several breakthroughs in the field of Reinforcement Learning, with numerous practical applications where RL bots have been able to achieve superhuman performance. This is also reflected in industry, where several cutting-edge solutions based on RL have been developed (Tesla Motors, AutoML, and DeepMind's data center cooling solutions, to name a few).

One of the most prominent challenges in RL is the lack of reliable environments for training RL agents. Offline RL has played a pivotal role in solving this problem by removing the need for the agent to interact with the environment to improve its policy over time. This brings forth the problem of not having reliable tests to verify the performance of RL algorithms. Such tests are facilitated by standard datasets (RL Unplugged, D4RL and An Optimistic Perspective on Offline Reinforcement Learning) that are used to train Offline RL agents and benchmark them against other algorithms and implementations. ReinforcementLearningDatasets.jl provides a simple way to access the various standard datasets that are available for Offline RL benchmarking across a variety of tasks.

Another problem in Offline RL is Offline Model Selection. For this, there are numerous policies that are available in Benchmarks for Deep Off-Policy Evaluation. ReinforcementLearningDatasets.jl will also help in loading policies that will aid in model selection in ReinforcementLearning.jl package.

Project Overview

Objectives

Create a package called ReinforcementLearningDatasets.jl that aids in loading the various standard datasets and policies that are available. The currently supported datasets are described in the Datasets section below.

Make the standard policies from Benchmarks for Deep Off-Policy Evaluation available in RLDatasets.jl.

Implement an Off-Policy Evaluation (OPE) method and use it, together with RLDatasets.jl, to select between a number of standard policies for a particular task.

Future work that is possible in this project is described in the Future Scope section.

Refer to the discussion for more ideas.

Time Planning

Date            Goals
07/01 - 07/14   Brainstormed various ideas for the implementation of RLDatasets.jl and finalized the key features.
07/15 - 07/20   Made a basic Julia wrapper for d4rl environments and added some tests.
07/21 - 07/30   Implemented d4rl and d4rl-pybullet datasets.
07/31 - 08/06   Implemented Google Research DQN Replay Datasets.
08/07 - 08/14   Implemented RL Unplugged Atari datasets, set up the docs, added a README.md, and made the package more user friendly. Wrote the mid-term report.
08/15 - 08/30   Added bsuite datasets, polished the interface, finalized the structure of the codebase, and fixed a problem with Windows.
09/01 - 09/15   Added support for policy loading from Benchmarks for Deep Off-Policy Evaluation.
09/16 - 09/30   Researched OPE methods, implemented FQE, and tested its basic performance. Completed the final-term report.

There were some changes to the original timeline due to time constraints, but the basic objectives of the project have been accomplished.

Datasets

Documentation

The documentation for this package is available in RLDatasets.jl. Do check it out for more details.

Installation details

To install the ReinforcementLearningDatasets.jl package, use the following command in Julia's pkg mode.

pkg> add https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl:src/ReinforcementLearningDatasets

D4RL

Added support for D4RL datasets with all features loaded in the returned type.

Credits: D4RL

using ReinforcementLearningDatasets
ds = dataset(
        "hopper-medium-replay-v0";
        repo="d4rl")

The type (D4RLDataSet) returned by dataset is an iterator that returns batches of data according to the specified requirements.

Now you can take batches from ds or iterate over it.

julia> batches = Iterators.take(ds, 2)
D4RLDataSet{StableRNGs.LehmerRNG}(Dict{Symbol, Any}(:reward => Float32[0.9236555, 0.8713692, 0.92237693, 0.9839225, 0.91540813, 0.8331875, 0.8102179, 0.78385466, 0.7304337, 0.6942671 … 5.0350657, 5.005931, 4.998442, 4.986662, 4.9730926, 4.9638906, 4.9503803, 4.9326644, 4.8952913, 4.8448896], :state => Float32[1.2521756 1.2519351 … 0.72994494 0.7145643; 0.00026937472 -0.0048946342 … 0.13946348 0.15210924; … ; 0.002733759 -1.1853988 … -0.06101464 -0.045892276; -0.0028058232 0.08466121 … -1.4235892 -1.0558393], :action => Float32[-0.67060924 -0.39061046 … -0.15234122 -0.1382414; -0.9329903 0.65977097 … 0.9518685 0.9666188; 0.010210991 -0.073685646 … 0.24721281 -0.2440847], :terminal => Int8[0, 0, 0, 0, 0, 0, 0, 0, 0, 0 … 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]), "d4rl", 200919, 256, (:state, :action, :reward, :terminal, :next_state), StableRNGs.LehmerRNG(state=0x000000000000000000000000000000f7), Dict{String, Any}("timeouts" => Int8[0, 0, 0, 0, 0, 0, 0, 0, 0, 0 … 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), true)

julia> typeof(batches)
Base.Iterators.Take{D4RLDataSet{StableRNGs.LehmerRNG}}

julia> batch = collect(batches)[1]
NamedTuple{(:state, :action, :reward, :terminal, :next_state), Tuple{Matrix{Float32}, Matrix{Float32}, Vector{Float32}, Vector{Int8}, Matrix{Float32}}}

julia> size(batch[:state])
(11, 256)

d4rl-pybullet

Added support for datasets released in d4rl-pybullet. This enables testing agents in complex environments without a MuJoCo license.

Credits: d4rl-pybullet

using ReinforcementLearningDatasets
ds = dataset(
        "hopper-bullet-mixed-v0";
        repo="d4rl-pybullet",
    )
samples = Iterators.take(ds, 2)

The output is similar to D4RL.

Google Research Atari DQN Replay Datasets

Added support for Google Research Atari DQN Replay Datasets. Currently, the datasets are loaded directly into RAM, so it is advised to use them only on machines with a sufficient amount of RAM (around 20 GB of free space). Support for lazy, parallel loading into a Channel will be added soon.

Credits: DQN Replay Datasets

using ReinforcementLearningDatasets
ds = dataset(
        "pong",
        1,
        [1, 2]
    )
samples = Iterators.take(ds, 2)

The output is similar to D4RL.

RL Unplugged Atari Dataset

Added support for RL Unplugged Atari datasets. The datasets, which are stored as .tfrecord files, are fetched into Julia. Lazy loading with multi-threading is implemented. This implementation is based on previous work in TFRecord.jl.

Credits: RL Unplugged

using ReinforcementLearningDatasets
ds = rl_unplugged_atari_dataset(
        "Pong",
        1,
        [1, 2]
    )

The type returned is a Channel{AtariRLTransition}, which returns batches with the given specifications from the buffer when take! is used. Note that it takes only a few seconds to load the datasets into the Channel, and the loading is highly customizable.

julia> ds = rl_unplugged_atari_dataset(
               "Pong",
               1,
               [1, 2]
           )
[ Info: Loading the shards [1, 2] in 1 run of Pong with 4 threads
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:08
Channel{ReinforcementLearningDatasets.AtariRLTransition}(12) (12 items available)

It also supports lazy downloading of the datasets based on the shards that are required by the user. In this case, only gs://rl_unplugged/atari/Pong/atari_Pong_run_1-00001-of-00100 and gs://rl_unplugged/atari/Pong/atari_Pong_run_1-00002-of-00100 will be downloaded, with permission from the user. If the dataset is already present, it is located using DataDeps.jl.

The time taken to load a batch is also minimal.

julia> @time batch = take!(ds)
0.000011 seconds (1 allocation: 80 bytes)

julia> typeof(batch)
ReinforcementLearningDatasets.AtariRLTransition

julia> typeof(batch.state)
Array{UInt8, 4}

julia> size(batch.state)
(84, 84, 4, 256)

julia> size(batch.reward)
(256,)

Relevant commits, discussions and PRs

Implementation Details and Challenges Faced

The challenge faced during the first week was to chart out a direction for RLDatasets.jl. I researched the implementations of similar pipelines in d3rlpy, TF.data.Dataset, etc., and then narrowed down some inspiring ideas in the discussion.

Later, I implemented it as a wrapper around the d4rl Python library, which was discarded because it did not align with the goal of keeping the library lightweight and not requiring a MuJoCo license for the open source datasets. A wrapper would also not give the fine-grained control that we get by loading the datasets natively.

We decided to use DataDeps.jl for registering, tracking and locating datasets without any hassle. DataDeps.jl is a package that helps make data wrangling code more reusable and was crucial in making RLDatasets.jl seamless.

What I learnt here was how to make a package, manage its dependencies, and choose which package would be the right fit for the job. I also learnt about iterator interfaces in Julia, which are used to convert the type output by the dataset function into an iterator. d4rl-pybullet was implemented in a similar fashion.

Implementing the Google Research Atari DQN Replay Datasets was harder because it is quite a large dataset and even one shard did not fit into the memory of my machine. I also had to figure out how the data was stored and how to retrieve it. Initially, I planned to use GZip.jl to unpack the gzip files and NPZ.jl to read them. Since NPZ did not support reading from a GZipStream by itself, I had to adapt the functions in NPZ to read the stream. Later, we decided to use CodecZlib to get a decompressed buffer that is natively supported by NPZ. We also had to test it internally and skip the CI test because CI would not be able to handle the dataset. Exploring the possibility of lazy loading of the available files and enabling it is also within the scope of the project.
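
To illustrate the final approach, here is a minimal sketch (not the package's exact code) of reading one gzipped .npy shard: the bytes are decompressed with CodecZlib and handed to NPZ's single-array reader. The function name and the path in the comment are hypothetical.

using CodecZlib, NPZ

# Decompress the whole gzipped shard into memory and let NPZ parse the
# resulting .npy bytes.
function read_gzipped_npy(path::String)
    compressed = read(path)                          # raw .gz bytes
    bytes = transcode(GzipDecompressor, compressed)  # decompressed .npy bytes
    NPZ.npzreadarray(IOBuffer(bytes))                # NPZ's single-array reader
end

# arr = read_gzipped_npy("path/to/observation_ckpt.0.gz")  # hypothetical path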

To support the RL Unplugged dataset, I had to learn about .tfrecord files, Protocol Buffers, buffered Channels, and Julia multi-threading, which was used on many occasions. It took some time to grasp all the concepts, but the final implementation was based on already existing work in TFRecord.jl.

All of this work wouldn't have been possible without the patient mentoring and vast knowledge of my mentor Jun Tian, who has been pivotal in the design and implementation of the package. His extensive experience and beautifully written code have provided a lot of inspiration for this package. His amicable nature and his commitment to the users of the package, providing timely and detailed explanations to any issues or queries despite his time constraints, set a long-standing example both as a developer and as a person. I also thank all the developers of the packages that RLDatasets.jl depends upon.

Implementation details

Directory Structure

The src directory hosts the working logic of the package.

src
├─ ReinforcementLearningDatasets.jl
├─ atari
│  ├─ atari_dataset.jl
│  └─ register.jl
├─ common.jl
├─ d4rl
│  ├─ d4rl
│  │  └─ register.jl
│  ├─ d4rl_dataset.jl
│  └─ d4rl_pybullet
│     └─ register.jl
├─ init.jl
└─ rl_unplugged
   ├─ atari
   │  ├─ register.jl
   │  └─ rl_unplugged_atari.jl
   └─ util.jl

The directory for each dataset consists of two files: register.jl, which registers the required DataDeps, and another file that is responsible for loading the dataset. The registration functions are called in the package's __init__ so that the DataDeps are registered as soon as the package is imported.

function __init__()
    RLDatasets.d4rl_init()
    RLDatasets.d4rl_pybullet_init()
    RLDatasets.atari_init()
    RLDatasets.rl_unplugged_atari_init()
end

D4RL Datasets implementation

The register.jl for the d4rl datasets is located in src/d4rl/d4rl and registers the DataDeps. The following is an example of the registration code.

function d4rl_init()
    repo = "d4rl"
    for ds in keys(D4RL_DATASET_URLS)
        register(
            DataDep(
                repo*"-"* ds,
                """
                Credits: https://arxiv.org/abs/2004.07219
                The following dataset is fetched from the d4rl. 
                The dataset is fetched and modified in a form that is useful for RL.jl package.
                
                Dataset information: 
                Name: $(ds)
                $(if ds in keys(D4RL_REF_MAX_SCORE) "MAXIMUM_SCORE: " * string(D4RL_REF_MAX_SCORE[ds]) end)
                $(if ds in keys(D4RL_REF_MIN_SCORE) "MINIMUM_SCORE: " * string(D4RL_REF_MIN_SCORE[ds]) end) 
                """, #check if the MAX and MIN score part is even necessary and make the log file prettier
                D4RL_DATASET_URLS[ds],
            )
        )
    end
    nothing
end

The dataset is loaded using ReinforcementLearningDatasets/src/d4rl/d4rl_dataset.jl and is enclosed in a D4RLDataSet type.

struct D4RLDataSet{T<:AbstractRNG} <: RLDataSet
    dataset::Dict{Symbol, Any}
    repo::String
    dataset_size::Integer
    batchsize::Integer
    style::Tuple
    rng::T
    meta::Dict
    is_shuffle::Bool
end

The dataset function is used to retrieve the files.

function dataset(dataset::String;
    style=SARTS,
    repo = "d4rl",
    rng = StableRNG(123), 
    is_shuffle = true, 
    batchsize=256
)

The dataset is downloaded if it is not already present, and it is then located in the local file system using DataDeps.jl.

try 
    @datadep_str repo*"-"*dataset 
catch 
    throw("The provided dataset is not available") 
end
    
path = @datadep_str repo*"-"*dataset 

@assert length(readdir(path)) == 1
file_name = readdir(path)[1]

data = h5open(path*"/"*file_name, "r") do file
    read(file)
end

The dataset is wrapped in the D4RLDataSet iterator and returned. The iteration logic is implemented in the same file using Julia's iterator interface.
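
For illustration, the shuffled iteration could look roughly like the following simplified sketch. This is not the exact code in d4rl_dataset.jl; only the SART part of the batch is shown and the next_state handling is omitted.

# Simplified sketch: each iteration samples `batchsize` random indices
# and gathers the corresponding fields into a NamedTuple batch.
function Base.iterate(ds::D4RLDataSet, state = 0)
    inds = rand(ds.rng, 1:ds.dataset_size, ds.batchsize)
    batch = (
        state = ds.dataset[:state][:, inds],
        action = ds.dataset[:action][:, inds],
        reward = ds.dataset[:reward][inds],
        terminal = ds.dataset[:terminal][inds],
    )
    return batch, state + 1
end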

RL Unplugged Atari

Here are some of the interesting pieces of code used in loading the RL Unplugged dataset.

Multi-threaded iteration over a Channel{Example} to put! into another Channel{AtariRLTransition}.

ch_src = Channel{AtariRLTransition}(n * tf_reader_sz) do ch
    for fs in partition(shuffled_files, n)
        Threads.foreach(
            TFRecord.read(
                fs;
                compression=:gzip,
                bufsize=tf_reader_bufsize,
                channel_size=tf_reader_sz,
            );
            schedule=Threads.StaticSchedule()
        ) do x
            put!(ch, AtariRLTransition(x))
        end
    end
end

Multi-threaded batching using a parallel loop where each thread loads batches into a Channel{AtariRLTransition}.

res = Channel{AtariRLTransition}(n_preallocations; taskref=taskref, spawn=true) do ch
    Threads.@threads for i in 1:batchsize
        put!(ch, deepcopy(batch(buffer_template, popfirst!(transitions), i)))
    end
end

Technical report (final term evaluation)

The following is the final term evaluation report of the OSPP project "General Pipeline for Offline Reinforcement Learning Evaluation". It details the work that has been done after the mid-term evaluation, explains the current status of the package, and outlines some exciting work that is possible based on this project.

Summary

Completed Work

The following work has been done post mid-term evaluation.

Bsuite Datasets

This involved work similar to the RL Unplugged Atari datasets, which use multi-threaded data loading. It is implemented using a RingBuffer for storing and loading batches of data.

Ring Buffer

Huge thanks to Jun Tian for the implementation.

The RingBuffer is based on two Channels: buffers holds empty batches that can later be filled with data, and results holds the batches that have been filled. The current field holds the latest result.

mutable struct RingBuffer{T} <: AbstractChannel{T}
    buffers::Channel{T}
    current::T
    results::Channel{T}
end

The RingBuffer is created using the following code. It fills the buffers Channel with empty batches, which are then filled in place by f! and put into results.

function RingBuffer(f!, buffer::T;sz=Threads.nthreads(), taskref=nothing) where T
    buffers = Channel{T}(sz)
    for _ in 1:sz
        put!(buffers, deepcopy(buffer))
    end
    results = Channel{T}(sz, spawn=true, taskref=taskref) do ch
        Threads.foreach(buffers;schedule=Threads.StaticSchedule()) do x
        # for x in buffers
            f!(x)  # in-place operation
            put!(ch, x)
        end
    end
    RingBuffer(buffers, buffer, results)
end

Whenever a batch is taken from the buffer, the following code gets called.

function Base.take!(b::RingBuffer)
    put!(b.buffers, b.current)
    b.current = take!(b.results)
    b.current
end

I would have implemented a simpler single-channel buffer, but the RingBuffer proved to be more effective.
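
To make the data flow concrete, here is a toy usage example (not from the package) where each batch is simply a vector that gets filled with random numbers in place:

using Random

# Each "batch" is a Vector{Float64}; the do-block is the in-place f!.
buffer = zeros(8)
rb = RingBuffer(buffer) do x
    rand!(x)          # in-place work that would normally fill a batch with data
end

b1 = take!(rb)        # a filled batch
b2 = take!(rb)        # the previous buffer is recycled behind the scenes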

Implementation

The files are read and the datapoints are put in a Channel.

ch_src = Channel{BSuiteRLTransition}(n * tf_reader_sz) do ch
    for fs in partition(files, n)
        Threads.foreach(
            TFRecord.read(
                fs;
                compression=:gzip,
                bufsize=tf_reader_bufsize,
                channel_size=tf_reader_sz,
            );
            schedule=Threads.StaticSchedule()
        ) do x
            put!(ch, BSuiteRLTransition(x, game))
        end
    end
end

The datapoints are then batched into a RingBuffer, which is returned.

res = RingBuffer(buffer;taskref=taskref, sz=n_preallocations) do buff
    Threads.@threads for i in 1:batchsize
        batch!(buff, take!(transitions), i)
    end
end

Working

The bsuite_params function can be used to get the possible arguments that can be passed to rl_unplugged_bsuite_dataset.

julia> bsuite_params()
┌ Info: ["cartpole", "catch", "mountain_car"]
│   shards = 0:4
└   type = 3-element Vector{String}: …

To get the dataset, the rl_unplugged_bsuite_dataset function can be called.

julia> rl_unplugged_bsuite_dataset("cartpole", [1], "full")
Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:06
RingBuffer{ReinforcementLearningDatasets.BSuiteRLTransition}(Channel{ReinforcementLearningDatasets.BSuiteRLTransition}(12), ReinforcementLearningDatasets.BSuiteRLTransition(Float32[1.0f-45 1.0f-45 … NaN 0.0; 0.0 0.0 … NaN 0.0; … ; 3.0f-45 4.0f-45 … NaN 9.0f-44; 0.0 0.0 … NaN 0.0], [140269590665616, 140269590665616, 140269590665616, 
...
julia> take!(ds)
ReinforcementLearningDatasets.BSuiteRLTransition(Float32[-0.3289344 0.26131696 … -0.015311318 0.49089232; 0.31783995 -1.2033445 … 0.04303875 -0.24614102; … ; -0.27250051 1.0421202 … -0.17690773 0.2694671; 0.697 0.805 … 0.009 0.955], [0, 0, 0, 0, 0, 1, 1, 0, 0, 1 … 2, 0, 1, 0, 2, 0,
...

DM Datasets

The DM datasets load and work similarly to the bsuite datasets. Since I made one file to manage DM Control, DM Lab and DM Locomotion, there was a lot of post-processing work to handle all the edge cases presented by each dataset.

The types also had to be created based on the individual datasets so that the loading code is efficient.

Example of type handling used.

function make_transition(example::TFRecord.Example, feature_size::Dict{String, Tuple})
    f = example.features.feature
    
    observation_dict = Dict{Symbol, AbstractArray}()
    next_observation_dict = Dict{Symbol, AbstractArray}()
    transition_dict = Dict{Symbol, Any}()

    for feature in keys(feature_size)
        if split(feature, "/")[1] == "observation"
            ob_key = Symbol(chop(feature, head = length("observation")+1, tail=0))
            if split(feature, "/")[end] == "egocentric_camera"
                cam_feature_size = feature_size[feature]
                ob_size = prod(cam_feature_size)
                observation_dict[ob_key] = reshape(f[feature].bytes_list.value[1][1:ob_size], cam_feature_size...)
                next_observation_dict[ob_key] = reshape(f[feature].bytes_list.value[1][ob_size+1:end], cam_feature_size...)
            else
                if feature_size[feature] == ()
                    observation_dict[ob_key] = f[feature].float_list.value
                else
                    ob_size = feature_size[feature][1]
                    observation_dict[ob_key] = f[feature].float_list.value[1:ob_size]
                    next_observation_dict[ob_key] = f[feature].float_list.value[ob_size+1:end]
                end
            end
        elseif feature == "action"
            ob_size = feature_size[feature][1]
            action = f[feature].float_list.value
            transition_dict[:action] = action[1:ob_size]
            transition_dict[:next_action] = action[ob_size+1:end]
        elseif feature == "step_type"
            transition_dict[:terminal] = f[feature].float_list.value[1] == 2
        else
            ob_key = Symbol(feature)
            transition_dict[ob_key] = f[feature].float_list.value[1]
        end
    end
    state_nt = (state = NamedTuple(observation_dict),)
    next_state_nt = (next_state = NamedTuple(next_observation_dict),)
    transition = NamedTuple(transition_dict)

    merge(transition, state_nt, next_state_nt)
end

The batch is then made based on the internal types available for the specific dataset.
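
As a rough illustration of that idea (the package batches in place into preallocated buffers with batch!, so this collect-and-stack version with assumed helper names is only a simplified sketch):

# Stack a vector of transitions into a batch NamedTuple, key by key.
stack_field(xs::Vector{<:Number}) = xs                        # scalars -> Vector
stack_field(xs::Vector{<:AbstractVector}) = reduce(hcat, xs)  # vectors -> Matrix
stack_field(xs::Vector{<:NamedTuple}) = make_batch(xs)        # nested observations, e.g. :state

function make_batch(transitions::Vector{<:NamedTuple})
    ks = keys(first(transitions))
    NamedTuple{ks}(map(k -> stack_field([t[k] for t in transitions]), ks))
end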

Usage

julia> rl_unplugged_dm_dataset("fish_swim", [1]; type="dm_control_suite")
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:02
RingBuffer{NamedTuple{(:reward, :episodic_reward, :discount, :state, :next_state, :action, :next_action, :terminal), Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}, NamedTuple{(:joint_angles, :upright, :target, :velocity), NTuple{4, Matrix{Float32}}}, NamedTuple{(:joint_angles, :upright, :target, :velocity), NTuple{4, Matrix{Float32}}}, Matrix{Float32}, Matrix{Float32}, Vector{Bool}}}}(Channel{NamedTuple{(:reward, :episodic_reward, :discount, :state, :next_state, :action, :next_action, :terminal), Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}, NamedTuple{(:joint_angles, :upright, :target, :velocity), NTuple{4, Matrix{Float32}}}, NamedTuple{(:joint_angles, :upright, :target, :velocity), NTuple{4, Matrix{Float32}}}, Matrix{Float32}, Matrix{Float32}, Vector{Bool}}}}(12), (reward = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -3.4009038f-36, 4.5764f-41 … 0.0, 0.0, 2.3634088f-28, 4.5765f-41, 1.1692183f-34, 4.5764f-41, 0.0, 0.0,
...

Deep OPE

Support is given for D4RL policies provided in Deep OPE.

Implementation

The policies given here are loaded using the d4rl_policy function.

The policies are loaded into a D4RLGaussianNetwork, which will be integrated with the GaussianNetwork in RLCore soon.

Base.@kwdef struct D4RLGaussianNetwork{P,U,S}
    pre::P = identity
    μ::U
    logσ::S
end

The network returns the action a and the mean μ for the state that is passed into it.

function (model::D4RLGaussianNetwork)(
    state::AbstractArray;
    rng::AbstractRNG=MersenneTwister(123), 
    noisy::Bool=true
)
    x = model.pre(state)
    μ, logσ = model.μ(x), model.logσ(x)
    if noisy
        a = μ + exp.(logσ) .* Float32.(randn(rng, size(μ)))
    else
        a = μ + exp.(logσ)
    end
    a, μ
end

The weights for each layer are read from the downloaded policy files and loaded into the network.

To check the real-life performance of the networks, an auxiliary function deep_ope_d4rl_evaluate is also provided, which produces a unicode plot showing the performance of the policy. The code is given here.

Working

The params needed for loading the policies can be obtained using d4rl_policy_params.

julia> d4rl_policy_params()
┌ Info: Set(["relocate", "maze2d_large", "antmaze_umaze", "hopper", "pen", "antmaze_medium", "walker", "hammer", "antmaze_large", "maze2d_umaze", "maze2d_medium", "ant", "door", "halfcheetah"])
│   agent = 2-element Vector{String}: …
└   epoch = 0:10

Sometimes the expected policy may not be available. So, it is always better to check which ones are available using ReinforcementLearningDatasets.D4RL_POLICIES.

julia> policy = d4rl_policy("hopper", "online", 3)
D4RLGaussianNetwork{Flux.Chain{Tuple{Flux.Dense{typeof(NNlib.relu), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(NNlib.relu), Matrix{Float32}, Vector{Float32}}}}, Flux.Chain{Tuple{Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Flux.Chain{Tuple{Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}(Chain(Dense(11, 256, relu), Dense(256, 256, relu)), Chain(Dense(256, 3)), Chain(Dense(256, 3)))

The policy will return a and μ for a given state.
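
For example, with the hopper policy loaded above (the random state is purely illustrative; 11 is the hopper observation dimension visible in the network summary):

state = rand(Float32, 11, 1)           # one illustrative observation
a, μ = policy(state)                   # sampled action and mean
a_det, _ = policy(state; noisy=false)  # variant without sampling noise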

deep_ope_d4rl_evaluate is a helper function that visualizes the performance of the agent. The higher the epoch number, the better the performance.

julia> deep_ope_d4rl_evaluate("halfcheetah", "online", 3)
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:01
                    halfcheetah-medium-v0 scores
              +----------------------------------------+       
         8100 |                                       .| scores
              |                                      .'|       
              |        .                           .'  |       
              |      .' :                         .'   |       
              |    .'   :       :.               :     |       
              |  .'      :     .' :            .'      |       
              |.'        '.    :   :          .'       |       
   score      |           :   :     '.       .'        |       
              |           '. .'      '.      :         |       
              |            : :        :     :          |       
              |            ':          :   :           |       
              |                         : .'           |       
              |                         '.:            |       
              |                          '             |       
         7400 |                                        |       
              +----------------------------------------+       
              1                                       10
                               episode
julia> deep_ope_d4rl_evaluate("halfcheetah", "online", 10)
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:01
                     halfcheetah-medium-v0 scores
               +----------------------------------------+       
         12000 |                 .                      | scores
               |''''..........''''.       :.       :''''|       
               |                  :       :'.     :     |       
               |                  :      .' '.   :      |       
               |                  '.     :   :  :       |       
               |                   :     :    ::        |       
               |                   :    .'     '        |       
   score       |                   '.   :               |       
               |                    :   :               |       
               |                    :  .'               |       
               |                    '. :                |       
               |                     : :                |       
               |                     :.'                |       
               |                     ::                 |       
          2000 |                      :                 |       
               +----------------------------------------+       
               1                                       10
                                episode

FQE

A major amount of time was spent researching OPE methods, of which FQE was the most appropriate given that the use case is deep reinforcement learning.

Batch Policy Learning under Constraints introduces FQE and uses it for offline reinforcement learning under constraints, achieving remarkable results by calculating new constraint cost functions from the datasets. The algorithm implemented in RLZoo is similar to the one proposed there.

The implementation in RLZoo is based on Hyperparameter Selection for Offline Reinforcement Learning, which is very similar to the algorithm discussed above. The paper uses OPE as a method for offline hyperparameter selection.

The average of the values calculated by FQE on initial states can be taken as an estimate of the return that the policy would gain from the environment, so it can be used for hyperparameter selection without interacting with the environment.

The pseudocode for the implementation and the objective function are as follows.
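
Since the original pseudocode and objective are not reproduced here, the standard FQE objective, which is consistent with the update code shown below, can be written as: at each iteration k, fit

Q_{k+1} = \arg\min_{Q} \; \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( Q(s, a) - r - \gamma \, (1 - d) \, Q_k(s', \pi(s')) \big)^2 \Big]

where 𝒟 is the offline dataset, π is the policy being evaluated, and d is the terminal flag. The policy value is then estimated by averaging Q over N initial states,

\hat{V}^{\pi} \approx \frac{1}{N} \sum_{i=1}^{N} Q\big(s_0^{(i)}, \pi(s_0^{(i)})\big)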

Implementation

The parameters of the implementation are held in the FQE learner struct.

mutable struct FQE{
    P<:GaussianNetwork,
    C<:NeuralNetworkApproximator,
    C_T<:NeuralNetworkApproximator,
    R<:AbstractRNG,
 } <: AbstractLearner
    policy::P
    q_network::C
    target_q_network::C_T
    n_evals::Int
    γ::Float32
    batchsize::Int
    update_freq::Int
    update_step::Int
    tar_update_freq::Int
    rng::R
    #logging
    loss::Float32
end

The update function for the learner is a simple critic update based on the following procedure.

function RLBase.update!(l::FQE, batch::NamedTuple{SARTS})
    policy = l.policy
    Q, Qₜ = l.q_network, l.target_q_network

    D = device(Q)
    s, a, r, t, s′ = (send_to_device(D, batch[x]) for x in SARTS)
    γ = l.γ
    batchsize = l.batchsize

    loss_func = Flux.Losses.mse

    q′ = Qₜ(vcat(s′, policy(s′)[1])) |> vec

    target = r .+ γ .* (1 .- t) .* q′

    gs = gradient(params(Q)) do
        q = Q(vcat(s, reshape(a, :, batchsize))) |> vec
        loss = loss_func(q, target)
        Zygote.ignore() do
            l.loss = loss
        end
        loss
    end
    Flux.Optimise.update!(Q.optimizer, params(Q), gs)

    if l.update_step % l.tar_update_freq == 0
        # update the target network on the learner
        # (rebinding the local Qₜ would have no effect)
        l.target_q_network = deepcopy(Q)
    end
end

Results

The implementation is still a work in progress because of a sampling error, but the algorithm that I implemented outside the RL.jl framework works as expected.

Parameter Values
Evaluation Results

The values evaluated by FQE for 100 initial states.

mean=-243.0258f0

Actual Values

The values obtained by running the agent in the environment for 100 iterations.

mean=-265.7068139137983

Relevant Commits and PRs

Conclusion

The foundations of the RLDatasets.jl package have been laid during the course of the project. All of the basic datasets, except for the Real World Datasets from RL Unplugged, are supported. Furthermore, D4RL policies have been successfully loaded and tested. The algorithm for FQE has been tried out, with a minor implementation detail pending.

With the completion of FQE, the four requirements of OPE as laid out by Deep OPE will be fulfilled for D4RL.

Implications

Equipping RL.jl with RLDatasets.jl is a key step in making the package more industry relevant, because different offline algorithms can be compared against a variety of standard offline dataset benchmarks. It is also meant to improve the implementations of existing offline algorithms and bring them on par with the SOTA implementations. This package provides a seamless way of downloading and accessing existing datasets and supports loading them into memory with ease, which would be tedious for the user to implement separately. It also incorporates policies that are useful for testing Off-Policy Evaluation methods.

Future Scope

Several exciting pieces of work are possible from this point.

Corrections

If you see mistakes or want to suggest changes, please create an issue in the source repository.