How to Write a Customized Environment in ReinforcementLearning.jl?

  • Last Update: 2021-01-30T21:08:01.778

  • Julia Version: v"1.5.3"

  • ReinforcementLearning.jl Version: v"0.8.0"


The first step in applying the algorithms in ReinforcementLearning.jl is to define the problem you want to solve in a form the package recognizes. Here we'll demonstrate how to write several different kinds of environments based on the interfaces defined in ReinforcementLearningBase.jl.

The most commonly used interface for describing reinforcement learning tasks is OpenAI Gym. Inspired by it, we extend that interface a little to take advantage of multiple dispatch in Julia and to cover multi-agent environments.

The Minimal Interfaces to Implement

Many interfaces in ReinforcementLearningBase.jl have a default implementation. So in most cases, you only need to implement the following functions to define a customized environment:

action_space(env::YourEnv)
state(env::YourEnv)
state_space(env::YourEnv)
reward(env::YourEnv)
is_terminated(env::YourEnv)
reset!(env::YourEnv)
(env::YourEnv)(action)

An Example: The LotteryEnv

Here we use an example introduced in Monte Carlo Tree Search: A Tutorial to demonstrate how to write a simple environment.

The game is defined like this: assume you have $10 in your pocket, and you are faced with the following three choices:

  1. Buy a PowerRich lottery ticket (win $100M w.p. 0.01; nothing otherwise);

  2. Buy a MegaHaul lottery ticket (win $1M w.p. 0.05; nothing otherwise);

  3. Do not buy a lottery ticket.

This game is a one-shot game. It terminates immediately after taking an action and a reward is received. First we define a concrete subtype of AbstractEnv named LotteryEnv:
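A minimal sketch of such a type, assuming ReinforcementLearning.jl v0.8.0 as listed at the top of this post. The field type `Union{Nothing,Int}` is an assumption; the text only states that there is a single `reward` field that defaults to `nothing`.

```julia
using ReinforcementLearning

# One mutable field: `nothing` until an action has been taken,
# afterwards the payoff of that action (assumed to be an Int).
Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
    reward::Union{Nothing,Int} = nothing
end
```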


LotteryEnv has only one field, named reward, which is initialized with nothing by default. Now let's implement the necessary interfaces:


Here RLBase is just an alias for ReinforcementLearningBase.


The lottery game is just a simple one-shot game. If the reward is nothing, the game has not started yet and we say it is in state false; otherwise the game has terminated and the state is true. The result of state_space(env) therefore describes the possible states of this environment. Calling reset! on the game simply sets the reward back to nothing, meaning that it is in the initial state again.
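The interface methods described here might look as follows. This is a sketch assuming ReinforcementLearning.jl v0.8.0; the action tuple and the Bool state space are taken from the environment printout in this post, and the struct is repeated only so that the block runs on its own.

```julia
using ReinforcementLearning

# Repeated so this block is runnable by itself (field type is an assumption).
Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
    reward::Union{Nothing,Int} = nothing
end

# The three choices: two lottery tickets, or buy nothing.
RLBase.action_space(env::LotteryEnv) = (:PowerRich, :MegaHaul, nothing)

# The reward is simply whatever the last action produced.
RLBase.reward(env::LotteryEnv) = env.reward

# false: game not started yet; true: game finished.
RLBase.state(env::LotteryEnv) = !isnothing(env.reward)
RLBase.state_space(env::LotteryEnv) = [false, true]

# One-shot game: it is over as soon as a reward has been assigned.
RLBase.is_terminated(env::LotteryEnv) = !isnothing(env.reward)

# Back to the initial state.
RLBase.reset!(env::LotteryEnv) = env.reward = nothing
```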

The only thing left is to implement the game logic:
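A sketch of the game logic, written as a method that makes the environment callable with an action, matching the `(env::YourEnv)(action)` entry in the list of minimal interfaces. The win probabilities and prizes come from the game description. Scoring a losing ticket as -10 (the ticket price, named `TICKET_PRICE` here) is an assumption; the text only says that nothing is won.

```julia
using ReinforcementLearning

# Repeated so this block is runnable by itself (field type is an assumption).
Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
    reward::Union{Nothing,Int} = nothing
end

const TICKET_PRICE = 10  # assumption: a losing ticket costs the $10 in your pocket

function (env::LotteryEnv)(action)
    if action == :PowerRich
        # win $100M with probability 0.01
        env.reward = rand() < 0.01 ? 100_000_000 : -TICKET_PRICE
    elseif action == :MegaHaul
        # win $1M with probability 0.05
        env.reward = rand() < 0.05 ? 1_000_000 : -TICKET_PRICE
    elseif isnothing(action)
        # no ticket bought, nothing won or lost
        env.reward = 0
    else
        throw(ArgumentError("unknown action: $action"))
    end
    return nothing
end
```

Because a reward is assigned on every valid action, the episode ends after a single step.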


Test Your Environment

A method named RLBase.test_runnable! is provided to roll out several simulations and check whether the environment we defined is functional.

env
# LotteryEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Bool[0, 1]`

## Action Space

`(:PowerRich, :MegaHaul, nothing)`

## Current State

```
false
```

It is a simple smell test which works like this:

for _ in 1:n_episode
    reset!(env)
    while !is_terminated(env)
        env |> action_space |> rand |> env
    end
end

One step further is to test that other components in ReinforcementLearning.jl also work. Similar to the test above, let's try the RandomPolicy first:


If no error shows up, then it means our environment at least works with the RandomPolicy 🎉🎉🎉. Next, we can add a hook to collect the reward in each episode to see the performance of the RandomPolicy.


A random policy is usually not very meaningful. Here we'll use a tabular Monte Carlo method to estimate the state-action values. (You may choose an appropriate algorithm based on the problem you're dealing with.)

p
QBasedPolicy
├─ learner => MonteCarloLearner
│  ├─ approximator => TabularApproximator
│  │  ├─ table => 3×2 Array{Float64,2}
│  │  └─ optimizer => InvDecay
│  │     ├─ gamma => 1.0
│  │     └─ state => IdDict
│  ├─ γ => 1.0
│  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  └─ sampling => ReinforcementLearningZoo.NoSampling
└─ explorer => EpsilonGreedyExplorer
   ├─ ϵ_stable => 0.1
   ├─ ϵ_init => 1.0
   ├─ warmup_steps => 0
   ├─ decay_steps => 0
   ├─ step => 1
   ├─ rng => Random._GLOBAL_RNG
   └─ is_training => true

MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)

Closest candidates are:

Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30

Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31

  1. (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Bool)@monte_carlo_learner.jl:45
  2. (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Main.workspace3.LotteryEnv)@monte_carlo_learner.jl:44
  3. (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv, ::ReinforcementLearningBase.MinimalActionSet, ::Tuple{Symbol,Symbol,Nothing})@q_based_policy.jl:27
  4. (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv)@q_based_policy.jl:21
  5. top-level scope@Local: 1[inlined]
---

Oops, we get an error here. So what does it mean?

Before answering this question, let's spend some time understanding the policy we defined above. A QBasedPolicy contains two parts: a learner and an explorer. The learner learns the state-action value function (aka the Q function) during interactions with the env. The explorer is used to select an action based on the Q values returned by the learner. Here the EpsilonGreedyExplorer(0.1) will select the action with the largest value with probability 0.9 and select a random one with probability 0.1. Inside the MonteCarloLearner, a TabularQApproximator is used to estimate the Q values.
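To make the explorer's behaviour concrete, here is a minimal epsilon-greedy sketch in plain Julia. The function `epsilon_greedy` is a hypothetical helper written for illustration; it is not the library's EpsilonGreedyExplorer and ignores features such as warmup and decay.

```julia
using Random

# Hypothetical helper: with probability ϵ pick a uniformly random action,
# otherwise pick the action with the largest Q value.
function epsilon_greedy(rng::AbstractRNG, q_values::AbstractVector{<:Real}, ϵ::Real)
    if rand(rng) < ϵ
        return rand(rng, eachindex(q_values))  # explore
    else
        return argmax(q_values)                # exploit
    end
end

rng = MersenneTwister(123)
q = [0.2, 1.5, -0.3]   # made-up Q values for three actions
picks = [epsilon_greedy(rng, q, 0.1) for _ in 1:10_000]
greedy_share = count(==(2), picks) / length(picks)
```

In this sketch the random branch may also land on the greedy action, so `greedy_share` ends up slightly above 0.9.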

That's the problem! A TabularQApproximator only accepts states of type Int.
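The failure can be reproduced in miniature without the library. In the sketch below, `TinyTable` is a hypothetical stand-in for a tabular approximator whose call method is defined only for `Int`, mirroring the `Any(!Matched::Int64)` candidate in the error message; it is not the library's type.

```julia
# Hypothetical stand-in: a Q table with one column per state.
struct TinyTable
    table::Matrix{Float64}
end

# Only Int states are accepted, as in the "closest candidates" above.
(t::TinyTable)(s::Int) = @view t.table[:, s]

t = TinyTable(zeros(3, 2))
t(1)                      # fine: the Q values of state 1

# Bool is an Integer but not an Int, so no method matches.
err = try
    t(true)
catch e
    e
end
err isa MethodError       # true
```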


MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)

Closest candidates are:

Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30

Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31

  1. top-level scope@Local: 1[inlined]
---

OK, now we know where the problem is. But how do we fix it?

An initial idea is to rewrite the RLBase.state(env::LotteryEnv) function to force it to return an Int. That works. But in some cases we may be using environments written by others, and it's not very easy to modify their code directly. Fortunately, some built-in wrappers are provided to help us transform the environment.
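The wrappers need two small mappings: one from the Bool state to an integer index, and one from an integer action index back to the original action. The plain-Julia sketch below shows only those mappings, not the wrapper constructors themselves. The names `state_mapping` and `action_mapping` are illustrative, and the choice `true -> 1`, `false -> 2` is an assumption that is consistent with the wrapped environment's printout in this post (state `1` once terminated, action space `Base.OneTo(3)`).

```julia
# Original action space of the lottery environment, as printed earlier.
actions = (:PowerRich, :MegaHaul, nothing)

# Assumed state mapping: terminated (true) -> 1, not started (false) -> 2.
state_mapping(s::Bool) = s ? 1 : 2

# Integer action index -> original action.
action_mapping(i::Integer) = actions[i]

# The action space seen by a tabular learner becomes 1:3.
wrapped_action_space = Base.OneTo(length(actions))

state_mapping(false), action_mapping(2)
```

Functions of this kind are what one would hand to the StateOverriddenEnv and ActionTransformedEnv wrappers named in the output.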

wrapped_env
# LotteryEnv |> StateOverriddenEnv |> ActionTransformedEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

Yes

## State Space

`Bool[0, 1]`

## Action Space

`Base.OneTo(3)`

## Current State

```
1
```

Nice job! Now we are ready to run the experiment:


If you are observant enough, you'll find that our policy is not updating at all!!!

3×2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
 0.0  0.0

Well, actually the policy is running in evaluation mode here. We'll explain that in another blog post. For now, you only need to know that we can wrap the policy in an Agent to train it.

agent
Agent
├─ policy => QBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 3×2 Array{Float64,2}
│  │  │  └─ optimizer => InvDecay
│  │  │     ├─ gamma => 1.0
│  │  │     └─ state => IdDict
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ explorer => EpsilonGreedyExplorer
│     ├─ ϵ_stable => 0.1
│     ├─ ϵ_init => 1.0
│     ├─ warmup_steps => 0
│     ├─ decay_steps => 0
│     ├─ step => 1002
│     ├─ rng => Random._GLOBAL_RNG
│     └─ is_training => true
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
3×2 Array{Float64,2}:
 0.0      1.00773e6
 0.0  47660.8
 0.0      0.0
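As a sanity check, the learned values in the second column can be compared with the expected payoff of each ticket. The sketch below is plain Julia and independent of the library. The helpers `expected_payoff` and `mc_estimate` are hypothetical, and the -10 payoff for a losing ticket is an assumption (the game description only says nothing is won).

```julia
using Random

# Expected payoff of a ticket; `loss` is the assumed payoff of a losing ticket.
expected_payoff(p_win, prize; loss = -10) = p_win * prize + (1 - p_win) * loss

ev_powerrich = expected_payoff(0.01, 100_000_000)  # close to 1e6
ev_megahaul  = expected_payoff(0.05, 1_000_000)    # close to 5e4

# Monte Carlo estimate of the same quantity as a running mean of sampled returns.
function mc_estimate(rng, p_win, prize; loss = -10, n = 100_000)
    q = 0.0
    for k in 1:n
        g = rand(rng) < p_win ? prize : loss
        q += (g - q) / k   # incremental mean update
    end
    return q
end

est_megahaul = mc_estimate(MersenneTwister(1), 0.05, 1_000_000)
```

The table entries 1.00773e6 and 47660.8 are in line with these expectations, with the remaining gap explained by sampling noise over a limited number of episodes.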

Note

Always remember that each algorithm usually only works in some specific environments, just like the `QBasedPolicy` above. So choose the right tool wisely 😉.

More Complicated Environments

The above LotteryEnv is quite simple. Many environments we are interested in fall into the same category. Beyond that, there are still many other kinds of environments. You may take a glimpse at the table to see how many different types of environments are supported in ReinforcementLearningZoo.jl.

To distinguish different kinds of environments, some common traits are defined in ReinforcementLearningBase.jl. Now we'll explain them one-by-one.

StateStyle

In the above LotteryEnv, state(env::LotteryEnv) simply returns true or false. But in some other environments, the function name state may be rather vague. People with different backgrounds often talk about the same thing under different names. You may be interested in this discussion: What is the difference between an observation and a state in reinforcement learning? To avoid confusion when executing state(env), the environment designer can explicitly define state(::AbstractStateStyle, env::YourEnv), so that users can fetch the necessary information on demand. The following are some built-in state styles:


Note that every state style may have different representations, such as String, Array, Graph and so on. All of the above state styles accept a data type as a parameter. For example:


For environments which support many different kinds of states, developers should specify all the supported state styles. For example:


DefaultStateStyle

The DefaultStateStyle trait returns the first element in the result of StateStyle by default.

Algorithm developers usually don't care about the state style. They can assume that the default state style is always well defined and simply call state(env) to get the right representation. So for environments with many different representations, state(env) will be dispatched to state(DefaultStateStyle(env), env). And we can use the DefaultStateStyleEnv wrapper to override the pre-defined DefaultStateStyle(::YourEnv).


RewardStyle

For games like Chess, Go or many card games, we only get the reward at the end of a game. We say games of this kind are of TerminalReward style; otherwise we define them as StepReward. Actually, TerminalReward is a special case of StepReward (for non-terminal steps, the reward is 0). The reason we still want to distinguish these two cases is that, for some algorithms, there may be a more efficient implementation for TerminalReward style games.
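The efficiency argument can be illustrated with a small trait-dispatch sketch in plain Julia. The types `MyStepReward` and `MyTerminalReward` and the function `episode_return` are stand-ins invented for this illustration; they are not the trait types from ReinforcementLearningBase.jl.

```julia
# Stand-in trait types, for illustration only.
abstract type MyRewardStyle end
struct MyStepReward <: MyRewardStyle end
struct MyTerminalReward <: MyRewardStyle end

# General case: every step may carry a reward, so all of them are summed.
episode_return(::MyStepReward, rewards) = sum(rewards)

# Terminal-reward case: only the last step can be non-zero,
# so looking at the final reward is enough.
episode_return(::MyTerminalReward, rewards) = last(rewards)

step_rewards = [1, 2, 3]       # rewards at every step
terminal_rewards = [0, 0, 1]   # reward only at the end of the game

episode_return(MyStepReward(), step_rewards)
episode_return(MyTerminalReward(), terminal_rewards)
```

An algorithm written this way picks the cheaper method through dispatch on the trait value, while the general method remains valid for both kinds of games.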


ActionStyle

For some environments, the valid actions may differ from step to step. We say environments of this kind are of FullActionSet style. Otherwise, we say the environment is of MinimalActionSet style. A typical built-in environment with FullActionSet is the TicTacToeEnv. Two extra methods must be implemented:


NumAgentStyle

In the above LotteryEnv, only one player is involved in the environment. In many board games, usually multiple players are engaged.


For multi-agent environments, some new APIs are introduced. The meaning of some APIs we've already seen is also extended.

First, multi-agent environment developers must implement players to distinguish different players.

| Single Agent       | Multi-Agent                |
|:------------------ |:-------------------------- |
| state(env)         | state(env, player)         |
| reward(env)        | reward(env, player)        |
| env(action)        | env(action, player)        |
| action_space(env)  | action_space(env, player)  |
| state_space(env)   | state_space(env, player)   |
| is_terminated(env) | is_terminated(env, player) |

Note that the single-agent APIs are still valid; they simply fall back to the perspective of the current_player(env).


UtilityStyle

In multi-agent environments, sometimes the sum of the rewards of all players is always 0. We call the UtilityStyle of these environments ZeroSum. ZeroSum is a special case of ConstantSum. In cooperative games, the reward of each player is the same. In this case, they are called IdenticalUtility. Other cases fall back to GeneralSum.
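These definitions can be written down as simple checks over a list of possible reward profiles, one tuple of per-player rewards per outcome. The sketch below is plain Julia; the function names and the symbols it returns are chosen for illustration and are not part of the library's API.

```julia
# Each outcome is a tuple of rewards, one entry per player.
is_zero_sum(outcomes) = all(iszero(sum(r)) for r in outcomes)
is_constant_sum(outcomes) = all(sum(r) == sum(first(outcomes)) for r in outcomes)
is_identical_utility(outcomes) = all(all(==(first(r)), r) for r in outcomes)

# Classify from the most specific case to the most general one.
function utility_style(outcomes)
    is_identical_utility(outcomes) && return :IdenticalUtility
    is_zero_sum(outcomes) && return :ZeroSum
    is_constant_sum(outcomes) && return :ConstantSum
    return :GeneralSum
end

# Win, draw and loss in a two-player game such as rock-paper-scissors.
rps_outcomes = [(1, -1), (0, 0), (-1, 1)]
utility_style(rps_outcomes)
```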


InformationStyle

If all players can see the same state, then we say the InformationStyle of these environments is PerfectInformation. They are a special case of ImperfectInformation environments.


DynamicStyle

All the environments we've seen so far were of Sequential style, meaning that at each step, only ONE player was allowed to take an action. Alternatively there are Simultaneous environments, where all the players take actions simultaneously without seeing each other's action in advance. Simultaneous environments must take a collection of actions from different players as input.


ChanceStyle

If there is no rng in the environment and everything is deterministic after each action is taken, then we say the ChanceStyle of the environment is Deterministic. Otherwise, we call it Stochastic. One special case is Extensive Form Games, in which a chance node is involved and the action probability of this special player is known. For these environments, we need to have the following methods defined:


Examples

Finally we've gone through all the details you need to know for how to write a customized environment. You're encouraged to take a look at the examples provided in ReinforcementLearningEnvironments.jl. Feel free to create an issue there if you're still not sure how to describe your problem with the interfaces defined in this package.
