# ReinforcementLearningEnvironments.jl

## Built-in Environments

| Traits            |                      | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|:------------------|:---------------------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:--:|:--:|:--:|:--:|
| ActionStyle       | MinimalActionSet     | ✔ | ✔ | ✔ |   | ✔ |   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
|                   | FullActionSet        |   |   |   | ✔ |   | ✔ |   |   |   |   |   |   |   |
| ChanceStyle       | Stochastic           | ✔ |   | ✔ | ✔ |   |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
|                   | Deterministic        |   | ✔ |   |   | ✔ | ✔ |   |   |   |   |   |   |   |
|                   | ExplicitStochastic   |   |   |   |   |   |   | ✔ | ✔ | ✔ |   |   |   |   |
| DefaultStateStyle | Observation          | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |   | ✔ |   | ✔ | ✔ | ✔ | ✔ |
|                   | InformationSet       |   |   |   |   |   |   | ✔ |   | ✔ |   |   |   |   |
| DynamicStyle      | Simultaneous         |   |   |   |   | ✔ |   |   |   |   |   |   |   |   |
|                   | Sequential           | ✔ | ✔ | ✔ | ✔ |   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| InformationStyle  | PerfectInformation   |   | ✔ |   |   |   | ✔ |   | ✔ |   |   |   |   |   |
|                   | ImperfectInformation | ✔ |   | ✔ | ✔ | ✔ |   | ✔ |   | ✔ | ✔ | ✔ | ✔ | ✔ |
| NumAgentStyle     | MultiAgent           |   |   |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |   |
|                   | SingleAgent          | ✔ | ✔ | ✔ | ✔ |   |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
| RewardStyle       | TerminalReward       | ✔ | ✔ |   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |   |
|                   | StepReward           |   |   | ✔ |   |   |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
| StateStyle        | Observation          | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |   | ✔ |   | ✔ | ✔ | ✔ | ✔ |
|                   | InformationSet       |   |   |   |   |   |   | ✔ |   | ✔ |   |   |   |   |
|                   | InternalState        |   |   | ✔ |   |   |   |   |   |   |   |   |   |   |
| UtilityStyle      | GeneralSum           | ✔ | ✔ | ✔ | ✔ |   |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
|                   | ZeroSum              |   |   |   |   | ✔ | ✔ |   |   | ✔ |   |   |   |   |
|                   | ConstantSum          |   |   |   |   |   |   |   | ✔ |   |   |   |   |   |
|                   | IdenticalUtility     |   |   |   |   |   |   | ✔ |   |   |   |   |   |   |

1. MultiArmBanditsEnv
2. RandomWalk1D
3. TigerProblemEnv
4. MontyHallEnv
5. RockPaperScissorsEnv
6. TicTacToeEnv
7. TinyHanabiEnv
8. PigEnv
9. KuhnPokerEnv
10. AcrobotEnv
11. CartPoleEnv
12. MountainCarEnv
13. PendulumEnv

**Note**: Many traits are *borrowed* from OpenSpiel.

## Third-Party Environments

| Environment Name | Dependent Package Name | Description |
|:---|:---|:---|
| `AtariEnv` | ArcadeLearningEnvironment.jl | |
| `GymEnv` | PyCall.jl | |
| `OpenSpielEnv` | OpenSpiel.jl | |
| `SnakeGameEnv` | SnakeGames.jl | `SingleAgent`/`MultiAgent`, `FullActionSet`/`MinimalActionSet` |
| See GridWorlds.jl's list of environments | GridWorlds.jl | Environments in this package support the interfaces defined in `RLBase` |

`ReinforcementLearningEnvironments.KUHN_POKER_REWARD_TABLE` — Constant

`ReinforcementLearningEnvironments.ActionTransformedEnv` — Method

`ActionTransformedEnv(env; action_space_mapping=identity, action_mapping=identity)`

`action_space_mapping` will be applied to `action_space(env)` and `legal_action_space(env)`. `action_mapping` will be applied to `action` before feeding it into `env`.
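For example, here is a minimal sketch that exposes a discrete interface over the continuous `PendulumEnv`. It assumes the `RLBase` methods documented further down this page (`act!`, `action_space`) are in scope; the `torques` lookup table is just an illustrative choice, not part of the package API:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

torques = [-2.0, 0.0, 2.0]  # illustrative torque values within the pendulum's bounds
env = ActionTransformedEnv(
    PendulumEnv();
    action_space_mapping = _ -> Base.OneTo(3),  # advertise a discrete action space
    action_mapping = i -> torques[i],           # translate an index into a torque
)
act!(env, 2)  # the inner env receives a torque of 0.0
```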

`ReinforcementLearningEnvironments.BitFlippingEnv` — Type

In the Bit Flipping Environment we have `n` bits. The actions are `1` to `n`, where executing the i-th action flips the i-th bit of the state. For every episode we uniformly sample the initial state as well as the target state.

Refer to the Hindsight Experience Replay paper for the motivation behind this environment.

`ReinforcementLearningEnvironments.CartPoleEnv` — Method

`CartPoleEnv(;kwargs...)`

**Keyword arguments**

- `T = Float64`
- `continuous = false`
- `rng = Random.default_rng()`
- `gravity = T(9.8)`
- `masscart = T(1.0)`
- `masspole = T(0.1)`
- `halflength = T(0.5)`
- `forcemag = T(10.0)`
- `max_steps = 200`
- `dt = 0.02`
- `thetathreshold = 12.0 # degrees`
- `xthreshold = 2.4`
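A typical interaction loop might look like the following sketch, assuming the `RLBase` query functions documented further down this page:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = CartPoleEnv(; T = Float32, max_steps = 100)
reset!(env)
total = 0.0
while !is_terminated(env)
    act!(env, rand(action_space(env)))  # a random policy
    total += reward(env)                # the reward is queried after acting
end
```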

`ReinforcementLearningEnvironments.DefaultStateStyleEnv` — Method

`DefaultStateStyleEnv{S}(env::E)`

Reset the result of `DefaultStateStyle` without changing the original behavior.

`ReinforcementLearningEnvironments.GraphShortestPathEnv` — Type

`GraphShortestPathEnv([rng]; n=10, sparsity=0.1, max_steps=10)`

Quoted from **A.3** in the paper Decision Transformer: Reinforcement Learning via Sequence Modeling:

> We give details of the illustrative example discussed in the introduction. The task is to find the shortest path on a fixed directed graph, which can be formulated as an MDP where reward is 0 when the agent is at the goal node and −1 otherwise. The observation is the integer index of the graph node the agent is in. The action is the integer index of the graph node to move to next. The transition dynamics transport the agent to the action's node index if there is an edge in the graph, while the agent remains at the past node otherwise. The returns-to-go in this problem correspond to negative path lengths and maximizing them corresponds to generating shortest paths.

`ReinforcementLearningEnvironments.KuhnPokerEnv` — Method

`KuhnPokerEnv()`

See a more detailed description here.

Here we demonstrate how to write a typical `ZERO_SUM`, `IMPERFECT_INFORMATION` game. The implementation here has an explicit `CHANCE_PLAYER`.

TODO: add public state for `SPECTATOR`. Ref: https://arxiv.org/abs/1906.11110

`ReinforcementLearningEnvironments.MaxTimeoutEnv` — Method

`MaxTimeoutEnv(env::E, max_t::Int; current_t::Int = 1)`

Force `is_terminated(env)` to return `true` after `max_t` interactions.
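A small sketch of how this wrapper might be used, e.g. to cap episodes of an otherwise long-running environment:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

# Cap episodes at 50 interactions, regardless of the inner env's own limit.
env = MaxTimeoutEnv(CartPoleEnv(; max_steps = 10_000), 50)
```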

`ReinforcementLearningEnvironments.MontyHallEnv` — Method

`MontyHallEnv(;rng=Random.default_rng())`

Quoted from the wiki:

> Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Here we'll introduce the first environment which is of `FULL_ACTION_SET`.

`ReinforcementLearningEnvironments.MountainCarEnv` — Method

`MountainCarEnv(;kwargs...)`

**Keyword arguments**

- `T = Float64`
- `continuous = false`
- `rng = Random.default_rng()`
- `min_pos = -1.2`
- `max_pos = 0.6`
- `max_speed = 0.07`
- `goal_pos = 0.5`
- `max_steps = 200`
- `goal_velocity = 0.0`
- `power = 0.001`
- `gravity = 0.0025`

`ReinforcementLearningEnvironments.MultiArmBanditsEnv` — Method

`MultiArmBanditsEnv(;true_reward=0., k=10, rng=Random.default_rng())`

`true_reward` is the expected reward. `k` is the number of arms. See multi-armed bandit for a more detailed explanation.

This is a **one-shot** game. The environment terminates immediately after taking an action. Here we use it to demonstrate how to write a customized environment with only the minimal interfaces defined.
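The one-shot nature can be seen in a short interaction sketch, assuming the `RLBase` methods documented further down this page:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = MultiArmBanditsEnv(; k = 5)
act!(env, rand(action_space(env)))  # pull one of the 5 arms
is_terminated(env)                  # true: the game ends after a single pull
reward(env)                         # the sampled reward of the chosen arm
```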

`ReinforcementLearningEnvironments.PendulumEnv` — Method

`PendulumEnv(;kwargs...)`

**Keyword arguments**

- `T = Float64`
- `max_speed = T(8)`
- `max_torque = T(2)`
- `g = T(10)`
- `m = T(1)`
- `l = T(1)`
- `dt = T(0.05)`
- `max_steps = 200`
- `continuous::Bool = true`
- `n_actions::Int = 3`
- `rng = Random.default_rng()`

`ReinforcementLearningEnvironments.PendulumNonInteractiveEnv` — Type

A non-interactive pendulum environment.

Accepts only `nothing` actions, which result in the system being simulated for one time step. Sets `env.done` to `true` once `maximum_time` is reached. Resets to a random position and momentum. Always returns zero rewards.

Useful for debugging and development purposes, particularly in model-based reinforcement learning.

`ReinforcementLearningEnvironments.PendulumNonInteractiveEnv` — Method

`PendulumNonInteractiveEnv(;kwargs...)`

**Keyword arguments**

- `float_type = Float64`
- `gravity = 9.8`
- `length = 2.0`
- `mass = 1.0`
- `step_size = 0.01`
- `maximum_time = 10.0`
- `rng = Random.default_rng()`

`ReinforcementLearningEnvironments.PigEnv` — Method

`PigEnv(;n_players=2)`

See the wiki for an explanation of this game.

Here we use it to demonstrate how to write a game with more than two players.

`ReinforcementLearningEnvironments.RandomWalk1D` — Type

`RandomWalk1D(;rewards=-1. => 1.0, N=7, start_pos=(N+1) ÷ 2, actions=[-1,1])`

An agent is placed at `start_pos` and can move left or right (the stride is defined in `actions`). The game terminates when the agent reaches either end, where it receives the corresponding reward.

Compared to the `MultiArmBanditsEnv`:

- The state space is more complicated (well, not that complicated though).
- It's a sequential game of multiple action steps.
- It's a deterministic game instead of a stochastic one.
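A sketch of a full episode under a random policy, assuming the usual `RLBase` methods are in scope:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = RandomWalk1D(; N = 7)
reset!(env)
while !is_terminated(env)
    act!(env, rand(action_space(env)))  # action 1 moves left, action 2 moves right
end
reward(env)  # with the default `rewards`: -1.0 at the left end, 1.0 at the right end
```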

`ReinforcementLearningEnvironments.RewardOverriddenEnv` — Type

`RewardOverriddenEnv(env, f)`

Apply `f` on `env` to generate a custom reward.

`ReinforcementLearningEnvironments.RewardTransformedEnv` — Type

`RewardTransformedEnv(env, f)`

Apply `f` on `reward(env)`.
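For example, a common preprocessing step is reward clipping; a minimal sketch:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

# Clip rewards into [-1, 1] before they reach the agent.
env = RewardTransformedEnv(CartPoleEnv(), r -> clamp(r, -1, 1))
```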

`ReinforcementLearningEnvironments.RockPaperScissorsEnv` — Type

`RockPaperScissorsEnv()`

Rock Paper Scissors is a simultaneous, zero-sum game.

`ReinforcementLearningEnvironments.StateCachedEnv` — Type

Cache the state so that `state(env)` will always return the same result before the next interaction with `env`. This wrapper is useful because some environments are stateful during each `state(env)` call. For example: `StateTransformedEnv(StackFrames(...))`.

`ReinforcementLearningEnvironments.StateTransformedEnv` — Method

`StateTransformedEnv(env; state_mapping=identity, state_space_mapping=identity)`

`state_mapping` will be applied on the original state when calling `state(env)`, and similarly `state_space_mapping` will be applied when calling `state_space(env)`.
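A minimal sketch, converting observations to `Float32` (e.g. to match a neural network's element type):

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = StateTransformedEnv(
    CartPoleEnv();                    # its raw state is a Vector{Float64} by default
    state_mapping = s -> Float32.(s), # applied lazily on every `state(env)` call
)
state(env)  # now a Vector{Float32}
```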

`ReinforcementLearningEnvironments.StockTradingEnv` — Method

`StockTradingEnv(;kw...)`

This environment is originally provided in the paper Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy.

**Keyword Arguments**

- `initial_account_balance = 1_000_000`

`ReinforcementLearningEnvironments.TicTacToeEnv` — Method

`TicTacToeEnv()`

Create a new instance of the Tic-Tac-Toe environment.

`ReinforcementLearningEnvironments.TigerProblemEnv` — Type

`TigerProblemEnv(;rng=Random.GLOBAL_RNG)`

Here we use the Tiger Problem to demonstrate how to write a POMDP problem.

`ReinforcementLearningEnvironments.TinyHanabiEnv` — Method

`TinyHanabiEnv()`

See https://arxiv.org/abs/1902.00506.

`Random.seed!` — Method

The multi-arm bandits environment is a stochastic environment. The resulting reward may be different even when taking the same action each time. For this kind of environment, `Random.seed!(env)` must be implemented to improve reproducibility without creating a new instance with the same `rng`.
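A sketch of the intended usage, assuming `Random.seed!(env, seed)` re-seeds the environment's internal `rng`:

```julia
using Random
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = MultiArmBanditsEnv()
Random.seed!(env, 123)
act!(env, 1)
r1 = reward(env)

reset!(env)
Random.seed!(env, 123)
act!(env, 1)
reward(env) == r1  # true: same seed, same stochastic outcome
```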

`ReinforcementLearningBase.act!` — Method

In our design, the return value of taking an action in `env` is **undefined**. This is the main difference compared to the interfaces defined in OpenAI Gym. We find that the async manner is more suitable to describe many complicated environments. However, one of the inconveniences is that we have to cache some intermediate data for future queries. Here we have to store `reward` and `is_terminated` in the instance of `env` for future queries.
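In other words, acting and observing are separate steps. A minimal sketch of the query pattern, using `CartPoleEnv` purely as an example:

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = CartPoleEnv()
a = rand(action_space(env))

act!(env, a)               # the return value is undefined; don't rely on it
r = reward(env)            # instead, query the cached reward afterwards
done = is_terminated(env)  # ...and the cached termination flag
```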

`ReinforcementLearningBase.action_space` — Method

First we need to define the action space. In the `MultiArmBanditsEnv` environment, the possible actions are `1` to `k` (which equals `length(env.true_values)`).

Although we decide to return an action space of `Base.OneTo` here, it is not a hard requirement. You can return anything else (`Tuple`, `Distribution`, etc.) that is more suitable to describe your problem, as long as you handle it correctly when an action is fed into your environment. Some algorithms may require that the action space must be of `Base.OneTo`. However, it's the algorithm designer's job to do the checking and conversion.

`ReinforcementLearningBase.current_player` — Method

Note that although this is a two-player game, the current player is always a dummy simultaneous player.

`ReinforcementLearningBase.legal_action_space` — Method

In the first round, the guest has 3 options. In the second round only two options are valid: those different from the host's action.
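A sketch of how the legal actions shrink in `MontyHallEnv` (the exact return types are implementation details; only the counts are shown):

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = MontyHallEnv()
length(legal_action_space(env))  # 3: any door can be picked
act!(env, 1)                     # guest picks door 1; the host opens a goat door
length(legal_action_space(env))  # 2: every door except the one the host opened
```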

`ReinforcementLearningBase.legal_action_space_mask` — Method

For environments of `FULL_ACTION_SET`, this function must be implemented.

`ReinforcementLearningBase.reward` — Method

If the `env` is not started yet, the returned value is meaningless. The reason why we don't throw an exception here is to simplify the code logic and to keep type consistency when storing the value in buffers.

`ReinforcementLearningBase.state` — Method

Since `MultiArmBanditsEnv` is just a one-shot game, it doesn't matter what the state is after each action. So here we can simply set it to a constant `1`.

`ReinforcementLearningBase.state` — Method

For multi-agent environments, we usually implement the most detailed one.

`ReinforcementLearningBase.state` — Method

The main difference compared to other environments is that now we have two kinds of *states*: the **observation** and the **internal state**. By default we return the **observation**.

`ReinforcementLearningBase.state_space` — Method

Since it's a one-shot game, the state space doesn't have much meaning.

`ReinforcementLearningEnvironments.discrete2standard_discrete` — Method

`discrete2standard_discrete(env)`

Convert an `env` with a discrete action space to a standard form:

- The action space is of type `Base.OneTo`.
- If the `env` is of `FULL_ACTION_SET`, then each action in the `legal_action_space(env)` is also an `Int` in the action space.

The standard form is useful for some algorithms (like Q-learning).
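For instance, a sketch using `MontyHallEnv` (which is of `FULL_ACTION_SET`, so both guarantees above apply):

```julia
using ReinforcementLearningEnvironments, ReinforcementLearningBase

env = discrete2standard_discrete(MontyHallEnv())
action_space(env)        # a Base.OneTo
legal_action_space(env)  # integers drawn from that action space
```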

`ReinforcementLearningEnvironments.install_gym` — Method

`install_gym(; packages = ["gym", "pybullet"])`
— Method`install_gym(; packages = ["gym", "pybullet"])`