ReinforcementLearningEnvironments.jl

Built-in Environments

Each of the built-in environments listed below is classified along the following trait dimensions:

  • ActionStyle: MinimalActionSet, FullActionSet
  • ChanceStyle: Stochastic, Deterministic, ExplicitStochastic
  • DefaultStateStyle: Observation, InformationSet
  • DynamicStyle: Simultaneous, Sequential
  • InformationStyle: PerfectInformation, ImperfectInformation
  • NumAgentStyle: MultiAgent, SingleAgent
  • RewardStyle: TerminalReward, StepReward
  • StateStyle: Observation, InformationSet, InternalState
  • UtilityStyle: GeneralSum, ZeroSum, ConstantSum, IdenticalUtility

  1. MultiArmBanditsEnv
  2. RandomWalk1D
  3. TigerProblemEnv
  4. MontyHallEnv
  5. RockPaperScissorsEnv
  6. TicTacToeEnv
  7. TinyHanabiEnv
  8. PigEnv
  9. KuhnPokerEnv
  10. AcrobotEnv
  11. CartPoleEnv
  12. MountainCarEnv
  13. PendulumEnv

Note: Many traits are borrowed from OpenSpiel.

Third-Party Environments

Environment Name        | Dependent Package Name       | Description
AtariEnv                | ArcadeLearningEnvironment.jl |
GymEnv                  | PyCall.jl                    |
OpenSpielEnv            | OpenSpiel.jl                 |
SnakeGameEnv            | SnakeGames.jl                | SingleAgent/Multi-Agent, FullActionSet/MinimalActionSet
GridWorlds environments | GridWorlds.jl                | Environments in this package use the interfaces defined in RLBase directly
ReinforcementLearningEnvironments.ActionTransformedEnvMethod
ActionTransformedEnv(env; action_space_mapping=identity, action_mapping=identity)

action_space_mapping will be applied to action_space(env) and legal_action_space(env). action_mapping will be applied to the action before it is fed into env.

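For example, the following sketch (not from the original docs; CartPoleEnv and the 0-based mapping are only illustrative) exposes a 0-based action interface on top of an environment whose native actions are 1-based:

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = ActionTransformedEnv(
    CartPoleEnv();
    action_space_mapping = as -> 0:length(as)-1,  # what the agent sees
    action_mapping = a -> a + 1,                  # shift back before acting on the inner env
)
action_space(env)  # 0:1 instead of Base.OneTo(2)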
ReinforcementLearningEnvironments.AtariEnvMethod
AtariEnv(;kwargs...)

This implementation follows the guidelines in Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

Keywords

  • name::String="pong": name of the Atari environments. Use ReinforcementLearningEnvironments.list_atari_rom_names() to show all supported environments.
  • grayscale_obs::Bool=true: if true, a grayscale observation is returned; otherwise, an RGB observation is returned.
  • noop_max::Int=30: max number of no-ops.
  • frame_skip::Int=4: the frequency at which the agent experiences the game.
  • terminal_on_life_loss::Bool=false: if true, then game is over whenever a life is lost.
  • repeat_action_probability::Float64=0.
  • color_averaging::Bool=false: whether to perform phosphor averaging or not.
  • max_num_frames_per_episode::Int=0
  • full_action_space::Bool=false: by default, only the minimal action set is used. If true, one needs to call legal_actions to get the valid action set. TODO
  • seed::Int: used to set the initial seed of the underlying C environment and the rng used by this wrapper environment to initialize the number of no-op steps at the beginning of each episode.
  • log_level::Symbol: either :info, :warning or :error. The default value is :error.

See also the Python implementation.

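A construction sketch, assuming ArcadeLearningEnvironment.jl is installed and loaded (the keyword values are just the documented defaults plus a seed):

using ArcadeLearningEnvironment  # must be loaded before AtariEnv is available
using ReinforcementLearningEnvironments

env = AtariEnv(; name = "pong", grayscale_obs = true, frame_skip = 4, seed = 123)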
ReinforcementLearningEnvironments.MontyHallEnvMethod
MontyHallEnv(;rng=Random.GLOBAL_RNG)

Quoted from wiki:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Here we'll introduce the first environment which is of FULL_ACTION_SET.

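A short interaction sketch showing why the FULL_ACTION_SET trait matters here (the exact legal actions at each step are omitted, since they depend on the game state):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MontyHallEnv()
ActionStyle(env)                    # FULL_ACTION_SET
action_space(env)                   # all actions
legal_action_space(env)             # only the subset that is legal in the current state
env(first(legal_action_space(env))) # take the first legal action
legal_action_space(env)             # typically different after the host opens a door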
ReinforcementLearningEnvironments.MountainCarEnvMethod
MountainCarEnv(;kwargs...)

Keyword arguments

  • T = Float64
  • continuous = false
  • rng = Random.GLOBAL_RNG
  • min_pos = -1.2
  • max_pos = 0.6
  • max_speed = 0.07
  • goal_pos = 0.5
  • max_steps = 200
  • goal_velocity = 0.0
  • power = 0.001
  • gravity = 0.0025
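
For example, a continuous-action variant (keyword values other than continuous are the documented defaults):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MountainCarEnv(; T = Float32, continuous = true)
state(env)         # current position and velocity
action_space(env)  # an interval when continuous = true, a discrete set otherwise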
ReinforcementLearningEnvironments.MultiArmBanditsEnvMethod

In our design, the return value of taking an action in env is undefined. This is the main difference compared to the interfaces defined in OpenAI Gym. We find that this asynchronous style is better suited to describing many complicated environments. One inconvenience, however, is that we have to cache some intermediate data for later queries. Here we store reward and is_terminated in the env instance for that purpose.

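A sketch of the resulting act-then-query pattern:

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MultiArmBanditsEnv()
env(1)              # acting returns nothing meaningful by design
reward(env)         # query the cached reward of the last step
is_terminated(env)  # query the cached termination flag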
ReinforcementLearningEnvironments.MultiArmBanditsEnvMethod
MultiArmBanditsEnv(;true_reward=0., k=10, rng=Random.GLOBAL_RNG)

true_reward is the expected reward. k is the number of arms. See multi-armed bandit for a more detailed explanation.

This is a one-shot game. The environment terminates immediately after taking an action. Here we use it to demonstrate how to write a customized environment with only the minimal interfaces defined.

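For instance (the keyword values below are arbitrary and only illustrate the constructor):

using Random
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MultiArmBanditsEnv(; k = 5, true_reward = 1.0, rng = MersenneTwister(123))
action_space(env)             # Base.OneTo(5)
env(rand(action_space(env)))  # pull one arm
is_terminated(env)            # true, since this is a one-shot game
reset!(env)                   # start a new episode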
ReinforcementLearningEnvironments.OpenSpielEnvType
OpenSpielEnv(name; state_type=nothing, kwargs...)

Arguments

  • name::String: you can call OpenSpiel.registered_names() to see all the supported names. Note that the name can contain parameters, like "goofspiel(imp_info=True,num_cards=4,points_order=descending)". Because the parameters part is parsed by the backend C++ code, Boolean values must be True or False (instead of true or false). Alternatively, you can specify the parameters as kwargs in the Julia style.
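
A construction sketch, assuming OpenSpiel.jl is installed and loaded (the game and parameters are taken from the example in the docstring above; the kwargs form is an assumed equivalent spelling):

using OpenSpiel  # must be loaded before OpenSpielEnv is available
using ReinforcementLearningEnvironments

# parameters embedded in the name string...
env = OpenSpielEnv("goofspiel(imp_info=True,num_cards=4,points_order=descending)")
# ...or passed as keyword arguments in the Julia style
env = OpenSpielEnv("goofspiel"; imp_info = true, num_cards = 4, points_order = "descending")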
ReinforcementLearningEnvironments.PendulumEnvMethod
PendulumEnv(;kwargs...)

Keyword arguments

  • T = Float64
  • max_speed = T(8)
  • max_torque = T(2)
  • g = T(10)
  • m = T(1)
  • l = T(1)
  • dt = T(0.05)
  • max_steps = 200
  • continuous::Bool = true
  • n_actions::Int = 3
  • rng = Random.GLOBAL_RNG
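
For example, a discrete-action pendulum (all other keywords keep the documented defaults):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = PendulumEnv(; continuous = false, n_actions = 5)
action_space(env)  # a discrete set of torques when continuous = false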
ReinforcementLearningEnvironments.PendulumNonInteractiveEnvType

A non-interactive pendulum environment.

Accepts only nothing actions, which result in the system being simulated for one time step. Sets env.done to true once maximum_time is reached. Resets to a random position and momentum. Always returns zero rewards.

Useful for debugging and development purposes, particularly in model-based reinforcement learning.

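A minimal simulation-loop sketch (assuming the default constructor needs no arguments):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = PendulumNonInteractiveEnv()
reset!(env)
while !is_terminated(env)
    env(nothing)  # the only accepted action: advance the simulation by one time step
end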
ReinforcementLearningEnvironments.RandomWalk1DType
RandomWalk1D(;rewards=-1. => 1.0, N=7, start_pos=(N+1) ÷ 2, actions=[-1,1])

An agent is placed at start_pos and can move either left or right (the stride is defined in actions). The game terminates when the agent reaches either end, and it receives the corresponding reward.

Compared to the MultiArmBanditsEnv:

  1. The state space is more complicated (well, not that complicated though).
  2. It's a sequential game of multiple action steps.
  3. It's a deterministic game instead of a stochastic one.
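
A quick interaction sketch using the documented defaults:

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = RandomWalk1D()
state(env)                    # starts at (N + 1) ÷ 2, i.e. 4 with the defaults
env(rand(action_space(env)))  # take one random step
reward(env), is_terminated(env)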
ReinforcementLearningEnvironments.StateCachedEnvType

Cache the state so that state(env) will always return the same result until the next interaction with env. This wrapper is useful because some environments are stateful each time state(env) is called. For example: StateTransformedEnv(StackFrames(...)).

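Usage sketch (the inner CartPoleEnv is only an example):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = StateCachedEnv(CartPoleEnv())
s1 = state(env)
s2 = state(env)  # identical to s1; the cached value is reused until the next interaction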
ReinforcementLearningEnvironments.StateTransformedEnvMethod
StateTransformedEnv(env; state_mapping=identity, state_space_mapping=identity)

state_mapping will be applied to the original state when calling state(env), and similarly state_space_mapping will be applied when calling state_space(env).

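For example (a hypothetical mapping that converts the state to Float32):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = StateTransformedEnv(
    CartPoleEnv();
    state_mapping = s -> Float32.(s),  # applied in state(env)
    state_space_mapping = identity,    # applied in state_space(env)
)
state(env)  # elements converted to Float32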
Random.seed!Method

The multi-arm bandits environment is a stochastic environment. The resulting reward may differ even when the same actions are taken each time. For this kind of environment, Random.seed!(env) must be implemented to improve reproducibility without creating a new instance of the same rng.

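A usage sketch, assuming the two-argument form Random.seed!(env, seed) that reseeds the environment's internal rng in place:

using Random
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MultiArmBanditsEnv()
Random.seed!(env, 123)  # assumed signature; reseeds env.rng without replacing it
env(1)
reward(env)             # reproducible across runs seeded with the same value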
ReinforcementLearningBase.action_spaceMethod

First we need to define the action space. In the MultiArmBanditsEnv environment, the possible actions are 1 to k (which equals length(env.true_values)).

Note

Although we decide to return an action space of Base.OneTo here, it is not a hard requirement. You can return anything else (Tuple, Distribution, etc.) that is more suitable for describing your problem, and handle it correctly in the your_env(action) function. Some algorithms may require the action space to be a Base.OneTo. However, it's the algorithm designer's job to do the checking and conversion.

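A sketch of roughly how such a method can be defined, matching the description above (the field name true_values is taken from the text):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

# possible actions are 1 to k, where k == length(env.true_values)
ReinforcementLearningBase.action_space(env::MultiArmBanditsEnv) = Base.OneTo(length(env.true_values))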
ReinforcementLearningBase.rewardMethod
Warn

If the env has not been started yet, the returned value is meaningless. The reason we don't throw an exception here is to simplify the code logic and keep type consistency when storing the value in buffers.

ReinforcementLearningBase.stateMethod

Since MultiArmBanditsEnv is just a one-shot game, it doesn't matter what the state is after each action. So here we can simply set it to a constant 1.

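A sketch of the definition implied by the paragraph above:

using ReinforcementLearningBase, ReinforcementLearningEnvironments

ReinforcementLearningBase.state(env::MultiArmBanditsEnv) = 1  # a constant state is enough for a one-shot game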
ReinforcementLearningBase.stateMethod

The main difference compared to other environments is that we now have two kinds of states: the observation and the internal state. By default, the observation is returned.

ReinforcementLearningEnvironments.discrete2standard_discreteMethod
discrete2standard_discrete(env)

Convert an env with a discrete action space to a standard form:

  • The action space is of type Base.OneTo
  • If the env is of FULL_ACTION_SET, then each action in the legal_action_space(env) is also an Int in the action space.

The standard form is useful for some algorithms (like Q-learning).

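Usage sketch (MontyHallEnv is used only as an example of an environment with a FULL_ACTION_SET discrete action space):

using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MontyHallEnv()
senv = discrete2standard_discrete(env)
action_space(senv)        # now a Base.OneTo
legal_action_space(senv)  # legal actions expressed as Ints in that space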