ReinforcementLearningEnvironments.jl
Built-in Environments
| Traits | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActionStyle | MinimalActionSet | ✔ | ✔ | ✔ | | ✔ | | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| | FullActionSet | | | | ✔ | | ✔ | | | | | | | |
| ChanceStyle | Stochastic | ✔ | | ✔ | ✔ | | | | | | ✔ | ✔ | ✔ | ✔ |
| | Deterministic | | ✔ | | | ✔ | ✔ | | | | | | | |
| | ExplicitStochastic | | | | | | | ✔ | ✔ | ✔ | | | | |
| DefaultStateStyle | Observation | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | ✔ | | ✔ | ✔ | ✔ | ✔ |
| | InformationSet | | | | | | | ✔ | | ✔ | | | | |
| DynamicStyle | Simultaneous | | | | | ✔ | | | | | | | | |
| | Sequential | ✔ | ✔ | ✔ | ✔ | | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| InformationStyle | PerfectInformation | | ✔ | | | | ✔ | | ✔ | | | | | |
| | ImperfectInformation | ✔ | | ✔ | ✔ | ✔ | | ✔ | | ✔ | ✔ | ✔ | ✔ | ✔ |
| NumAgentStyle | MultiAgent | | | | | ✔ | ✔ | ✔ | ✔ | ✔ | | | | |
| | SingleAgent | ✔ | ✔ | ✔ | ✔ | | | | | | ✔ | ✔ | ✔ | ✔ |
| RewardStyle | TerminalReward | ✔ | ✔ | | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | | | |
| | StepReward | | | ✔ | | | | | | | ✔ | ✔ | ✔ | ✔ |
| StateStyle | Observation | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | ✔ | | ✔ | ✔ | ✔ | ✔ |
| | InformationSet | | | | | | | ✔ | | ✔ | | | | |
| | InternalState | | | ✔ | | | | | | | | | | |
| UtilityStyle | GeneralSum | ✔ | ✔ | ✔ | ✔ | | | | | | ✔ | ✔ | ✔ | ✔ |
| | ZeroSum | | | | | ✔ | ✔ | | | ✔ | | | | |
| | ConstantSum | | | | | | | | ✔ | | | | | |
| | IdenticalUtility | | | | | | | ✔ | | | | | | |
1. MultiArmBanditsEnv
2. RandomWalk1D
3. TigerProblemEnv
4. MontyHallEnv
5. RockPaperScissorsEnv
6. TicTacToeEnv
7. TinyHanabiEnv
8. PigEnv
9. KuhnPokerEnv
10. AcrobotEnv
11. CartPoleEnv
12. MountainCarEnv
13. PendulumEnv
Note: Many traits are borrowed from OpenSpiel.
Third-Party Environments
| Environment Name | Dependent Package Name | Description |
|---|---|---|
| AtariEnv | ArcadeLearningEnvironment.jl | |
| GymEnv | PyCall.jl | |
| OpenSpielEnv | OpenSpiel.jl | |
| SnakeGameEnv | SnakeGames.jl | SingleAgent/MultiAgent, FullActionSet/MinimalActionSet |
| #list-of-environments | GridWorlds.jl | Environments in this package support the interfaces defined in RLBase |
ReinforcementLearningEnvironments.KUHN_POKER_REWARD_TABLE — Constant

ReinforcementLearningEnvironments.ActionTransformedEnv — Method

ActionTransformedEnv(env; action_space_mapping=identity, action_mapping=identity)

action_space_mapping will be applied to action_space(env) and legal_action_space(env). action_mapping will be applied to action before feeding it into env.
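A minimal sketch of wrapping RandomWalk1D to expose a 0-based action interface. It assumes the wrapped environment's native action space is the 1-based index set over its actions; the two mapping functions are ours, purely for illustration.

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = ActionTransformedEnv(
    RandomWalk1D();
    action_space_mapping = as -> 0:length(as)-1,   # advertise 0-based actions
    action_mapping = a -> a + 1,                   # convert back before stepping the wrapped env
)

action_space(env)   # 0:1 instead of the original 1-based space
env(0)              # forwarded to the wrapped environment as action 1
```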
ReinforcementLearningEnvironments.AtariEnv — Method

AtariEnv(;kwargs...)

This implementation follows the guidelines in Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents.

Keywords

- name::String="pong": name of the Atari environment. Use ReinforcementLearningEnvironments.list_atari_rom_names() to show all supported environments.
- grayscale_obs::Bool=true: if true, a grayscale observation is returned; otherwise, an RGB observation is returned.
- noop_max::Int=30: maximum number of no-ops.
- frame_skip::Int=4: the frequency at which the agent experiences the game.
- terminal_on_life_loss::Bool=false: if true, the game is over whenever a life is lost.
- repeat_action_probability::Float64=0.
- color_averaging::Bool=false: whether to perform phosphor averaging or not.
- max_num_frames_per_episode::Int=0
- full_action_space::Bool=false: by default, only the minimal action set is used. If true, one needs to call legal_actions to get the valid action set. TODO
- seed::Int: used to set the initial seed of the underlying C environment and the rng used by this wrapper environment to initialize the number of no-op steps at the beginning of each episode.
- log_level::Symbol: :info, :warning or :error. The default value is :error.

See also the Python implementation.
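A hypothetical construction sketch. It assumes ArcadeLearningEnvironment.jl (the dependent package listed in the third-party table above) is installed and loaded, which is what makes AtariEnv available; the keyword values are just the documented defaults made explicit.

```julia
# Sketch only: AtariEnv requires the ArcadeLearningEnvironment.jl backend to be loaded.
using ArcadeLearningEnvironment, ReinforcementLearningEnvironments

env = AtariEnv(; name = "pong", grayscale_obs = true, frame_skip = 4)
```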
ReinforcementLearningEnvironments.CartPoleEnv — Method

CartPoleEnv(;kwargs...)

Keyword arguments

- T = Float64
- continuous = false
- rng = Random.GLOBAL_RNG
- gravity = T(9.8)
- masscart = T(1.0)
- masspole = T(0.1)
- halflength = T(0.5)
- forcemag = T(10.0)
- max_steps = 200
- dt = 0.02
- thetathreshold = 12.0 # degrees
- xthreshold = 2.4
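As a quick orientation, here is a minimal sketch of one episode driven by random actions. It assumes the RLBase conventions used throughout these docs (reset!, is_terminated, reward, and stepping by calling the environment with an action); the helper function name is ours.

```julia
using Random
using ReinforcementLearningBase, ReinforcementLearningEnvironments

# Run one episode with uniformly random actions and return the accumulated reward.
function run_random_episode(env)
    reset!(env)
    total = 0.0
    while !is_terminated(env)
        env(rand(action_space(env)))   # step the environment with a random action
        total += reward(env)
    end
    return total
end

run_random_episode(CartPoleEnv(; T = Float32, max_steps = 100))
```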
ReinforcementLearningEnvironments.DefaultStateStyleEnv — Method

DefaultStateStyleEnv{S}(env::E)

Reset the result of DefaultStateStyle without changing the original behavior.
ReinforcementLearningEnvironments.GraphShortestPathEnv — Type

GraphShortestPathEnv([rng]; n=10, sparsity=0.1, max_steps=10)

Quoted from A.3 in the paper Decision Transformer: Reinforcement Learning via Sequence Modeling:

We give details of the illustrative example discussed in the introduction. The task is to find the shortest path on a fixed directed graph, which can be formulated as an MDP where reward is 0 when the agent is at the goal node and −1 otherwise. The observation is the integer index of the graph node the agent is in. The action is the integer index of the graph node to move to next. The transition dynamics transport the agent to the action's node index if there is an edge in the graph, while the agent remains at the past node otherwise. The returns-to-go in this problem correspond to negative path lengths and maximizing them corresponds to generating shortest paths.
ReinforcementLearningEnvironments.KuhnPokerEnv — Method

KuhnPokerEnv()

See a more detailed description here.

Here we demonstrate how to write a typical ZERO_SUM, IMPERFECT_INFORMATION game. The implementation here has an explicit CHANCE_PLAYER.

TODO: add public state for SPECTOR. Ref: https://arxiv.org/abs/1906.11110
ReinforcementLearningEnvironments.MaxTimeoutEnv — Method

MaxTimeoutEnv(env::E, max_t::Int; current_t::Int = 1)

Force is_terminated(env) to return true after max_t interactions.
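A minimal sketch of how this wrapper might be used, assuming the RLBase conventions used elsewhere in these docs:

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MaxTimeoutEnv(CartPoleEnv(), 50)   # cap the episode at 50 interactions
reset!(env)
while !is_terminated(env)                # stops after at most 50 steps
    env(rand(action_space(env)))
end
```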
ReinforcementLearningEnvironments.MontyHallEnv — Method

MontyHallEnv(;rng=Random.GLOBAL_RNG)

Quoted from wiki:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Here we'll introduce the first environment which is of FULL_ACTION_SET.
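A sketch of what FULL_ACTION_SET means in practice: the full action space is fixed, but only a subset of it is legal at any given point, so a policy should sample from legal_action_space(env) rather than action_space(env). (This assumes the usual RLBase API.)

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MontyHallEnv()
reset!(env)
env(rand(legal_action_space(env)))   # first round: pick one of the three doors
legal_action_space(env)              # second round: only doors not opened by the host are legal
```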
ReinforcementLearningEnvironments.MountainCarEnv — Method

MountainCarEnv(;kwargs...)

Keyword arguments

- T = Float64
- continuous = false
- rng = Random.GLOBAL_RNG
- min_pos = -1.2
- max_pos = 0.6
- max_speed = 0.07
- goal_pos = 0.5
- max_steps = 200
- goal_velocity = 0.0
- power = 0.001
- gravity = 0.0025
ReinforcementLearningEnvironments.MultiArmBanditsEnv — Method

In our design, the return value of taking an action in env is undefined. This is the main difference compared to the interfaces defined in OpenAI/Gym. We find that this async manner is more suitable for describing many complicated environments. However, one of the inconveniences is that we have to cache some intermediate data for future queries. Here we have to store reward and is_terminated in the instance of env for future queries.
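The step-then-query pattern described above looks roughly like this (a sketch, assuming the usual RLBase accessors):

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = MultiArmBanditsEnv()
env(rand(action_space(env)))   # stepping itself returns nothing meaningful...
reward(env)                    # ...the cached reward is queried afterwards
is_terminated(env)             # true: this is a one-shot game
```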
ReinforcementLearningEnvironments.MultiArmBanditsEnv — Method

MultiArmBanditsEnv(;true_reward=0., k=10, rng=Random.GLOBAL_RNG)

true_reward is the expected reward. k is the number of arms. See multi-armed bandit for a more detailed explanation.

This is a one-shot game. The environment terminates immediately after taking an action. Here we use it to demonstrate how to write a customized environment with only the minimal interfaces defined.
ReinforcementLearningEnvironments.OpenSpielEnv — Type

OpenSpielEnv(name; state_type=nothing, kwargs...)

Arguments

- name::String, you can call OpenSpiel.registered_names() to see all the supported names. Note that the name can contain parameters, like "goofspiel(imp_info=True,num_cards=4,points_order=descending)". Because the parameters part is parsed by the backend C++ code, the bool variable must be True or False (instead of true or false). Another approach is to just specify the parameters in kwargs in the Julia style.
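A hypothetical usage sketch, assuming OpenSpiel.jl (the dependent package listed above) is installed and loaded; the parameterized name string is taken verbatim from the argument description above.

```julia
using OpenSpiel, ReinforcementLearningEnvironments

env = OpenSpielEnv("kuhn_poker")
# Parameters can be embedded in the name string (parsed by the C++ backend):
env = OpenSpielEnv("goofspiel(imp_info=True,num_cards=4,points_order=descending)")
```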
ReinforcementLearningEnvironments.PendulumEnv — Method

PendulumEnv(;kwargs...)

Keyword arguments

- T = Float64
- max_speed = T(8)
- max_torque = T(2)
- g = T(10)
- m = T(1)
- l = T(1)
- dt = T(0.05)
- max_steps = 200
- continuous::Bool = true
- n_actions::Int = 3
- rng = Random.GLOBAL_RNG
ReinforcementLearningEnvironments.PendulumNonInteractiveEnv — Type

A non-interactive pendulum environment.

Accepts only nothing actions, which result in the system being simulated for one time step. Sets env.done to true once maximum_time is reached. Resets to a random position and momentum. Always returns zero rewards.

Useful for debugging and development purposes, particularly in model-based reinforcement learning.
ReinforcementLearningEnvironments.PendulumNonInteractiveEnv — Method

PendulumNonInteractiveEnv(;kwargs...)

Keyword arguments

- float_type = Float64
- gravity = 9.8
- length = 2.0
- mass = 1.0
- step_size = 0.01
- maximum_time = 10.0
- rng = Random.GLOBAL_RNG
ReinforcementLearningEnvironments.PigEnv — Method

PigEnv(;n_players=2)

See wiki for an explanation of this game.

Here we use it to demonstrate how to write a game with more than 2 players.
ReinforcementLearningEnvironments.RandomWalk1D — Type

RandomWalk1D(;rewards=-1. => 1.0, N=7, start_pos=(N+1) ÷ 2, actions=[-1,1])

An agent is placed at start_pos and can move left or right (the stride is defined in actions). The game terminates when the agent reaches either end and receives the corresponding reward.

Compared to the MultiArmBanditsEnv:

- The state space is more complicated (well, not that complicated though).
- It's a sequential game of multiple action steps.
- It's a deterministic game instead of a stochastic one.
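A minimal sketch of a complete episode driven by random actions (assuming the usual RLBase API):

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = RandomWalk1D(; N = 7)
reset!(env)
while !is_terminated(env)
    env(rand(action_space(env)))   # move randomly left or right
end
state(env), reward(env)            # the end position and the corresponding terminal reward
```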
ReinforcementLearningEnvironments.RewardOverriddenEnv — Type

RewardOverriddenEnv(env, f)

Apply f on env to generate a custom reward.
ReinforcementLearningEnvironments.RewardTransformedEnv — Type

RewardTransformedEnv(env, f)

Apply f on reward(env).
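For example, a common preprocessing step is reward clipping. A sketch, assuming the usual RLBase API:

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = RewardTransformedEnv(CartPoleEnv(), r -> clamp(r, -1, 1))
reset!(env)
env(rand(action_space(env)))
reward(env)   # the clipped reward
```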
ReinforcementLearningEnvironments.RockPaperScissorsEnv — Type

RockPaperScissorsEnv()

Rock Paper Scissors is a simultaneous, zero-sum game.
ReinforcementLearningEnvironments.SequentialEnv — Type

SequentialEnv(env)

Turn a simultaneous env into a sequential env.
ReinforcementLearningEnvironments.SpeakerListenerEnv — Method

SpeakerListenerEnv(;kwargs...)

SpeakerListenerEnv is a simple cooperative environment of two agents, a Speaker and a Listener, who are placed in an environment with N landmarks. In each episode, the Listener must navigate to a particular landmark (env.target) and obtains a reward based on its distance to the target. However, while the Listener can observe the relative positions of the landmarks, it doesn't know which one is the target landmark. Conversely, the Speaker can observe the target landmark, and it can produce a communication output (env.content) at each time step which is observed by the Listener.

For a more concrete description, you can refer to:

Keyword arguments

- N::Int = 3, the number of landmarks in the environment.
- stop = 0.01, when the distance between the Listener and the target is smaller than stop, the game terminates.
- damping = 0.25, for the simulation of the physical space, the Listener's action is subject to damping at each step.
- max_accel = 0.02, the maximum acceleration of the Listener in each step.
- space_dim::Int = 2, the dimension of the environment's space.
- max_steps::Int = 25, the maximum number of steps in one episode.
- continuous::Bool = true, set to false if you want the action space of the players to be discrete; otherwise, the action space will be continuous.
ReinforcementLearningEnvironments.StateCachedEnv — Type

Cache the state so that state(env) will always return the same result before the next interaction with env. This wrapper is useful because some environments are stateful during each call to state(env). For example: StateTransformedEnv(StackFrames(...)).
ReinforcementLearningEnvironments.StateTransformedEnv — Method

StateTransformedEnv(env; state_mapping=identity, state_space_mapping=identity)

state_mapping will be applied to the original state when calling state(env), and similarly state_space_mapping will be applied when calling state_space(env).
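A sketch of a typical use: converting the observation to Float32 before feeding it to a neural network. (This only remaps state(env); a matching state_space_mapping could be supplied in the same way.)

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = StateTransformedEnv(
    CartPoleEnv();
    state_mapping = s -> Float32.(s),   # expose the state as a Float32 vector
)
reset!(env)
state(env)
```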
ReinforcementLearningEnvironments.StockTradingEnv — Method

StockTradingEnv(;kw...)

This environment is originally provided in Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy.

Keyword Arguments

- initial_account_balance=1_000_000.
ReinforcementLearningEnvironments.TicTacToeEnv — Type

This is a typical two-player, zero-sum game. Here we'll also demonstrate how to implement an environment with multiple state representations.

You might be interested in this blog.
ReinforcementLearningEnvironments.TigerProblemEnv — Type

TigerProblemEnv(;rng=Random.GLOBAL_RNG)

Here we use the Tiger Problem to demonstrate how to write a POMDP problem.
ReinforcementLearningEnvironments.TinyHanabiEnv — Method

TinyHanabiEnv()

See https://arxiv.org/abs/1902.00506.
ReinforcementLearningEnvironments.ZeroTo — Type

Similar to Base.OneTo. Useful when wrapping third-party environments.
Random.seed! — Method

The multi-arm bandits environment is a stochastic environment. The resulting reward may be different even after taking the same actions each time. So for this kind of environment, Random.seed!(env) must be implemented to help increase reproducibility without creating a new instance of the same rng.
ReinforcementLearningBase.action_space — Method

First we need to define the action space. In the MultiArmBanditsEnv environment, the possible actions are 1 to k (which equals length(env.true_values)).

Although we decide to return an action space of Base.OneTo here, it is not a hard requirement. You can return anything else (Tuple, Distribution, etc.) that is more suitable for describing your problem, and handle it correctly in the your_env(action) function. Some algorithms may require that the action space be of type Base.OneTo. However, it's the algorithm designer's job to do the checking and conversion.
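For a custom environment, such a definition might look like the sketch below (YourEnv and its field are hypothetical; only the pattern of overloading action_space matters):

```julia
using ReinforcementLearningBase

# A hypothetical environment type with one entry per arm.
struct YourEnv <: AbstractEnv
    true_values::Vector{Float64}
end

ReinforcementLearningBase.action_space(env::YourEnv) = Base.OneTo(length(env.true_values))
```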
ReinforcementLearningBase.current_player — Method

Note that although this is a two-player game, the current player is always a dummy simultaneous player.
ReinforcementLearningBase.legal_action_space — Method

In the first round, the guest has 3 options; in the second round only two options are valid, those different from the host's action.
ReinforcementLearningBase.legal_action_space_mask — Method

For environments of FULL_ACTION_SET, this function must be implemented.
ReinforcementLearningBase.reward — Method

If the env is not started yet, the returned value is meaningless. The reason why we don't throw an exception here is to simplify the code logic and keep type consistency when storing the value in buffers.
ReinforcementLearningBase.state — Method

Since MultiArmBanditsEnv is just a one-shot game, it doesn't matter what the state is after each action. So here we can simply set it to a constant 1.
ReinforcementLearningBase.state — Method

For multi-agent environments, we usually implement the most detailed one.
ReinforcementLearningBase.state — Method

The main difference compared to other environments is that we now have two kinds of states: the observation and the internal state. By default we return the observation.
ReinforcementLearningBase.state_space — Method

Since it's a one-shot game, the state space doesn't have much meaning.
ReinforcementLearningEnvironments.discrete2standard_discrete — Method

discrete2standard_discrete(env)

Convert an env with a discrete action space to a standard form:

- The action space is of type Base.OneTo.
- If the env is of FULL_ACTION_SET, then each action in the legal_action_space(env) is also an Int in the action space.

The standard form is useful for some algorithms (like Q-learning).
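A small sketch of the intended use (assuming the RLBase API; for an environment whose action space is already a Base.OneTo this is essentially a no-op):

```julia
using ReinforcementLearningBase, ReinforcementLearningEnvironments

env = discrete2standard_discrete(PendulumEnv(; continuous = false))
action_space(env)   # a Base.OneTo, convenient for tabular methods such as Q-learning
```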
ReinforcementLearningEnvironments.install_gym — Method

install_gym(; packages = ["gym", "pybullet"])