How to write a customized environment?
The first step to apply algorithms in ReinforcementLearning.jl is to define the problem you want to solve in a recognizable way. Here we'll demonstrate how to write many different kinds of environments based on interfaces defined in ReinforcementLearningBase.jl.
The most commonly used interface for describing reinforcement learning tasks is OpenAI Gym. Inspired by it, we extend that interface a little to take advantage of multiple dispatch in Julia and to cover multi-agent environments.
The Minimal Interfaces to Implement
Many interfaces in ReinforcementLearningBase.jl have a default implementation. So in most cases, you only need to implement the following functions to define a customized environment:
action_space(env::YourEnv)
state(env::YourEnv)
state_space(env::YourEnv)
reward(env::YourEnv)
is_terminated(env::YourEnv)
reset!(env::YourEnv)
act!(env::YourEnv, action)
An Example: The LotteryEnv
Here we use an example introduced in Monte Carlo Tree Search: A Tutorial to demonstrate how to write a simple environment.
The game is defined like this: assume you have $10 in your pocket, and you are faced with the following three choices:
- Buy a PowerRich lottery ticket (win $100M w.p. 0.01; nothing otherwise);
- Buy a MegaHaul lottery ticket (win $1M w.p. 0.05; nothing otherwise);
- Do not buy a lottery ticket.
This is a one-shot game: it terminates immediately after an action is taken and a reward is received. First we define a concrete subtype of AbstractEnv named LotteryEnv:
julia> using ReinforcementLearning
julia> Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
           reward::Union{Nothing, Int} = nothing
       end
Main.LotteryEnv
The LotteryEnv has only one field named reward, which is initialized with nothing by default. Now let's implement the necessary interfaces:
julia> struct LotteryAction{a}
           function LotteryAction(a)
               new{a}()
           end
       end
julia> RLBase.action_space(env::LotteryEnv) = LotteryAction.([:PowerRich, :MegaHaul, nothing])
Here RLBase is just an alias for ReinforcementLearningBase.
julia> RLBase.reward(env::LotteryEnv) = env.reward
julia> RLBase.state(env::LotteryEnv, ::Observation, ::DefaultPlayer) = !isnothing(env.reward)
julia> RLBase.state_space(env::LotteryEnv) = [false, true]
julia> RLBase.is_terminated(env::LotteryEnv) = !isnothing(env.reward)
julia> RLBase.reset!(env::LotteryEnv) = env.reward = nothing
The lottery game is a simple one-shot game. If the reward is nothing, the game has not started yet and we say it is in state false; otherwise the game is terminated and the state is true. So the result of state_space(env) describes the possible states of this environment. Calling reset! on the game simply assigns nothing to reward, meaning that the game is in its initial state again.
The only thing left is to implement the game logic:
julia> function RLBase.act!(x::LotteryEnv, action)
           if action == LotteryAction(:PowerRich)
               x.reward = rand() < 0.01 ? 100_000_000 : -10
           elseif action == LotteryAction(:MegaHaul)
               x.reward = rand() < 0.05 ? 1_000_000 : -10
           elseif action == LotteryAction(nothing)
               x.reward = 0
           else
               @error "unknown action of $action"
           end
       end
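As a quick sanity check on the payoffs encoded in act!, the expected reward of each action can be worked out by hand. The sketch below is plain Julia and does not use the package; the helper name expected_reward is made up for illustration, and the -10 loss is taken from the losing branches of act!.

```julia
# Expected reward of a ticket that pays `prize` with probability `p`
# and otherwise yields the -10 reward used in the losing branches of `act!`.
# `expected_reward` is a hypothetical helper, not part of RLBase.
expected_reward(p, prize; loss = -10) = p * prize + (1 - p) * loss

ev_power_rich = expected_reward(0.01, 100_000_000)  # roughly 999_990.1
ev_mega_haul  = expected_reward(0.05, 1_000_000)    # roughly 49_990.5
ev_no_ticket  = 0.0                                 # not buying always gives 0

println((ev_power_rich, ev_mega_haul, ev_no_ticket))
```

Under these numbers PowerRich has the highest expected reward, even though the vast majority of individual episodes end with a reward of -10.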
Test Your Environment
A method named RLBase.test_runnable! is provided to roll out several simulations and check whether the environment we defined is functional.
julia> env = LotteryEnv()
# LotteryEnv

## Traits

| Trait Type        |                  Value |
|:----------------- | ----------------------:|
| NumAgentStyle     |          SingleAgent() |
| DynamicStyle      |           Sequential() |
| InformationStyle  | ImperfectInformation() |
| ChanceStyle       |           Stochastic() |
| RewardStyle       |           StepReward() |
| UtilityStyle      |           GeneralSum() |
| ActionStyle       |     MinimalActionSet() |
| StateStyle        |     Observation{Any}() |
| DefaultStateStyle |     Observation{Any}() |
| EpisodeStyle      |             Episodic() |

## Is Environment Terminated?

No

## State Space

`Bool[0, 1]`

## Action Space

`Main.LotteryAction[Main.LotteryAction{:PowerRich}(), Main.LotteryAction{:MegaHaul}(), Main.LotteryAction{nothing}()]`

## Current State

```
false
```
julia> RLBase.test_runnable!(env)
Test Summary:                 | Pass  Total  Time
random policy with LotteryEnv | 2000   2000  0.1s
Test.DefaultTestSet("random policy with LotteryEnv", Any[], 2000, false, false, true, 1.736773351501364e9, 1.736773351614487e9, false, "/home/runner/work/ReinforcementLearning.jl/ReinforcementLearning.jl/src/ReinforcementLearningBase/src/base.jl")
It is a simple smell test which works like this:
n_episode = 10
for _ in 1:n_episode
    reset!(env)
    while !is_terminated(env)
        action = rand(action_space(env))
        act!(env, action)
    end
end
One step further is to test that other components in ReinforcementLearning.jl also work. Similar to the test above, let's try the RandomPolicy first:
julia> run(RandomPolicy(action_space(env)), env, StopAfterNEpisodes(1_000))
EmptyHook()
If no error shows up, then it means our environment at least works with the RandomPolicy 🎉🎉🎉. Next, we can add a hook to collect the reward in each episode to see the performance of the RandomPolicy.
julia> hook = TotalRewardPerEpisode()
TotalRewardPerEpisode{Val{true}, Float64}(Float64[], 0.0, true)
julia> run(RandomPolicy(action_space(env)), env, StopAfterNEpisodes(1_000), hook)
TotalRewardPerEpisode{Val{true}, Float64}([0.0, -10.0, 0.0, 0.0, -10.0, -10.0, -10.0, 0.0, 0.0, -10.0 … 1.0e8, -10.0, 0.0, -10.0, -10.0, 0.0, 0.0, -10.0, 0.0, -10.0], 0.0, true)
julia> using Plots
julia> plot(hook.rewards)
Plot{Plots.PyPlotBackend() n=1}
Add an Environment Wrapper
Now suppose we'd like to use a tabular Monte Carlo method to estimate the state-action values.
julia> p = QBasedPolicy(
           learner = TDLearner(
               TabularQApproximator(
                   n_state = length(state_space(env)),
                   n_action = length(action_space(env)),
               ), :SARS
           ),
           explorer = EpsilonGreedyExplorer(0.1)
       )
QBasedPolicy(
  TDLearner(
    TabularQApproximator{Matrix{Float64}}([0.0 0.0; 0.0 0.0; 0.0 0.0]), # 6 parameters (all zero)
    1.0,
    0.01,
    0,
  ),
)
julia> plan!(p, env)
ERROR: MethodError: no method matching forward(::TDLearner{:SARS, TabularQApproximator{Matrix{Float64}}}, ::Bool)
The function `forward` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  forward(::TDLearner, ::Int64, ::Int64)
   @ ReinforcementLearningCore ~/work/ReinforcementLearning.jl/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:37
  forward(::TDLearner, ::Int64)
   @ ReinforcementLearningCore ~/work/ReinforcementLearning.jl/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:36
  forward(::L, ::E) where {L<:AbstractLearner, E<:AbstractEnv}
   @ ReinforcementLearningCore ~/work/ReinforcementLearning.jl/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/abstract_learner.jl:15
  ...
Oops, we get an error here. So what does it mean?
Before answering this question, let's spend some time understanding the policy we defined above. A QBasedPolicy contains two parts: a learner and an explorer. The learner learns the state-action value function (aka Q function) during interactions with the env. The explorer is used to select an action based on the Q values returned by the learner. Inside the TDLearner, a TabularQApproximator is used to estimate the Q values.
That's the problem! A TabularQApproximator only accepts states of type Int.
julia> RLCore.forward(p.learner.approximator, 1, 1) # Q(s, a)
0.0
julia> RLCore.forward(p.learner.approximator, 1) # [Q(s, a) for a in action_space(env)]
3-element view(::Matrix{Float64}, :, 1) with eltype Float64:
 0.0
 0.0
 0.0
julia> RLCore.forward(p.learner.approximator, false)
ERROR: ArgumentError: invalid index: false of type Bool
OK, now we know where the problem is. But how do we fix it?
An initial idea is to rewrite the RLBase.state(env::LotteryEnv, ::Observation, ::DefaultPlayer) function to force it to return an Int. That's workable. But in some cases, we may be using environments written by others, and it's not easy to modify their code directly. Fortunately, some environment wrappers are provided to help us transform the environment.
julia> wrapped_env = ActionTransformedEnv(
           StateTransformedEnv(
               env;
               state_mapping=s -> s ? 1 : 2,
               state_space_mapping = _ -> Base.OneTo(2)
           );
           action_mapping = i -> action_space(env)[i],
           action_space_mapping = _ -> Base.OneTo(3),
       )
# LotteryEnv |> StateTransformedEnv |> ActionTransformedEnv

## Traits

| Trait Type        |                  Value |
|:----------------- | ----------------------:|
| NumAgentStyle     |          SingleAgent() |
| DynamicStyle      |           Sequential() |
| InformationStyle  | ImperfectInformation() |
| ChanceStyle       |           Stochastic() |
| RewardStyle       |           StepReward() |
| UtilityStyle      |           GeneralSum() |
| ActionStyle       |     MinimalActionSet() |
| StateStyle        |     Observation{Any}() |
| DefaultStateStyle |     Observation{Any}() |
| EpisodeStyle      |             Episodic() |

## Is Environment Terminated?

Yes

## State Space

`Base.OneTo(2)`

## Action Space

`Base.OneTo(3)`

## Current State

```
1
```
julia> plan!(p, wrapped_env)
1
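To see what the two mappings do, here is a package-free sketch in plain Julia. The matrix Q is only a stand-in for the table inside the TabularQApproximator (3 actions by 2 states, as printed earlier), and symbols stand in for the LotteryAction values; none of these names come from the package.

```julia
# Stand-in for the Q table: one row per action, one column per state.
Q = zeros(3, 2)

# Symbols (and `nothing`) stand in for the LotteryAction values.
raw_actions = [:PowerRich, :MegaHaul, nothing]

state_mapping  = s -> s ? 1 : 2        # Bool state -> integer column index
action_mapping = i -> raw_actions[i]   # integer row index -> original action

# Indexing a matrix with a raw Bool state throws an ArgumentError,
# which is the same kind of failure as the one shown above.
bool_index_fails = try
    Q[1, false]
    false
catch err
    err isa ArgumentError
end

s = state_mapping(false)       # the not-yet-terminated state maps to 2
q_values = Q[:, s]             # one Q value per action
best = argmax(q_values)        # ties resolve to the first index
chosen = action_mapping(best)  # translate the index back to an action
```

The wrappers apply the same idea: the policy only ever sees integers, while the wrapped environment keeps its original state and action types.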
Nice job! Now we are ready to run the experiment:
julia> h = TotalRewardPerEpisode()
TotalRewardPerEpisode{Val{true}, Float64}(Float64[], 0.0, true)
julia> run(p, wrapped_env, StopAfterNEpisodes(1_000), h)
TotalRewardPerEpisode{Val{true}, Float64}([-10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0 … -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0], 0.0, true)
julia> plot(h.rewards)
Plot{Plots.PyPlotBackend() n=1}
If you are observant enough, you'll find that our policy is not updating at all! Actually, it's running in the actor mode. To update the policy, remember to wrap it in an Agent.
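To make concrete what "updating" would mean for this game, the following package-free sketch applies an incremental tabular update to the one-shot lottery. It is not the Agent API of ReinforcementLearning.jl; the names play and learn are made up for illustration, and the step size and exploration rate are assumptions.

```julia
using Random

# Package-free sketch. Actions: 1 = PowerRich, 2 = MegaHaul, 3 = no ticket,
# with the same payoffs as in `act!`.
function play(rng, a)
    if a == 1
        return rand(rng) < 0.01 ? 100_000_000.0 : -10.0
    elseif a == 2
        return rand(rng) < 0.05 ? 1_000_000.0 : -10.0
    else
        return 0.0
    end
end

# Epsilon-greedy selection plus the update Q[a] += alpha * (r - Q[a]).
# `alpha` and `epsilon` are illustrative defaults, not package settings.
function learn(rng; n_episodes = 1_000, alpha = 0.01, epsilon = 0.1)
    Q = zeros(3)
    for _ in 1:n_episodes
        a = rand(rng) < epsilon ? rand(rng, 1:3) : argmax(Q)
        r = play(rng, a)
        Q[a] += alpha * (r - Q[a])   # move the estimate toward the observed reward
    end
    return Q
end

Q = learn(MersenneTwister(0))
```

Unlike the actor-mode run, the table Q here changes after every episode in which the reward differs from the current estimate.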
More Complicated Environments
The above LotteryEnv is quite simple. Many environments we are interested in fall into the same category. Beyond that, there are still many other kinds of environments. You may take a glimpse at the Built-in Environments to see how many different types of environments are supported.
To distinguish different kinds of environments, some common traits are defined in ReinforcementLearningBase.jl. Now let's explain them one by one.
StateStyle
In the above LotteryEnv, state(env::LotteryEnv) simply returns a boolean. But in some other environments, the name state may be rather vague. People from different backgrounds often talk about the same thing under different names. You may be interested in this discussion: What is the difference between an observation and a state in reinforcement learning? To avoid confusion when executing state(env), the environment designer can explicitly define state(env::YourEnv, ::AbstractStateStyle), so that users can fetch the necessary information on demand. The following are some built-in state styles:
julia> using InteractiveUtils
julia> subtypes(RLBase.AbstractStateStyle)
4-element Vector{Any}:
 GoalState
 InformationSet
 InternalState
 Observation
Note that every state style may have many different representations, such as String, Array, Graph and so on. All the above state styles accept a data type as a parameter. For example:
julia> RLBase.state(env::LotteryEnv, ::Observation{String}, ::DefaultPlayer) = is_terminated(env) ? "Game Over" : "Game Start"
For environments which support many different kinds of states, developers should specify all the supported state styles. For example:
julia> tp = TigerProblemEnv();
julia> StateStyle(tp)
(Observation{Int64}(), InternalState{Int64}())
julia> state(tp, Observation{Int64}())
1
julia> state(tp, InternalState{Int64}())
2
julia> state(tp)
1
DefaultStateStyle
The DefaultStateStyle trait returns the first element in the result of StateStyle by default.
Algorithm developers usually don't care about the state style. They can assume that the default state style is always well defined and simply call state(env) to get the right representation. So for environments with many different representations, state(env) is dispatched to state(env, DefaultStateStyle(env)). And we can use the DefaultStateStyleEnv wrapper to override the pre-defined DefaultStateStyle(::YourEnv).
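The dispatch pattern described here can be imitated in a few lines of plain Julia. The sketch below uses stand-in names (ToyObservation, ToyInternalState, toy_state and so on) that are not the types or functions of ReinforcementLearningBase.jl; it only illustrates how a default style can be picked as the first of the supported styles.

```julia
# Toy re-creation of the pattern; every name here is hypothetical.
abstract type ToyStateStyle end
struct ToyObservation{T} <: ToyStateStyle end
struct ToyInternalState{T} <: ToyStateStyle end

struct ToyEnv
    hidden::Int   # what the environment knows internally
end

# All supported styles; the first one acts as the default.
toy_state_style(::ToyEnv) = (ToyObservation{Int}(), ToyInternalState{Int}())
toy_default_state_style(env::ToyEnv) = first(toy_state_style(env))

# One method per style, plus a convenience method that uses the default.
toy_state(env::ToyEnv, ::ToyObservation{Int}) = 1
toy_state(env::ToyEnv, ::ToyInternalState{Int}) = env.hidden
toy_state(env::ToyEnv) = toy_state(env, toy_default_state_style(env))

toy_env = ToyEnv(2)
default_view  = toy_state(toy_env)                           # uses ToyObservation{Int}
internal_view = toy_state(toy_env, ToyInternalState{Int}())  # explicit style
```

An algorithm that only calls the one-argument method never needs to know which styles a particular environment supports.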
RewardStyle
For games like Chess, Go or many card games, we only get the reward at the end of a game. We say this kind of game is of TerminalReward; otherwise we define it as StepReward. Actually, TerminalReward is a special case of StepReward (for non-terminal steps, the reward is 0). The reason we still want to distinguish these two cases is that, for some algorithms, there may be a more efficient implementation for TerminalReward style games.
julia> RewardStyle(tp)
StepReward()
julia> RewardStyle(MontyHallEnv())
TerminalReward()
ActionStyle
For some environments, the valid actions may differ from step to step. We say this kind of environment is of FullActionSet. Otherwise, we say the environment is of MinimalActionSet. A typical built-in environment with FullActionSet is the TicTacToeEnv. Two extra methods must be implemented:
julia> ttt = TicTacToeEnv();
julia> ActionStyle(ttt)
FullActionSet()
julia> legal_action_space(ttt)
9-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
 7
 8
 9
julia> legal_action_space_mask(ttt)
9-element BitVector:
 1
 1
 1
 1
 1
 1
 1
 1
 1
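The two methods carry the same information in different forms. The plain-Julia sketch below shows how a mask relates to the list of legal actions and how it might be used to restrict a greedy choice. The mid-game mask and the Q values are made up for illustration, and nothing here calls the package.

```julia
using Random

# A made-up mid-game mask: cells 1, 5 and 9 are already taken.
mask = Bool[0, 1, 1, 1, 0, 1, 1, 1, 0]

# The legal actions are the indices where the mask is true.
legal = findall(mask)

# Sample uniformly among the legal actions only.
rng = MersenneTwister(42)
action = rand(rng, legal)

# Mask out illegal actions so they can never be the greedy choice.
q = [0.9, 0.1, 0.2, 0.3, 0.8, 0.0, 0.4, 0.5, 0.7]   # hypothetical Q values
masked_q = ifelse.(mask, q, -Inf)
greedy = argmax(masked_q)
```

Here the highest raw value belongs to cell 1, which is illegal, so the masked greedy choice falls on the best remaining legal cell.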
For some simple environments, we can simply use a Tuple or a Vector to describe the action space. Sometimes, the action space is not easy to describe with built-in data structures. In that case, you can define a customized one with the following interfaces implemented:
Base.in
Random.rand
For example, to define an action space on the N-dimensional simplex:
julia> using Random
julia> struct SimplexSpace
           n::Int
       end
julia> function Base.in(x::AbstractVector, s::SimplexSpace)
           length(x) == s.n && all(>=(0), x) && isapprox(1, sum(x))
       end
julia> function Random.rand(rng::AbstractRNG, s::SimplexSpace)
           x = rand(rng, s.n)
           x ./= sum(x)
           x
       end
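A short usage check of this space might look as follows. The definitions are repeated so that the snippet runs on its own; the seed and the test vectors are arbitrary choices made for illustration.

```julia
using Random

# Same definitions as in the text, repeated so the snippet is self-contained.
struct SimplexSpace
    n::Int
end

function Base.in(x::AbstractVector, s::SimplexSpace)
    length(x) == s.n && all(>=(0), x) && isapprox(1, sum(x))
end

function Random.rand(rng::AbstractRNG, s::SimplexSpace)
    x = rand(rng, s.n)
    x ./= sum(x)   # normalize so the entries sum to 1
    x
end

space = SimplexSpace(3)
rng = MersenneTwister(123)
x = rand(rng, space)

sample_ok   = x in space                 # sampled points lie on the simplex
bad_sum     = [0.5, 0.5, 0.5] in space   # false: entries do not sum to 1
bad_length  = [1.0, 0.0] in space        # false: wrong dimension
bad_negative = [1.5, -0.5, 0.0] in space # false: negative entry
```

Note that normalizing uniform draws gives valid points on the simplex but not a uniform distribution over it, which is usually fine for a smoke test.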
NumAgentStyle
In the above LotteryEnv, only one player is involved in the environment. In many board games, usually multiple players are engaged.
julia> NumAgentStyle(env)
SingleAgent()
julia> NumAgentStyle(ttt)
MultiAgent{2}()
For multi-agent environments, some new APIs are introduced. The meaning of some APIs we've seen is also extended. First, multi-agent environment developers must implement players to distinguish different players.
julia> players(ttt)
(Player(:Cross), Player(:Nought))
julia> current_player(ttt)
Player(:Cross)
| Single Agent | Multi-Agent |
|---|---|
| state(env) | state(env, player) |
| reward(env) | reward(env, player) |
| act!(env, action) | act!(env, action, player) |
| action_space(env) | action_space(env, player) |
| state_space(env) | state_space(env, player) |
| is_terminated(env) | is_terminated(env, player) |
Note that the single-agent APIs are still valid; they simply fall back to the perspective of the current_player(env).
UtilityStyle
In multi-agent environments, sometimes the sum of the rewards of all players is always 0. We call the UtilityStyle of these environments ZeroSum. ZeroSum is a special case of ConstantSum. In cooperative games, the rewards of all players are the same. In this case, the environments are called IdenticalUtility. Other cases fall back to GeneralSum.
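The four cases can be told apart by looking at the reward vectors of a game's outcomes. The sketch below is plain Julia; describe_utility is a hypothetical helper that returns symbols as labels, not the trait instances defined in the package, and it only inspects the outcomes it is given.

```julia
# Hypothetical helper (not part of RLBase): `outcomes` is a collection of
# reward tuples, one entry per player, one tuple per possible outcome.
function describe_utility(outcomes)
    sums = [sum(r) for r in outcomes]
    if all(r -> all(==(first(r)), r), outcomes)
        return :IdenticalUtility   # every player always gets the same reward
    elseif all(iszero, sums)
        return :ZeroSum            # rewards always cancel out
    elseif all(==(first(sums)), sums)
        return :ConstantSum        # rewards always add up to the same total
    else
        return :GeneralSum
    end
end

zero_sum  = describe_utility([(1, -1), (-1, 1), (0, 0)])
identical = describe_utility([(1, 1), (0, 0)])
constant  = describe_utility([(1, 2), (2, 1)])
general   = describe_utility([(1, 2), (0, 0)])
```

The order of the checks matters: an outcome list of all-zero rewards satisfies several descriptions at once, and this sketch reports the first that matches.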
InformationStyle
If all players can see the same state, then we say the InformationStyle of the environment is PerfectInformation. This is a special case of ImperfectInformation environments.
DynamicStyle
All the environments we've seen so far were of Sequential style, meaning that at each step, only ONE player was allowed to take an action. Alternatively there are Simultaneous environments, where all the players take actions simultaneously without seeing each other's action in advance. Simultaneous environments must take a collection of actions from different players as input.
julia> rps = RockPaperScissorsEnv();
julia> action_space(rps)
(('💎', '💎'), ('💎', '📃'), ('💎', '✂'), ('📃', '💎'), ('📃', '📃'), ('📃', '✂'), ('✂', '💎'), ('✂', '📃'), ('✂', '✂'))
julia> action = plan!(RandomPolicy(), rps)
('✂', '📃')
julia> act!(rps, action)
true
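The joint action space printed above is the set of all pairs of individual moves. A plain-Julia sketch of building and sampling such a space by hand is shown below; symbols stand in for the emoji used by RockPaperScissorsEnv, and the variable names are chosen for illustration.

```julia
using Random

# Individual moves of each player; symbols stand in for the emoji.
moves = (:rock, :paper, :scissors)

# All pairs (move of player 1, move of player 2). The first player's move
# varies slowest, matching the order printed by action_space(rps).
joint_actions = Tuple((a, b) for a in moves for b in moves)

# A simultaneous step takes one joint action, i.e. one move per player.
rng = MersenneTwister(1)
joint = rand(rng, collect(joint_actions))
move_p1, move_p2 = joint
```

With three moves per player there are nine joint actions, which is why the action space above has nine entries.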
ChanceStyle
If there's no rng in the environment and everything is deterministic after each action is taken, we say the ChanceStyle of the environment is Deterministic. Otherwise, we call it Stochastic, which is the default return value. One special case is Extensive Form Games, in which a chance node is involved and the action probabilities of this special player are known. We define the ChanceStyle of these environments as EXPLICIT_STOCHASTIC. For these environments, we need to have the following methods defined:
julia> kp = KuhnPokerEnv();
julia> chance_player(kp)
ChancePlayer()
julia> prob(kp, chance_player(kp))
3-element Vector{Float64}:
 0.3333333333333333
 0.3333333333333333
 0.3333333333333333
julia> chance_player(kp) in players(kp)
true
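Knowing the chance player's probabilities explicitly means an algorithm can either enumerate the chance outcomes or sample from them. The sketch below shows the sampling side in plain Julia; sample_chance_action is a hypothetical helper, not a package function, and the probability vector mirrors the one returned by prob(kp, chance_player(kp)).

```julia
using Random

# Sample an index from an explicit probability vector `p`
# (inverse-CDF sampling over the cumulative sums).
function sample_chance_action(rng::AbstractRNG, p::AbstractVector{<:Real})
    u = rand(rng)                      # uniform in [0, 1)
    c = cumsum(p)
    idx = findfirst(>(u), c)           # first bucket whose upper edge exceeds u
    return something(idx, length(p))   # guard against rounding in cumsum
end

rng = MersenneTwister(7)
p = fill(1 / 3, 3)
draws = [sample_chance_action(rng, p) for _ in 1:1_000]

# A degenerate distribution always yields the only possible action.
certain = sample_chance_action(rng, [0.0, 1.0, 0.0])
```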
To explicitly specify the chance style of your custom environment, you can provide a specific dispatch of ChanceStyle for your custom environment.
Examples
Finally we've gone through all the details you need to know for how to write a customized environment. You're encouraged to take a look at the examples provided in ReinforcementLearningEnvironments.jl. Feel free to create an issue there if you're still not sure how to describe your problem with the interfaces defined in this package.