Episodic vs Non-episodic environments
Episodic environments
By default, run(policy, env, stop_condition, hook) will step through env until a terminal state is reached, signaling the end of an episode. To be able to do so, env must implement the RLBase.is_terminated(::YourEnvironment) function. This function is called after each step through the environment; when it returns true, the trajectory records the terminal state, then the RLBase.reset!(::YourEnvironment) function is called and the environment is set to (one of) its initial state(s).
Using this mechanism, the value of the terminal state is set to 0 when learning state values via bootstrapping.
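For concreteness, here is a minimal sketch of these two functions for a hypothetical MyEpisodicEnv type (a complete environment would also implement state, state_space, action_space, reward, and the action step, which are omitted here):

using ReinforcementLearning

# Hypothetical environment that terminates after 100 steps.
mutable struct MyEpisodicEnv <: AbstractEnv
    step::Int
end

# Called after each step; returning true signals the end of an episode.
RLBase.is_terminated(env::MyEpisodicEnv) = env.step >= 100

# Called after termination to put the environment back in its initial state.
RLBase.reset!(env::MyEpisodicEnv) = (env.step = 0)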
Non-episodic environments
Also called continuing tasks (Sutton & Barto, 2018), non-episodic environments do not have a terminal state and thus may run forever, or until the stop_condition is reached. Sometimes, however, one may want to periodically reset the environment to start fresh. A first possibility is to implement RLBase.is_terminated(::YourEnvironment) to reset according to an arbitrary condition. However, this may not be a good idea because the value of the last state (note that it is not a terminal state) will be bootstrapped to 0 during learning, even though that is not its true value.
To manage this, we provide the ResetAfterNSteps(n) condition as an argument to run(policy, env, stop_condition, hook, reset_condition = ResetIfEnvTerminated()). The default ResetIfEnvTerminated() assumes an episodic environment; changing it to ResetAfterNSteps(n) will no longer check is_terminated but will instead call reset! every n steps. This way, the value of the last state will not be multiplied by 0 during bootstrapping and the correct value can be learned.
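For example, assuming policy, env, and hook are already defined for a continuing task, a sketch of such a run (here using StopAfterNSteps as the stop condition) looks like:

run(policy, env, StopAfterNSteps(100_000), hook, ResetAfterNSteps(1000))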
Custom reset conditions
You can specify a custom reset_condition instead of using the built-in ones. Your condition must implement the method RLCore.check!(my_condition, policy, env). For example, here is how to implement a custom condition that checks for a terminal state but also resets if the episode runs too long:
using ReinforcementLearning
import ReinforcementLearning: RLCore

# Step-counting condition, stored in a global variable for now.
reset_n_steps = ResetAfterNSteps(10000)

struct MyCondition <: AbstractResetCondition end

# Reset when a terminal state is reached or after 10000 steps, whichever comes first.
function RLCore.check!(my_condition::MyCondition, policy, env)
    terminal = is_terminated(env)
    too_long = RLCore.check!(reset_n_steps, policy, env)
    return terminal || too_long
end
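We can then pass an instance of MyCondition as the reset condition of run: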
env = RandomWalk1D()
agent = RandomPolicy()
stop_condition = StopIfEnvTerminated()
hook = EmptyHook()
run(agent, env, stop_condition, hook, MyCondition())
To avoid the global reset_n_steps, we can instead store the step-counting condition as a field of the struct.
mutable struct MyCondition1 <: AbstractResetCondition
    reset_after  # e.g. a ResetAfterNSteps instance
end

# Reset on a terminal state or when the wrapped condition triggers.
RLCore.check!(c::MyCondition1, policy, env) = is_terminated(env) || RLCore.check!(c.reset_after, policy, env)

run(agent, env, stop_condition, hook, MyCondition1(ResetAfterNSteps(10000)))