v0.6
around 2017. (I'm not a serious RL researcher, so correct me if I'm wrong.)
(I'm not a native English speaker, so interrupt me if I don't make myself clear.)
TabularReinforcementLearning.jl
at first. Far from perfect.
(Most of them will be addressed in the next release this summer!)
The standard setting for reinforcement learning consists of two parts:
env |> policy |> env
More specifically:
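Conceptually, the environment exposes its state to the policy, the policy returns an action, and the environment transitions with that action. A minimal conceptual sketch of one episode, using the same functions we will rely on below (is_terminated, policy(env), env(action)):

# One episode of the env |> policy |> env loop (conceptual sketch)
while !is_terminated(env)
    action = policy(env)  # the policy observes the environment and picks an action
    env(action)           # the environment transitions with that action
end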
Let's see a more concrete example:
]activate .
]st
using ReinforcementLearning
using StableRNGs
using Flux
using Flux.Losses
using IntervalSets
using CUDA
using Plots
env = RandomWalk1D();
state(env)
reward(env) # though it is meaningless since we haven't applied any action yet
action_space(env)
env(2)
state(env)
Actually, the policy can be an arbitrary callable object that takes in an environment and returns an action.
policy = RandomPolicy(action_space(env))
[policy(env) for _ in 1:10]
Now we have both a policy and an environment.
run(policy, env)
It seems nothing happened???
state(env), reward(env), is_terminated(env)
@which run(policy, env) # run until the end of an episode
run(policy, env, StopAfterNEpisodes(10))
Well, in that case, you need to implement your own customized stop condition. Several common ones are already provided in RL.jl, like:
StopAfterNSteps
StopAfterNEpisodes
StopAfterNSeconds
How to implement a customized stop condition is out of scope today. In short, you only need to make (condition::YourCondition)(agent, env) return true when you want to stop the run function, as in the sketch below.
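For example, a minimal sketch of a hypothetical custom stop condition (StopWhenRewardBelow is made up for illustration, not part of RL.jl):

# Stop the run as soon as the latest reward drops below a threshold.
struct StopWhenRewardBelow
    threshold::Float64
end

(s::StopWhenRewardBelow)(agent, env) = reward(env) < s.threshold

# usage: run(policy, env, StopWhenRewardBelow(-0.5), hook)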
More specifically, how do we collect information during a run, for example the total reward of each episode? To answer such questions, many hooks are provided in RL.jl.
run(policy, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())
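The hook collects data during the run, so if we keep a reference to it we can inspect the result afterwards (assuming TotalRewardPerEpisode stores the per-episode totals in its rewards field, as in recent RL.jl releases):

hook = TotalRewardPerEpisode()
run(policy, env, StopAfterNEpisodes(10), hook)
hook.rewards  # the total reward of each of the 10 episodes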
In most cases, when you execute run(policy, env, stop_condition, hook), the policy runs in actor mode (some would call it test mode), because the policy won't try to optimize itself. (As in the picture above, the data from the environment flows into the policy directly.)
Let's see another example:
S, A = state_space(env), action_space(env)
NS, NA = length(S), length(A)
policy = QBasedPolicy(
    learner = MonteCarloLearner(;
        approximator = TabularQApproximator(;
            n_state = NS,
            n_action = NA,
            opt = InvDecay(1.0)
        )
    ),
    explorer = EpsilonGreedyExplorer(0.1)
)
policy(env)
run(
    policy,
    RandomWalk1D(),
    StopAfterNEpisodes(10),
    TotalRewardPerEpisode()
)
[policy.learner(s) for s in state_space(env)] # the Q-value estimation for each state
Remember that our explorer is an EpsilonGreedyExplorer(0.1), which will select the turn-left action (the first of the actions with the maximum Q-value) with a probability of 0.95 here.
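The 0.95 comes from the usual ϵ-greedy decomposition: with probability 1 - ϵ the greedy action is chosen directly, and with probability ϵ an action is drawn uniformly at random, which may also happen to be the greedy one:

ϵ = 0.1
n_actions = 2
1 - ϵ + ϵ / n_actions  # 0.95, the probability of selecting the greedy (turn-left) action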
To optimize the policy, a special policy wrapper is provided in RL.jl: the Agent.
agent = Agent(
    policy = policy,
    trajectory = VectorSARTTrajectory()
);
The data flow is now changed.
run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())
Let's take a look into the total length of each episode:
hook = StepsPerEpisode()
run(agent, env, StopAfterNEpisodes(10), hook)
plot(hook.steps[1:end-1])
Obviously, the optimal policy only needs 3 steps (the agent starts in the middle of the 7-state chain, 3 steps away from either end).
agent
hook = StepsPerEpisode()
run(
    QBasedPolicy(
        learner = policy.learner,
        explorer = GreedyExplorer()
    ),
    env,
    StopAfterNEpisodes(10),
    hook
)
plot(hook.steps[1:end-1])
Any questions so far?
To apply RL.jl to real-world problems:
]dev ../Trajectories.jl
using Trajectories
t = Trajectories.CircularArraySARTSTraces(;capacity=10)
push!(t; state=4, action=1)
t
push!(t; reward=0., terminal=false, state=3, action=1)
t
push!(t; reward=0., terminal=false, state=2, action=1)
t
push!(t; reward=-1., terminal=true, state=1, action=1) # this action is a meaningless placeholder, since the episode has terminated
t
Note that all these behaviors are handled by the Agent. Although (state, action, reward, terminal) (aka SART here) is the most common combination, each algorithm may need to store different information. You can always create your own traces.
Thanks to Henri Dehaybe, who just wrote up how to implement a new algorithm, I won't cover it here.
env = CartPoleEnv();
plot(env)
ns, na = length(state(env)), length(action_space(env))
rng = StableRNG(123)
policy = Agent(
    policy = QBasedPolicy(
        learner = BasicDQNLearner(
            approximator = NeuralNetworkApproximator(
                model = Chain(
                    Dense(ns, 128, relu; init = glorot_uniform(rng)),
                    Dense(128, 128, relu; init = glorot_uniform(rng)),
                    Dense(128, na; init = glorot_uniform(rng)),
                ) |> cpu,
                optimizer = Adam(),
            ),
            batchsize = 32,
            min_replay_history = 100,
            loss_func = huber_loss,
            rng = rng,
        ),
        explorer = EpsilonGreedyExplorer(
            kind = :exp,
            ϵ_stable = 0.01,
            decay_steps = 500,
            rng = rng,
        ),
    ),
    trajectory = CircularArraySARTTrajectory(
        capacity = 1000,
        state = Vector{Float32} => (ns,),
    ),
);
stop_condition = StopAfterNSteps(10_000)
hook = TotalRewardPerEpisode()
run(policy, env, stop_condition, hook)
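Since the hook is again a TotalRewardPerEpisode, we can plot the learning curve afterwards (again assuming the rewards field):

plot(hook.rewards, xlabel = "episode", ylabel = "total reward")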
rng = StableRNG(123)
N_ENV = 8
UPDATE_FREQ = 32
env = MultiThreadEnv([
    CartPoleEnv(; T = Float32, rng = StableRNG(hash(123 + i))) for i in 1:N_ENV
]);
RLBase.reset!(env, is_force = true)
agent = Agent(
    policy = PPOPolicy(
        approximator = ActorCritic(
            actor = Chain(
                Dense(ns, 256, relu; init = glorot_uniform(rng)),
                Dense(256, na; init = glorot_uniform(rng)),
            ),
            critic = Chain(
                Dense(ns, 256, relu; init = glorot_uniform(rng)),
                Dense(256, 1; init = glorot_uniform(rng)),
            ),
            optimizer = Adam(1e-3),
        ) |> cpu,
        γ = 0.99f0,
        λ = 0.95f0,
        clip_range = 0.1f0,
        max_grad_norm = 0.5f0,
        n_epochs = 4,
        n_microbatches = 4,
        actor_loss_weight = 1.0f0,
        critic_loss_weight = 0.5f0,
        entropy_loss_weight = 0.001f0,
        update_freq = UPDATE_FREQ,
    ),
    trajectory = PPOTrajectory(;
        capacity = UPDATE_FREQ,
        state = Matrix{Float32} => (ns, N_ENV),
        action = Vector{Int} => (N_ENV,),
        action_log_prob = Vector{Float32} => (N_ENV,),
        reward = Vector{Float32} => (N_ENV,),
        terminal = Vector{Bool} => (N_ENV,),
    ),
);
Note the types of the policy and the trajectory.
stop_condition = StopAfterNSteps(10_000)
hook = TotalBatchRewardPerEpisode(N_ENV)
run(agent, env, stop_condition, hook)
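Here the hook tracks one curve per environment; assuming TotalBatchRewardPerEpisode stores them as a vector of per-environment vectors in its rewards field, each one can be plotted as a separate series:

plot(hook.rewards, legend = false, xlabel = "episode", ylabel = "total reward")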
PPO is one of the most commonly used and also studied algorithms.
The 37 Implementation Details of Proximal Policy Optimization
Since RL.jl is very flexible, most of them are pretty straightforward to implement.
For most simple environments, the interfaces defined in CommonRLInterface.jl would be enough.
reset!(env) # returns nothing
actions(env) # returns the set of all possible actions for the environment
observe(env) # returns an observation
act!(env, a) # steps the environment forward and returns a reward
terminated(env) # returns true or false indicating whether the environment has finished
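As a sketch, a tiny hand-rolled environment written against these five functions might look like this (the environment itself is made up for illustration, and I assume environments subtype CommonRLInterface.AbstractEnv):

using CommonRLInterface
const CRL = CommonRLInterface

# A toy environment: the episode lasts 3 steps; action 1 gives reward 1, action 2 gives 0.
mutable struct CountToThreeEnv <: CRL.AbstractEnv
    t::Int
end

CRL.reset!(env::CountToThreeEnv) = (env.t = 0; nothing)
CRL.actions(env::CountToThreeEnv) = (1, 2)
CRL.observe(env::CountToThreeEnv) = env.t
CRL.act!(env::CountToThreeEnv, a) = (env.t += 1; a == 1 ? 1.0 : 0.0)
CRL.terminated(env::CountToThreeEnv) = env.t >= 3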
Or, if you prefer the interfaces defined in RLBase, you can find a lot of examples in RLEnvs.
using Random

mutable struct MultiArmBanditsEnv <: AbstractEnv
    true_reward::Float64
    true_values::Vector{Float64}
    rng::AbstractRNG
    reward::Float64
    is_terminated::Bool
end

function MultiArmBanditsEnv(; true_reward = 0.0, k = 10, rng = Random.GLOBAL_RNG)
    true_values = true_reward .+ randn(rng, k)
    MultiArmBanditsEnv(true_reward, true_values, rng, 0.0, false)
end

RLBase.action_space(env::MultiArmBanditsEnv) = Base.OneTo(length(env.true_values))
RLBase.state(env::MultiArmBanditsEnv, ::Observation, ::DefaultPlayer) = 1
RLBase.state_space(env::MultiArmBanditsEnv) = Base.OneTo(1)
RLBase.is_terminated(env::MultiArmBanditsEnv) = env.is_terminated
RLBase.reward(env::MultiArmBanditsEnv) = env.reward
RLBase.reset!(env::MultiArmBanditsEnv) = env.is_terminated = false
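One method is still missing before the environment can actually be stepped: applying an action. A minimal sketch, mirroring the full MultiArmBanditsEnv example shipped with the RL.jl environments:

function (env::MultiArmBanditsEnv)(action)
    # pulling an arm yields a noisy reward around its true value and ends the episode
    env.reward = randn(env.rng) + env.true_values[action]
    env.is_terminated = true
end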
Several important things to consider:
To be honest, it is hard for RL.jl to compete with many other mature packages written in Python. Part of the reason is that we lack active contributors & users (kind of a chicken-or-egg problem).
Thank you all!
(Contributions and cooperation are warmly welcomed!)