Chapter 8.6 Trajectory Sampling

The general function run(policy, env, stop_condition, hook) is flexible and powerful, but we are not restricted to it. In this notebook, we'll see how to use some of the components provided in ReinforcementLearning.jl directly to carry out a specific experiment.

First, let's define the environment mentioned in Chapter 8.6:

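In short, the task is a randomly generated episodic MDP: every state has two actions, each action leads to one of b next states (picked at random when the task is created) with equal probability, the expected reward of every transition is drawn from a standard normal distribution, and each transition has a 0.1 probability of ending the episode. Below is a plain-Julia sketch along those lines; the names and the zero reward on termination are my own guesses, not necessarily what the notebook uses.

```julia
using Random

# A sketch of the randomly generated task from Chapter 8.6: each (state, action)
# pair has `b` possible successor states, the expected reward of each transition
# is drawn from N(0, 1), and every transition ends the episode with probability 0.1.
struct BranchTask
    n_states::Int
    n_actions::Int
    b::Int
    next_states::Array{Int,3}     # next_states[:, a, s] are the b successors of (s, a)
    rewards::Array{Float64,3}     # rewards[:, a, s] are the corresponding rewards
end

BranchTask(n_states, n_actions, b; rng = Random.GLOBAL_RNG) = BranchTask(
    n_states,
    n_actions,
    b,
    rand(rng, 1:n_states, b, n_actions, n_states),
    randn(rng, b, n_actions, n_states),
)

# Sample one transition from (s, a); returns (s′, reward, is_terminal).
# State 0 denotes the terminal state; the reward on termination is assumed to be 0.
function sample_transition(task::BranchTask, s, a; rng = Random.GLOBAL_RNG)
    if rand(rng) < 0.1
        (0, 0.0, true)
    else
        i = rand(rng, 1:task.b)
        (task.next_states[i, a, s], task.rewards[i, a, s], false)
    end
end
```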

Note that this environment is not described very clearly in the book. Part of the information is inferred from the Lisp source code.

Info

Actually, the Lisp code is not easy to follow either; it took me a whole afternoon to figure out its logic. So good luck if you also want to understand it.

The definitions above are just like those of the environments we've defined in previous chapters. Now we'll add an extra function to make the environment usable for planning.

Main.workspace46.successors
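For planning we need the full one-step model of a state–action pair rather than sampled transitions, which is what successors is for. Here is one way it could look for the sketch above; again a guess at the idea, not the notebook's exact code.

```julia
# Return the one-step model of (s, a) as (s′, probability, reward) triples:
# each of the b recorded successors has probability 0.9 / b, and the terminal
# state (0) is reached with probability 0.1 and an assumed reward of 0.
function successors(task::BranchTask, s, a)
    p = 0.9 / task.b
    transitions = [(task.next_states[i, a, s], p, task.rewards[i, a, s]) for i in 1:task.b]
    push!(transitions, (0, 0.1, 0.0))
    transitions
end
```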
γ = 0.9
n_sweep = 10
Main.workspace46.eval_Q
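eval_Q measures the quality of the current action values. Following the book, the natural measure is the value of the start state under the greedy policy with respect to Q, which can be estimated by averaging Monte Carlo returns. A sketch, assuming Q is an n_actions × n_states matrix (the notebook's eval_Q may compute this differently):

```julia
# Estimate the value of the start state under the greedy policy w.r.t. Q by
# averaging the returns of simulated episodes.
function eval_Q(task::BranchTask, Q; s₀ = 1, γ = 0.9, n_runs = 1000, rng = Random.GLOBAL_RNG)
    total = 0.0
    for _ in 1:n_runs
        s, G, discount, done = s₀, 0.0, 1.0, false
        while !done
            a = argmax(Q[:, s])                            # greedy action
            s, r, done = sample_transition(task, s, a; rng = rng)
            G += discount * r
            discount *= γ
        end
        total += G
    end
    total / n_runs
end
```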
Main.workspace46.gain
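gain is presumably the target of an expected (full-backup) update for a single state–action pair, i.e. the probability-weighted sum of r + γ max over a′ of Q(s′, a′) across all successors. A sketch building on the functions above:

```julia
# Expected-update target for (s, a): sum over successors of p * (r + γ * max_a′ Q(s′, a′)),
# with the terminal state (0) contributing no future value.
function gain(task::BranchTask, Q, s, a; γ = 0.9)
    sum(p * (r + (s′ == 0 ? 0.0 : γ * maximum(Q[:, s′])))
        for (s′, p, r) in successors(task, s, a))
end
```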
sweep (generic function with 1 method)
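The uniform case then simply cycles through all state–action pairs in order, applies an expected update to each, and records the greedy-policy value as it goes. A sketch of what sweep could look like (eval_every is an illustrative parameter, not necessarily the notebook's):

```julia
# Uniform distribution of updates: repeatedly sweep through every (s, a) pair,
# apply an expected update, and record eval_Q every `eval_every` updates.
function sweep(task::BranchTask; γ = 0.9, n_sweep = 10, eval_every = 100)
    Q = zeros(task.n_actions, task.n_states)
    history, n_updates = Float64[], 0
    for _ in 1:n_sweep
        for s in 1:task.n_states, a in 1:task.n_actions
            Q[a, s] = gain(task, Q, s, a; γ = γ)
            n_updates += 1
            n_updates % eval_every == 0 && push!(history, eval_Q(task, Q; γ = γ))
        end
    end
    history
end
```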
on_policy (generic function with 1 method)
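The on-policy case instead generates trajectories from the start state with an ε-greedy policy and updates only the state–action pairs it actually visits, which is the whole point of trajectory sampling. A sketch, plus how the two methods could be compared (all parameter values are illustrative):

```julia
# On-policy distribution of updates: follow an ε-greedy trajectory from the
# start state and apply an expected update to each (s, a) pair as it is visited.
function on_policy(task::BranchTask; γ = 0.9, n_updates = 20_000, ε = 0.1,
                   eval_every = 100, rng = Random.GLOBAL_RNG)
    Q = zeros(task.n_actions, task.n_states)
    history = Float64[]
    s = 1                                     # the start state
    for i in 1:n_updates
        a = rand(rng) < ε ? rand(rng, 1:task.n_actions) : argmax(Q[:, s])
        Q[a, s] = gain(task, Q, s, a; γ = γ)
        s′, _, done = sample_transition(task, s, a; rng = rng)
        s = done ? 1 : s′                     # restart from the start state when an episode ends
        i % eval_every == 0 && push!(history, eval_Q(task, Q; γ = γ))
    end
    history
end

# Example comparison (all sizes and counts are illustrative):
task = BranchTask(1000, 2, 3)
uniform_curve = sweep(task; n_sweep = 10)                 # 10 sweeps = 20_000 expected updates
on_policy_curve = on_policy(task; n_updates = 20_000)     # same number of updates
```

The book plots these two learning curves against the number of expected updates for various numbers of states and branching factors; focusing updates on on-policy trajectories tends to improve the greedy policy much faster early on, especially when the branching factor is small.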