Chapter 9: On-policy Prediction with Approximation

In this notebook, we'll focus on linear function approximation methods.


Figure 9.1

We've discussed the RandomWalk1D environment before. In the previous example the state space was relatively small (1:7). Here we expand it to 1:1000 (1002 states once the two terminal states are included) and see how the LinearVApproximator works on it.

The environment is configured by a few constants: `ACTIONS` (the set of possible moves), `NA = 200` (the number of actions) and `NS = 1002` (the full state space: the 1000 states plus the two terminal states).
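A sketch of how these constants might be defined (the values mirror the outputs above; the jump range of 1–100 steps in either direction is an assumption based on `NA = 200`):

```julia
# Sketch: 1000 interior states plus two terminal states, with jumps of
# 1–100 states to the left or right on each step (assumed setup).
ACTIONS = collect(Iterators.flatten((-100:-1, 1:100)))
NA = length(ACTIONS)    # 200 possible jumps
NS = 1002               # states 2:1001 are interior; 1 and NS are terminal
```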

First, let's run a large experiment to estimate the true value of each state:

The estimated values are stored in `TRUE_STATE_VALUES`.
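A minimal, self-contained sketch of one way such estimates could be produced (not necessarily the notebook's approach): simulate many episodes under the equiprobable random policy and average the observed returns for each visited state. The function name and episode count are assumptions.

```julia
# Every-visit Monte Carlo estimate of the true state values (sketch).
# With γ = 1 and all intermediate rewards 0, the return from any visited
# state equals the terminal reward (-1 on the left, +1 on the right).
function estimate_true_values(; n_episodes = 100_000)
    returns = zeros(NS)        # sum of returns observed from each state
    counts  = zeros(Int, NS)   # number of visits to each state
    for _ in 1:n_episodes
        s, visited = (NS + 1) ÷ 2, Int[]
        while 1 < s < NS
            push!(visited, s)
            s = clamp(s + rand(ACTIONS), 1, NS)   # jumps past the edge terminate
        end
        G = s == 1 ? -1.0 : 1.0
        for v in visited
            returns[v] += G
            counts[v]  += 1
        end
    end
    returns ./ max.(counts, 1)   # unvisited (terminal) entries stay at 0
end
```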

Next, we define a preprocessor to map adjacent states into groups.

With `N_GROUPS = 10`, the `GroupMapping` preprocessor aggregates the 1000 non-terminal states into 10 groups of 100 consecutive states each, keeping the two terminal states as their own entries.
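A sketch of what such a grouping preprocessor could look like (the real `GroupMapping` may differ in detail): it maps a raw state in `1:NS` to one of `N_GROUPS + 2` indices, which matches the 12-element tables used by the agents below.

```julia
# Map adjacent interior states into N_GROUPS groups of equal size;
# each terminal state keeps an index of its own (sketch).
struct GroupMapping
    n::Int            # number of groups
    per_group::Int    # interior states per group
end

GroupMapping(; n = N_GROUPS) = GroupMapping(n, (NS - 2) ÷ n)

function (m::GroupMapping)(s::Int)
    if s == 1
        1                              # left terminal
    elseif s == NS
        m.n + 2                        # right terminal
    else
        (s - 2) ÷ m.per_group + 2      # interior states fall into indices 2:n+1
    end
end
```

With the defaults, states 2:101 map to index 2, states 102:201 to index 3, and so on.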

To count how often each state is visited, we need to write a custom hook.

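A hedged sketch of such a hook, assuming the callable-struct hook interface of ReinforcementLearning.jl (`AbstractHook` plus stage methods such as `PostActStage`); the struct name is ours.

```julia
using ReinforcementLearning

# Count how many times each state index is observed (sketch).
Base.@kwdef struct CountStates <: AbstractHook
    counts::Vector{Int} = zeros(Int, NS)
end

# Record the environment's current state after every step. Note that if the
# environment wraps states (as with the grouping preprocessor above), this
# counts the mapped indices; counting raw walk states would require reading
# the inner environment's state instead.
(h::CountStates)(::PostActStage, policy, env) = h.counts[state(env)] += 1
```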

Now let's kick off our experiment:

agent_1
Agent
├─ policy => VBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 12-element Array{Float64,1}
│  │  │  └─ optimizer => Descent
│  │  │     └─ eta => 2.0e-5
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.EveryVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ mapping => Main.var"#3#4"
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
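A sketch that mirrors the printed structure of `agent_1` above. The constructor keywords, the `EVERY_VISIT` constant and the mapping's call signature follow the ReinforcementLearning.jl API this notebook was written against and are assumptions that may differ in other versions.

```julia
using ReinforcementLearning, Flux

agent_1 = Agent(
    policy = VBasedPolicy(
        learner = MonteCarloLearner(
            approximator = TabularApproximator(
                table = zeros(N_GROUPS + 2),   # one value per group + 2 terminals
                optimizer = Descent(2e-5),
            ),
            γ = 1.0,
            kind = EVERY_VISIT,
        ),
        mapping = (env, V) -> rand(1:NA),      # prediction only: act at random
    ),
    trajectory = VectorSARTTrajectory(),
)
```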
env_1
# RandomWalk1D |> StateOverriddenEnv

## Traits

| Trait Type        |                                          Value |
|:----------------- | ----------------------------------------------:|
| NumAgentStyle     |        ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |         ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.PerfectInformation() |
| ChanceStyle       |      ReinforcementLearningBase.Deterministic() |
| RewardStyle       |     ReinforcementLearningBase.TerminalReward() |
| UtilityStyle      |         ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |   ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        | ReinforcementLearningBase.Observation{Int64}() |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Int64}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(1002)`

## Action Space

`Base.OneTo(200)`

## Current State

```
6
```
Attaching the `hook`, we run the experiment and plot the result:
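A hedged sketch of how the pieces fit together, assuming the standard `run(policy, env, stop_condition, hook)` entry point; the wrapper's argument order, the `RandomWalk1D` keywords and the episode count are assumptions.

```julia
# Wire up the enlarged random walk, the grouping preprocessor, the agent
# and the counting hook, then run for many episodes (sketch).
env_1 = StateOverriddenEnv(
    RandomWalk1D(N = NS, actions = ACTIONS),   # enlarged walk; kwargs assumed
    GroupMapping(),                            # raw state -> group index
)
hook = CountStates()
run(agent_1, env_1, StopAfterEpisode(100_000), hook)
```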

Figure 9.2

Next, we define a second agent that uses a `TDLearner` instead of Monte Carlo:
agent_2
Agent
├─ policy => VBasedPolicy
│  ├─ learner => TDLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 12-element Array{Float64,1}
│  │  │  └─ optimizer => Descent
│  │  │     └─ eta => 0.0002
│  │  ├─ γ => 1.0
│  │  ├─ method => SRS
│  │  └─ n => 0
│  └─ mapping => Main.var"#5#6"
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
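As before, a sketch mirroring the printed tree; the keyword names, the `:SRS` method symbol and the mapping signature are assumptions.

```julia
agent_2 = Agent(
    policy = VBasedPolicy(
        learner = TDLearner(
            approximator = TabularApproximator(
                table = zeros(N_GROUPS + 2),
                optimizer = Descent(2e-4),
            ),
            γ = 1.0,
            method = :SRS,   # learn state values from state-reward-state transitions
            n = 0,           # 0 extra backup steps, i.e. TD(0)
        ),
        mapping = (env, V) -> rand(1:NA),
    ),
    trajectory = VectorSARTTrajectory(),
)
```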

Figure 9.2 right

For the right panel, the states are aggregated into `n_groups = 20` groups, and a helper `run_once` (a generic function with 1 method) runs a single configuration of the n-step TD method; sweeping it over step sizes and values of n produces the plot.
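A self-contained sketch of what such a helper could compute (not the notebook's code): run n-step semi-gradient TD with state aggregation for a few episodes and return the root-mean-square error against `TRUE_STATE_VALUES`. Here `n` is the classic n-step horizon (`n = 1` is TD(0), unlike the `n = 0` convention of `TDLearner` above); the episode count and error definition are assumptions.

```julia
function run_once_sketch(n, α; n_groups = 20, n_episodes = 10)
    group(s) = (s - 2) ÷ ((NS - 2) ÷ n_groups) + 1   # interior state -> 1:n_groups
    w = zeros(n_groups)                              # one weight per group
    v̂(s) = 1 < s < NS ? w[group(s)] : 0.0            # terminal states have value 0
    for _ in 1:n_episodes
        states  = [(NS + 1) ÷ 2]                     # S_0: start in the middle
        rewards = Float64[]
        T, t = typemax(Int), 0
        while true
            if t < T                                 # take one more step
                s′ = clamp(states[end] + rand(ACTIONS), 1, NS)
                push!(states, s′)
                push!(rewards, s′ == 1 ? -1.0 : s′ == NS ? 1.0 : 0.0)
                (s′ == 1 || s′ == NS) && (T = t + 1)
            end
            τ = t - n + 1                            # time whose estimate is updated
            if τ ≥ 0
                G = sum(rewards[τ+1:min(τ+n, T)])    # γ = 1, so the return is a sum
                τ + n < T && (G += v̂(states[τ+n+1]))
                s = states[τ+1]
                w[group(s)] += α * (G - v̂(s))        # semi-gradient update
            end
            τ == T - 1 && break
            t += 1
        end
    end
    sqrt(sum((v̂(s) - TRUE_STATE_VALUES[s])^2 for s in 2:NS-1) / (NS - 2))
end
```

For example, `run_once_sketch(4, 0.1)` evaluates 4-step TD with step size 0.1; averaging such errors over many runs while sweeping `n` and `α` gives the kind of curves shown in the right panel.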

Figure 9.5

This figure compares polynomial and Fourier bases of different orders under gradient Monte Carlo; after defining the basis features, a helper `run_once_MC` (a generic function with 1 method) runs one such configuration.
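A self-contained sketch of what a `run_once_MC`-style helper could do (not the notebook's code): gradient Monte Carlo with a linear value function over Fourier features, returning the RMS error against `TRUE_STATE_VALUES`. The feature definition, step size and episode count are assumptions; conceptually, the weight/feature dot product is exactly the job of the `LinearVApproximator` mentioned at the top.

```julia
using LinearAlgebra: dot

# Fourier basis features of a given order on the state range mapped to [0, 1].
fourier_features(s, order) = [cos(i * π * (s - 2) / (NS - 3)) for i in 0:order]

function run_once_MC_sketch(order; α = 5e-5, n_episodes = 5_000)
    w = zeros(order + 1)
    v̂(s) = dot(w, fourier_features(s, order))
    for _ in 1:n_episodes
        s, visited = (NS + 1) ÷ 2, Int[]
        while 1 < s < NS
            push!(visited, s)
            s = clamp(s + rand(ACTIONS), 1, NS)
        end
        G = s == 1 ? -1.0 : 1.0                  # undiscounted terminal reward
        for v in visited                         # every-visit gradient MC update
            w .+= α .* (G - v̂(v)) .* fourier_features(v, order)
        end
    end
    sqrt(sum((v̂(s) - TRUE_STATE_VALUES[s])^2 for s in 2:NS-1) / (NS - 2))
end
```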