Chapter06 Temporal-Difference Learning (Cliff Walking)


In Example 6.6 of the book, the Cliff Walking gridworld is introduced to compare an on-policy method (SARSA) with an off-policy one (Q-learning). Although there is a dedicated package, GridWorlds.jl, for 2-D environments like this one, we write an independent implementation here as a showcase.
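The 4×12 grid can be described with plain Cartesian and linear indices. Below is a minimal sketch of how the layout and the cliff predicate might be set up; the constant names (`NX`, `NY`, `Start`, `Goal`, `LinearInds`) are assumptions for illustration, while `iscliff` is the helper whose definition shows up as output further down.

```julia
# Grid layout of the cliff-walking world (assumed names, minimal sketch).
const NX = 4                          # number of rows
const NY = 12                         # number of columns
const Start = CartesianIndex(4, 1)    # bottom-left corner
const Goal  = CartesianIndex(4, 12)   # bottom-right corner

# Map each (row, column) position to a single state id in 1:48.
const LinearInds = LinearIndices((NX, NY))

# The cliff occupies the bottom row between the start and the goal.
iscliff(p::CartesianIndex{2}) = p[1] == NX && 1 < p[2] < NY
```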

```
4×12 LinearIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}:
 1  5   9  13  17  21  25  29  33  37  41  45
 2  6  10  14  18  22  26  30  34  38  42  46
 3  7  11  15  19  23  27  31  35  39  43  47
 4  8  12  16  20  24  28  32  36  40  44  48
```
```
iscliff (generic function with 1 method)
```
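The environment itself only needs to remember the agent's current position and implement the `AbstractEnv` interface of ReinforcementLearningBase.jl. The sketch below builds on the constants above and targets the RLBase interface functions `state`, `state_space`, `action_space`, `reward`, `is_terminated` and `reset!`; exact names can differ across versions, and all identifiers other than `CliffWalkingEnv`, `iscliff` and `world` are assumptions. Falling off the cliff is treated as a terminal step here, a common simplification of the book's dynamics (the book instead teleports the agent back to the start).

```julia
import ReinforcementLearningBase
const RLBase = ReinforcementLearningBase

# One action per direction, encoded as offsets of the current position.
const Actions = [
    CartesianIndex(-1, 0),   # up
    CartesianIndex(1, 0),    # down
    CartesianIndex(0, -1),   # left
    CartesianIndex(0, 1),    # right
]

mutable struct CliffWalkingEnv <: RLBase.AbstractEnv
    position::CartesianIndex{2}
end
CliffWalkingEnv() = CliffWalkingEnv(Start)

# Applying an action moves the agent, clamped to the grid boundaries.
function (env::CliffWalkingEnv)(a::Int)
    x, y = Tuple(env.position + Actions[a])
    env.position = CartesianIndex(clamp(x, 1, NX), clamp(y, 1, NY))
end

RLBase.state(env::CliffWalkingEnv) = LinearInds[env.position]
RLBase.state_space(env::CliffWalkingEnv) = Base.OneTo(length(LinearInds))
RLBase.action_space(env::CliffWalkingEnv) = Base.OneTo(length(Actions))

# Every step costs -1, falling off the cliff costs -100, reaching the goal costs nothing.
RLBase.reward(env::CliffWalkingEnv) =
    env.position == Goal ? 0.0 : (iscliff(env.position) ? -100.0 : -1.0)

# An episode ends at the goal or on the cliff (simplification, see above).
RLBase.is_terminated(env::CliffWalkingEnv) = env.position == Goal || iscliff(env.position)

RLBase.reset!(env::CliffWalkingEnv) = env.position = Start

world = CliffWalkingEnv()
```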
```
CliffWalkingEnv
```
```
world
```
# CliffWalkingEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(48)`

## Action Space

`Base.OneTo(4)`

## Current State

```
4
```

Now that we have a workable environment, the next step is to create several factories that generate different policies for comparison.
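The factory is called `create_agent` (its output is shown below); presumably it wires up ReinforcementLearning.jl components such as a tabular Q approximator and an ε-greedy explorer. Since the original cell is not reproduced here, the following is a self-contained, plain-Julia sketch of what such a factory could look like; only the name `create_agent` is taken from the notebook, and `TabularTDAgent`, `select_action` and `update!` are hypothetical helpers.

```julia
# Hedged sketch: a minimal tabular TD agent that can act either on-policy (SARSA)
# or off-policy (Q-learning), depending on `method`.
struct TabularTDAgent
    Q::Matrix{Float64}   # Q[state, action]
    α::Float64           # step size
    γ::Float64           # discount factor
    ε::Float64           # exploration rate
    method::Symbol       # :SARSA or :QLearning
end

# 48 states and 4 actions, matching the environment above.
create_agent(α, method; ns = 48, na = 4, γ = 1.0, ε = 0.1) =
    TabularTDAgent(zeros(ns, na), α, γ, ε, method)

# ε-greedy action selection over the current Q estimates.
select_action(agent::TabularTDAgent, s) =
    rand() < agent.ε ? rand(1:size(agent.Q, 2)) : argmax(agent.Q[s, :])

# One TD backup; the choice of bootstrap target is exactly what Example 6.6 contrasts.
function update!(agent::TabularTDAgent, s, a, r, s′, a′, done)
    q_next = if done
        0.0                          # terminal: no bootstrap
    elseif agent.method === :SARSA
        agent.Q[s′, a′]              # on-policy: value of the action actually taken next
    else
        maximum(agent.Q[s′, :])      # off-policy (Q-learning): value of the greedy action
    end
    agent.Q[s, a] += agent.α * (r + agent.γ * q_next - agent.Q[s, a])
end
```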

```
create_agent (generic function with 1 method)
```
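`repeated_run` (two methods, per the output below) most likely runs the whole experiment several times with independent random initializations and averages the per-episode rewards, which is how the curves of Example 6.6 are produced. Here is a hedged sketch built on the helpers above; `run_episode!` and all default parameter values are assumptions, only the name `repeated_run` comes from the notebook.

```julia
# Run one learning episode and return the total (undiscounted) reward collected.
function run_episode!(agent::TabularTDAgent, env)
    RLBase.reset!(env)
    total = 0.0
    s = RLBase.state(env)
    a = select_action(agent, s)
    while true
        env(a)                               # apply the action
        r = RLBase.reward(env)
        total += r
        s′ = RLBase.state(env)
        done = RLBase.is_terminated(env)
        a′ = select_action(agent, s′)
        update!(agent, s, a, r, s′, a′, done)
        done && return total
        s, a = s′, a′
    end
end

# Average the per-episode rewards of `method` over several independent runs.
function repeated_run(method, n_episodes, n_runs = 10; α = 0.5)
    rewards = zeros(n_episodes)
    for _ in 1:n_runs
        env = CliffWalkingEnv()
        agent = create_agent(α, method)
        for i in 1:n_episodes
            rewards[i] += run_episode!(agent, env)
        end
    end
    rewards ./ n_runs
end
```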
```
repeated_run (generic function with 2 methods)
```
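With those pieces in place, the comparison of Example 6.6 boils down to running both methods and plotting the averaged rewards. A possible usage, assuming the sketches above and Plots.jl:

```julia
using Plots

sarsa_rewards     = repeated_run(:SARSA, 500)
qlearning_rewards = repeated_run(:QLearning, 500)

plot(sarsa_rewards; label = "SARSA", xlabel = "Episode", ylabel = "Average reward per episode")
plot!(qlearning_rewards; label = "Q-learning")
```

The qualitative result from the book carries over: Q-learning learns the optimal path right along the cliff edge, but under ε-greedy behaviour it occasionally falls off, so its online reward per episode is lower than that of SARSA, which settles on the longer, safer route.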