Chapter 6.7 Maximization Bias and Double Learning

In Example 6.7, the authors introduce an MDP to compare the performance of Q-Learning and Double-Q-Learning. This environment is a bit special compared to the ones we have seen before: in the first step, only the LEFT and RIGHT actions are allowed; in the second step, if LEFT was chosen previously, there are 10 valid actions. We say such an environment is of the FULL_ACTION_SET style, meaning the set of legal actions changes from state to state.
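Below is a minimal sketch of how such an environment could be written against the ReinforcementLearningBase interface. The field and constant names (`position`, `LEFT`, `RIGHT`) are illustrative assumptions rather than the notebook's exact code; the reward for every action taken in the second state is drawn from a Normal(-0.1, 1) distribution, as in the book's Example 6.7.

```julia
# A sketch of the environment, assuming the RLBase interface.
using ReinforcementLearningBase

const LEFT = 1
const RIGHT = 2

Base.@kwdef mutable struct MaximizationBiasEnv <: AbstractEnv
    position::Int = 1      # 1 = start state A, 2 = state B, 3 = terminal
    reward::Float64 = 0.0
end

RLBase.state_space(::MaximizationBiasEnv) = Base.OneTo(3)
RLBase.action_space(::MaximizationBiasEnv) = Base.OneTo(10)
RLBase.ActionStyle(::MaximizationBiasEnv) = FULL_ACTION_SET

# Only LEFT/RIGHT are legal in the start state; all 10 actions are legal in state B.
RLBase.legal_action_space(env::MaximizationBiasEnv) =
    env.position == 1 ? (1:2) : (1:10)
RLBase.legal_action_space_mask(env::MaximizationBiasEnv) =
    env.position == 1 ? vcat(trues(2), falses(8)) : trues(10)

function (env::MaximizationBiasEnv)(action::Int)
    if env.position == 1
        env.position = action == LEFT ? 2 : 3
        env.reward = 0.0
    elseif env.position == 2
        env.position = 3
        env.reward = randn() - 0.1   # every action in B yields a N(-0.1, 1) reward
    end
end

RLBase.state(env::MaximizationBiasEnv) = env.position
RLBase.reward(env::MaximizationBiasEnv) = env.reward
RLBase.is_terminated(env::MaximizationBiasEnv) = env.position == 3

function RLBase.reset!(env::MaximizationBiasEnv)
    env.position = 1
    env.reward = 0.0
end
```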


Now the environment is well defined. We instantiate it as `world` and display its summary:

# MaximizationBiasEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |        ReinforcementLearningBase.FullActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(3)`

## Action Space

`Base.OneTo(10)`

## Current State

```
1
```

To calculate the percentage of LEFT actions chosen in the first step, we'll create a customized hook here:
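A minimal sketch of such a hook is shown below, assuming the hook interface from ReinforcementLearningCore, where hooks are called at each stage of an experiment; the exact callback signature may differ between package versions.

```julia
using ReinforcementLearningCore

# A custom hook that records, on every visit to the start state,
# whether the LEFT action was chosen.
Base.@kwdef mutable struct CountOfLeft <: AbstractHook
    counts::Vector{Bool} = Bool[]
end

function (hook::CountOfLeft)(::PreActStage, agent, env, action)
    if state(env) == 1
        push!(hook.counts, action == LEFT)
    end
end
```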


Next, we create two agent factories, one for Q-Learning and one for Double-Q-Learning.
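The sketch below shows one way such factories could look, assuming ReinforcementLearning.jl v0.10-style components (`TabularQApproximator`, `TDLearner`, `DoubleLearner`, `EpsilonGreedyExplorer`, `VectorSARTTrajectory`); constructor keywords may vary across versions, and the hyper-parameters (step size 0.1, ε = 0.1, γ = 1) follow the book's example.

```julia
using ReinforcementLearning
using Flux: Descent   # Descent(0.1) gives a constant step size α = 0.1

function create_Q_agent()
    Agent(
        policy = QBasedPolicy(
            learner = TDLearner(
                approximator = TabularQApproximator(
                    n_state = 3,
                    n_action = 10,
                    opt = Descent(0.1),
                ),
                γ = 1.0,
                method = :SARS,
                n = 0,
            ),
            explorer = EpsilonGreedyExplorer(0.1; is_break_tie = true),
        ),
        trajectory = VectorSARTTrajectory(),
    )
end

function create_double_Q_agent()
    Agent(
        policy = QBasedPolicy(
            # DoubleLearner keeps two independent Q estimates and updates one
            # with the greedy action selected by the other, removing the bias.
            learner = DoubleLearner(
                L1 = TDLearner(
                    approximator = TabularQApproximator(n_state = 3, n_action = 10, opt = Descent(0.1)),
                    γ = 1.0,
                    method = :SARS,
                    n = 0,
                ),
                L2 = TDLearner(
                    approximator = TabularQApproximator(n_state = 3, n_action = 10, opt = Descent(0.1)),
                    γ = 1.0,
                    method = :SARS,
                    n = 0,
                ),
            ),
            explorer = EpsilonGreedyExplorer(0.1; is_break_tie = true),
        ),
        trajectory = VectorSARTTrajectory(),
    )
end
```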

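Finally, as a hypothetical usage sketch (not the notebook's exact experiment loop), a single run could be driven like this, with the fraction of LEFT choices read back from the hook:

```julia
using Statistics: mean

env = MaximizationBiasEnv()
hook = CountOfLeft()

# Run a single Q-Learning agent for 300 episodes and report how often
# LEFT was chosen in the start state.
run(create_Q_agent(), env, StopAfterEpisode(300), hook)
mean(hook.counts)
```

To reproduce the figure in the book, this kind of run would be repeated many times for each agent type and the per-episode averages of the two learners compared.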