# The Baird Counterexample Environment

This notebook reproduces Figure 11.2 from Sutton & Barto's *Reinforcement Learning: An Introduction* (2nd edition): the instability of semi-gradient off-policy TD on Baird's counterexample. The environment is bound to `world`; its summary is printed below.
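A minimal construction sketch (the `using` line and the zero-argument constructor are assumptions; `BairdCounterEnv` ships with the ReinforcementLearningAnIntroduction.jl project):

```
using ReinforcementLearning

# Baird's counterexample MDP: 7 states and 2 actions ("dashed" and "solid");
# every transition yields reward 0.
world = BairdCounterEnv()
```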
# BairdCounterEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(7)`

## Action Space

`Base.OneTo(2)`

## Current State

```
1
```
Next come the constants for the linear function approximation. `NW` is the number of weights (8) and `INIT_WEIGHT` is the large initial weight value (10) from Baird's counterexample, where the weight vector starts at (1, 1, 1, 1, 1, 1, 10, 1)ᵀ. `STATE_MAPPING` is the 8×7 feature matrix whose columns are the feature vectors of the seven states; it is first allocated as all zeros:
8×7 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0
and then filled in so that each of the first six states activates its own feature with value 2 plus the shared eighth feature, while the seventh state activates its own feature once and the eighth feature twice:
8×7 Array{Float64,2}:
 2.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  2.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  2.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  2.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  2.0
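The matrix above can be rebuilt in a few lines of plain Julia; this sketch reproduces the printed values (the notebook's own construction cell is not shown here):

```
NW, NS = 8, 7                    # 8 weights, 7 states
STATE_MAPPING = zeros(NW, NS)    # column s is the feature vector of state s
for s in 1:6
    STATE_MAPPING[s, s] = 2.0    # states 1–6: twice their own feature …
    STATE_MAPPING[NW, s] = 1.0   # … plus the shared eighth feature
end
STATE_MAPPING[7, 7] = 1.0        # state 7: its own feature once …
STATE_MAPPING[NW, 7] = 2.0       # … and the eighth feature twice
```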
The behavior policy `π_b` is a plain anonymous function; in Baird's counterexample it takes the dashed action with probability 6/7 and the solid action with probability 1/7. The target policy `π_t`, which always takes the solid action, is a `VBasedPolicy` whose `TDLearner` uses a `LinearApproximator` over the eight weights:
VBasedPolicy
├─ learner => TDLearner
│  ├─ approximator => LinearApproximator
│  │  ├─ weights => 8-element Array{Float64,1}
│  │  └─ optimizer => Descent
│  │     └─ eta => 0.01
│  ├─ γ => 0.99
│  ├─ method => SRS
│  └─ n => 0
└─ mapping => Main.var"#3#4"
The helpers `prob_b` and `prob_t` return the probability of a given action under the behavior and target policies; their ratio is the per-step importance-sampling weight.
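A sketch of the two policies and the probability helpers, assuming action 1 is the dashed action and action 2 the solid one (the actual index assignment and function signatures are not visible here):

```
π_b = env -> rand() < 6 / 7 ? 1 : 2      # behavior policy: dashed w.p. 6/7, solid w.p. 1/7
prob_b(env, a) = a == 1 ? 6 / 7 : 1 / 7  # action probabilities under the behavior policy
prob_t(env, a) = a == 2 ? 1.0 : 0.0      # target policy always takes the solid action
```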

Well, I must admit it is a little tricky here. The `OffPolicy` policy pairs the target policy `π_t` with the behavior policy `π_b`, and the trajectory needs an extra `weight` trace so that each transition can carry its importance-sampling ratio (`prob_t / prob_b`). Putting everything together gives the `agent`:
Agent
├─ policy => OffPolicy
│  ├─ π_target => VBasedPolicy
│  │  ├─ learner => TDLearner
│  │  │  ├─ approximator => LinearApproximator
│  │  │  │  ├─ weights => 8-element Array{Float64,1}
│  │  │  │  └─ optimizer => Descent
│  │  │  │     └─ eta => 0.01
│  │  │  ├─ γ => 0.99
│  │  │  ├─ method => SRS
│  │  │  └─ n => 0
│  │  └─ mapping => Main.var"#3#4"
│  └─ π_behavior => Main.var"#1#2"
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ weight => 0-element Array{Float64,1}
      ├─ state => 0-element Array{Any,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
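A sketch of how the agent can be assembled. The keyword constructors below simply mirror the field names printed in the tree; exact signatures vary across ReinforcementLearning.jl versions, and `Descent` comes from Flux:

```
using ReinforcementLearning, Flux

init_w = ones(NW); init_w[7] = INIT_WEIGHT   # w = (1,1,1,1,1,1,10,1), as in the book

agent = Agent(
    policy = OffPolicy(
        π_target = VBasedPolicy(
            learner = TDLearner(
                approximator = LinearApproximator(weights = init_w, optimizer = Descent(0.01)),
                γ = 0.99,
                method = :SRS,              # state-value prediction
                n = 0,
            ),
            mapping = (env, learner) -> 2,  # always pick the solid action (index assumed)
        ),
        π_behavior = π_b,
    ),
    trajectory = Trajectory(;
        weight = Float64[],
        state = Any[],
        action = Int[],
        reward = Float32[],
        terminal = Bool[],
    ),
)
```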
The raw state of `world` is an integer in 1:7, but the linear approximator expects the corresponding feature vector, so the environment is wrapped with `StateOverriddenEnv` into `new_env`:
# BairdCounterEnv |> StateOverriddenEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(7)`

## Action Space

`Base.OneTo(2)`

## Current State

```
[0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```
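A sketch of the wrapper, assuming `StateOverriddenEnv` takes the wrapped environment and a state-mapping function; note that the current state printed above is exactly `STATE_MAPPING[:, 2]`:

```
# Replace the integer state with its feature vector before the agent sees it.
new_env = StateOverriddenEnv(world, s -> STATE_MAPPING[:, s])
```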
Finally, a `hook` records the weight vector at every step while the agent interacts with `new_env`; plotting the recorded weights reproduces the diverging curves of Figure 11.2.
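One way to do this is with a custom hook. The sketch below is an assumption, not the notebook's own code: the hook type `RecordWeights` is a made-up name, the run length of 1000 steps follows the book's figure, and API names such as `AbstractHook`, `PostActStage`, and `StopAfterStep` follow the ReinforcementLearning.jl v0.9/v0.10-era interface. The field path into `agent` follows the tree printed above.

```
using Plots

Base.@kwdef struct RecordWeights <: AbstractHook
    weights::Vector{Vector{Float64}} = Vector{Float64}[]
end

# Log a copy of the current weight vector after every environment step.
function (h::RecordWeights)(::PostActStage, agent, env)
    push!(h.weights, copy(agent.policy.π_target.learner.approximator.weights))
end

hook = RecordWeights()
run(agent, new_env, StopAfterStep(1_000), hook)

# Figure 11.2: every weight component grows without bound under
# semi-gradient off-policy TD on Baird's counterexample.
plot(reduce(hcat, hook.weights)'; xlabel = "step", ylabel = "wᵢ")
```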