Chapter 10.3 Access Control

In Chapter 10.3 of the book, the Differential semi-gradient Sarsa algorithm for estimating q̂ ≈ q⋆ is introduced. This algorithm is not included in ReinforcementLearning.jl, so here we'll use it as an example to demonstrate how easy it is to extend components in ReinforcementLearning.jl.


Implement the DifferentialTDLearner

First, let's define a DifferentialTDLearner. It will be used to estimate Q values, so we have to implement (L::DifferentialTDLearner)(env::AbstractEnv), which simply forwards the current state to the inner approximator.
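A minimal sketch of such a learner is given below, assuming the AbstractLearner interface from ReinforcementLearning.jl. The field names (approximator, β and R̄) match the agent structure printed further down; everything else is just one possible way to write it.

```
using ReinforcementLearning

Base.@kwdef mutable struct DifferentialTDLearner{A} <: AbstractLearner
    approximator::A    # tabular estimate of the action values
    β::Float64         # step size for the average reward estimate
    R̄::Float64 = 0.0   # running estimate of the average reward
end

# Querying the learner with an environment simply forwards the current
# state to the inner approximator, which returns the estimated Q values.
(L::DifferentialTDLearner)(env::AbstractEnv) = L.approximator(state(env))
```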


Now, based on the definition of this algorithm in the book, we can implement the updating logic as follows:
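For a transition (S, A, R, S′, A′) the differential semi-gradient Sarsa update is δ = R − R̄ + q̂(S′, A′) − q̂(S, A), R̄ ← R̄ + βδ, followed by a semi-gradient step on the weights. A sketch in terms of the learner above, assuming the TabularApproximator consumes an (s, a) => Δ correction through its Descent optimizer, could be:

```
# One differential Sarsa update for the transition (s, a, r, s′, a′).
# The correction convention (estimate minus target) is an assumption about
# how `update!(::TabularApproximator, pair)` consumes its argument.
function RLBase.update!(L::DifferentialTDLearner, transition)
    s, a, r, s′, a′ = transition
    Q = L.approximator
    δ = r - L.R̄ + Q(s′, a′) - Q(s, a)   # differential TD error
    L.R̄ += L.β * δ                       # update the average reward estimate
    update!(Q, (s, a) => -δ)             # Descent then applies Q(s, a) += η * δ
end
```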


Next, we dispatch some runtime logic to our specific learner to make sure the above update! function is called at the right time.
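One way to write that dispatch, assuming the agent invokes update!(policy, trajectory, env, stage) at every stage (the signature below follows that convention), is:

```
# Update the learner right before acting and once more at the end of an
# episode, using only the most recent transition stored in the trajectory.
# Indexing the traces by name (t[:state], t[:reward], ...) is assumed here.
function RLBase.update!(
    p::QBasedPolicy{<:DifferentialTDLearner},
    t::AbstractTrajectory,
    ::AbstractEnv,
    ::Union{PreActStage,PostEpisodeStage},
)
    if length(t[:reward]) > 0
        S, A, R = t[:state], t[:action], t[:reward]
        # (S[end-1], A[end-1]) --R[end]--> (S[end], A[end])
        update!(p.learner, (S[end-1], A[end-1], R[end], S[end], A[end]))
    end
end
```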


The above function specifies that we only update the DifferentialTDLearner at the PreActStage and the PostEpisodeStage. Also note that we don't need to keep all the transitions in the trajectory; only the most recent one is used, so we can empty! the trajectory at the start of each episode:
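A matching sketch for the trajectory side, again assuming an update!(trajectory, policy, env, stage) call at each stage, is:

```
# Clear the trajectory at the start of every episode; the learner above only
# ever looks at the most recent transition, so nothing older is needed.
function RLBase.update!(
    t::AbstractTrajectory,
    ::QBasedPolicy{<:DifferentialTDLearner},
    ::AbstractEnv,
    ::PreEpisodeStage,
)
    empty!(t)
end
```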


Access Control Environment

Before evaluating the learner implemented above, we have to first define the environment.
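This is the access-control queuing task from Example 10.2 of the book: 10 servers, customers with priorities 1, 2, 4 and 8, and each busy server becomes free with probability 0.06 at every step. A sketch of the environment against the AbstractEnv interface (the field names and state encoding below are our own choices) is:

```
# Access-control queuing task: decide whether to accept (reward = priority)
# or reject (reward = 0) the customer at the head of the queue.
mutable struct AccessControlEnv <: AbstractEnv
    n_servers::Int
    n_free_servers::Int
    priorities::Vector{Int}
    customer::Int      # index into `priorities` for the current customer
    reward::Float64
end

AccessControlEnv() = AccessControlEnv(10, 10, [1, 2, 4, 8], rand(1:4), 0.0)

RLBase.action_space(::AccessControlEnv) = Base.OneTo(2)  # 1: reject, 2: accept
RLBase.state_space(env::AccessControlEnv) =
    Base.OneTo((env.n_servers + 1) * length(env.priorities))
RLBase.state(env::AccessControlEnv) =
    env.n_free_servers * length(env.priorities) + env.customer
RLBase.reward(env::AccessControlEnv) = env.reward
RLBase.is_terminated(::AccessControlEnv) = false  # a continuing task

function RLBase.reset!(env::AccessControlEnv)
    env.n_free_servers = env.n_servers
    env.customer = rand(1:length(env.priorities))
    env.reward = 0.0
end

function (env::AccessControlEnv)(action)
    if action == 2 && env.n_free_servers > 0
        env.n_free_servers -= 1
        env.reward = env.priorities[env.customer]
    else
        env.reward = 0.0
    end
    # each busy server becomes free with probability 0.06
    env.n_free_servers += count(rand() < 0.06 for _ in 1:(env.n_servers - env.n_free_servers))
    env.customer = rand(1:length(env.priorities))
end
```

With 11 possible numbers of free servers and 4 priorities, this gives the 44 states and 2 actions summarized below.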

world
# AccessControlEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(44)`

## Action Space

`Base.OneTo(2)`

## Current State

```
23
```
NS = 44
NA = 2
agent
Agent
├─ policy => QBasedPolicy
│  ├─ learner => DifferentialTDLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 2×44 Array{Float64,2}
│  │  │  └─ optimizer => Descent
│  │  │     └─ eta => 0.01
│  │  ├─ β => 0.01
│  │  └─ R̄ => 0.0
│  └─ explorer => EpsilonGreedyExplorer
│     ├─ ϵ_stable => 0.1
│     ├─ ϵ_init => 1.0
│     ├─ warmup_steps => 0
│     ├─ decay_steps => 0
│     ├─ step => 1
│     ├─ rng => Random._GLOBAL_RNG
│     └─ is_training => true
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
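For reference, here is a sketch of how the agent printed above might be assembled and trained; the constructor signatures and the stop condition are assumptions, so the actual notebook cells may differ:

```
using Flux: Descent

agent = Agent(
    policy = QBasedPolicy(
        learner = DifferentialTDLearner(
            approximator = TabularApproximator(zeros(NA, NS), Descent(0.01)),
            β = 0.01,
        ),
        explorer = EpsilonGreedyExplorer(0.1),
    ),
    trajectory = VectorSARTTrajectory(),
)

# Train for a fixed number of steps; the task is continuing, so there is no
# natural episode boundary and the step count here is illustrative only.
run(agent, world, StopAfterStep(100_000))
```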