# Chapter 5: The Blackjack Environment

In this notebook, we'll study Monte Carlo methods for playing the game of Blackjack.


As usual, let's define the environment first. The implementation of the Blackjack environment is mainly taken from openai/gym, with some necessary modifications for our follow-up experiments.
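
To get a feel for the interface, here is a minimal interaction sketch. It assumes the `BlackjackEnv` defined above can be constructed without arguments and follows the usual ReinforcementLearningBase API (`reset!`, `state`, `action_space`, `is_terminated`, `reward`); which of the two actions means "hit" and which means "stick" depends on the environment definition.

```julia
using ReinforcementLearning

game = BlackjackEnv()
reset!(game)

# The observation is a tuple: (player's sum, dealer's showing card, usable-ace flag).
@show state(game)

# Play out one hand with uniformly random actions.
while !is_terminated(game)
    game(rand(action_space(game)))
end

# The reward of the finished hand (win / loss / draw).
@show reward(game)
```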

game
# BlackjackEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`ReinforcementLearningBase.Space{Array{Base.OneTo{Int64},1}}(Base.OneTo{Int64}[Base.OneTo(31), Base.OneTo(10), Base.OneTo(2)])`

## Action Space

`Base.OneTo(2)`

## Current State

```
(13, 6, 2)
```

As you can see, the state space of the Blackjack environment has 3 discrete features: the player's current sum, the dealer's showing card, and whether the player holds a usable ace. To reuse the tabular algorithms in ReinforcementLearning.jl, we need to flatten the state tuple into a single integer and wrap the environment in a `StateOverriddenEnv`, as sketched below.
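
Below is a sketch of that flattening. The column-major linear index over `(31, 10, 2)` is an assumption that is consistent with the flattened values shown later; the exact `StateOverriddenEnv` constructor has varied across ReinforcementLearning.jl releases, so treat that call as an assumption too.

```julia
using ReinforcementLearning

# Map the (player_sum, dealer_card, usable_ace) tuple onto a single integer in 1:620
# via column-major linear indexing over the (31, 10, 2) state space.
STATE_MAPPING = s -> LinearIndices((31, 10, 2))[CartesianIndex(s)]

# Wrap the environment so that tabular algorithms only ever see the flattened state.
# Depending on the package version this may instead be written as
# `BlackjackEnv() |> StateOverriddenEnv(STATE_MAPPING)`.
world = StateOverriddenEnv(BlackjackEnv(), STATE_MAPPING)

NS = Base.OneTo(31 * 10 * 2)   # the flattened state space: Base.OneTo(620)
```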

STATE_MAPPING
#1 (generic function with 1 method)
world
# BlackjackEnv |> StateOverriddenEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |           ReinforcementLearningBase.StepReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |     ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle |     ReinforcementLearningBase.Observation{Any}() |

## Is Environment Terminated?

No

## State Space

`ReinforcementLearningBase.Space{Array{Base.OneTo{Int64},1}}(Base.OneTo{Int64}[Base.OneTo(31), Base.OneTo(10), Base.OneTo(2)])`

## Action Space

`Base.OneTo(2)`

## Current State

```
233
```
NS
Base.OneTo(620)

Figure 5.1

agent
```
Agent
├─ policy => VBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 620-element Array{Float64,1}
│  │  │  └─ optimizer => InvDecay
│  │  │     ├─ gamma => 1.0
│  │  │     └─ state => IdDict
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ mapping => Main.var"#3#4"
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
```
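
The `mapping` inside the `VBasedPolicy` above encodes the fixed policy evaluated for Figure 5.1: stick when the player's sum is 20 or 21, otherwise hit. A rough sketch of such a mapping over the flattened state is shown below; the integer codes for hit and stick are assumptions (they depend on the hidden environment definition), and the exact arguments ReinforcementLearningZoo passes to the mapping may differ from a bare state.

```julia
# Assumed action codes; check the BlackjackEnv definition for the real ones.
const HIT, STICK = 1, 2

# Recover the state tuple from the flattened index, then apply the fixed
# "stick on 20 or 21, otherwise hit" policy that Figure 5.1 evaluates.
function stick_on_20_or_21(flat_state)
    player_sum, _, _ = Tuple(CartesianIndices((31, 10, 2))[flat_state])
    player_sum >= 20 ? STICK : HIT
end
```

With the mapping in place, the agent is trained by interacting with `world` for a large number of episodes, e.g. `run(agent, world, StopAfterEpisode(500_000))`, after which the 620-element value table can be reshaped to `(31, 10, 2)` for plotting.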
VT

Figure 5.2


In Section 5.3, the Monte Carlo with Exploring Starts method is used to solve the Blackjack game. Although several variants of Monte Carlo methods are supported in the ReinforcementLearning.jl package, none of them supports exploring starts out of the box. Nevertheless, we can define it very easily, as sketched below.
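
Here is a from-scratch sketch of first-visit Monte Carlo control with exploring starts, written against the generic environment interface rather than the `Agent`/`Trajectory` machinery so the algorithm is explicit. It assumes `world` is the flattened environment from above and that `reset!` starts each hand in a sufficiently random state (one of the "necessary modifications" mentioned earlier), so that together with a uniformly random first action we get exploring starts. All names here are illustrative.

```julia
using ReinforcementLearning

# Q[a, s]: action values over the flattened state space; N[a, s] counts visits.
function mc_exploring_starts(env; ns = 620, na = 2, n_episodes = 500_000)
    Q = zeros(na, ns)
    N = zeros(Int, na, ns)
    for _ in 1:n_episodes
        reset!(env)                                  # exploring start: random initial state…
        a = rand(1:na)                               # …and a uniformly random first action
        episode = Tuple{Int,Int,Float64}[]           # (state, action, reward) triples
        while !is_terminated(env)
            s = state(env)
            env(a)
            push!(episode, (s, a, reward(env)))
            is_terminated(env) && break
            a = argmax(view(Q, :, state(env)))       # act greedily after the first action
        end
        # Incremental return update with γ = 1. States never repeat within a hand,
        # so the every-visit and first-visit updates coincide here.
        G = 0.0
        for (s, a, r) in Iterators.reverse(episode)
            G += r
            N[a, s] += 1
            Q[a, s] += (G - Q[a, s]) / N[a, s]
        end
    end
    Q
end
```

Running `mc_exploring_starts(world)` returns a 2×620 table analogous to the `QT` learned by the `solver` below.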

solver
```
Agent
├─ policy => QBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 2×620 Array{Float64,2}
│  │  │  └─ optimizer => InvDecay
│  │  │     ├─ gamma => 1.0
│  │  │     └─ state => IdDict
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ explorer => GreedyExplorer
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
```
QT
```
2×620 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
```
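
Whichever way the Q-table is learned, Figure 5.2 only needs the greedy policy and its value. Since column `s` of `QT` holds the two action values of flattened state `s`, both can be read off column-wise:

```julia
# Greedy action per flattened state, and the corresponding state value,
# reshaped back to (player_sum, dealer_card, usable_ace) for plotting.
greedy_actions = [argmax(col) for col in eachcol(QT)]
V_star = reshape([maximum(col) for col in eachcol(QT)], 31, 10, 2)
```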
V_agent
```
Agent
├─ policy => VBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 620-element Array{Float64,1}
│  │  │  └─ optimizer => InvDecay
│  │  │     ├─ gamma => 1.0
│  │  │     └─ state => IdDict
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ mapping => Main.var"#17#18"
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
```

Figure 5.3

INIT_STATE
354
GOLD_VAL
-0.27726
StoreMSE
target_policy_mapping
#23 (generic function with 1 method)
ordinary_mse (generic function with 1 method)
weighted_mse (generic function with 1 method)
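
The remaining pieces reproduce Figure 5.3: every episode starts from `INIT_STATE` (which appears to be the flattened index of the single state studied in the book, player sum 13, dealer showing 2, usable ace, with true value `GOLD_VAL ≈ -0.27726`), actions are drawn from a uniformly random behaviour policy, and the mean squared error of ordinary versus weighted importance-sampling estimates of the target policy's value is tracked as episodes accumulate. The helpers below sketch only the two estimators; `returns` and `ratios` are assumed to hold, per episode, the return `G` and the importance-sampling ratio `ρ` (with a deterministic target policy and a uniform behaviour policy over two actions, `ρ` is either `0` or `2^T` for an episode of length `T`).

```julia
# Ordinary importance sampling: average ρ * G over the first n episodes.
ordinary_estimate(returns, ratios, n) = sum(ratios[i] * returns[i] for i in 1:n) / n

# Weighted importance sampling: the same sum, normalised by the total weight.
function weighted_estimate(returns, ratios, n)
    w = sum(ratios[i] for i in 1:n)
    w == 0 ? 0.0 : sum(ratios[i] * returns[i] for i in 1:n) / w
end

# The curves in Figure 5.3 plot the squared error against the true value,
# averaged over many independent runs, as a function of n.
squared_error(estimate, gold = GOLD_VAL) = (estimate - gold)^2
```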