# Ten-Armed Bandits Environment

In this chapter, we'll use the `MultiArmBanditsEnv` to study two main concepts in reinforcement learning: exploration and exploitation.

Let's take a look at the environment first.

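A minimal sketch of how the environment can be constructed (assuming the default constructor gives a ten-armed bandit with normally distributed true action values):

```julia
using ReinforcementLearning

# Construct the bandit environment with its defaults
# (assumed: 10 arms, true values drawn from a standard normal).
env = MultiArmBanditsEnv()
```
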
env
# MultiArmBanditsEnv

## Traits

| Trait Type        |                                            Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     |          ReinforcementLearningBase.SingleAgent() |
| DynamicStyle      |           ReinforcementLearningBase.Sequential() |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       |           ReinforcementLearningBase.Stochastic() |
| RewardStyle       |       ReinforcementLearningBase.TerminalReward() |
| UtilityStyle      |           ReinforcementLearningBase.GeneralSum() |
| ActionStyle       |     ReinforcementLearningBase.MinimalActionSet() |
| StateStyle        |   ReinforcementLearningBase.Observation{Int64}() |
| DefaultStateStyle |   ReinforcementLearningBase.Observation{Int64}() |

## Is Environment Terminated?

No

## State Space

`Base.OneTo(1)`

## Action Space

`Base.OneTo(10)`

## Current State

```
1
```
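A sketch of how such a figure could be produced (the field name `env.true_values` and the use of StatsPlots are assumptions, not the original plotting code):

```julia
using StatsPlots

# Sample 500 rewards per arm around its (assumed) true value and show
# the per-action reward distributions as violin plots.
xs = repeat(1:10, inner = 500)
ys = vcat([env.true_values[a] .+ randn(500) for a in 1:10]...)
violin(xs, ys; legend = false, xlabel = "Action", ylabel = "Reward distribution")
```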

The figure above shows the reward distribution of each action (cf. Figure 2.1 in the book).


Now we create a testbed to calculate the average reward and the percentage of optimal actions.

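A sketch of what such a `CollectBestActions` hook could look like (its fields and inner logic are assumptions): at every `PreActStage` it records whether the chosen action equals the known best action.

```julia
# Hypothetical hook: collects one Bool per step telling whether the
# agent picked the optimal arm.
Base.@kwdef struct CollectBestActions <: AbstractHook
    best_action::Int
    isbest::Vector{Bool} = Bool[]
end

function (h::CollectBestActions)(::PreActStage, agent, env, action)
    push!(h.isbest, action == h.best_action)
end
```
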
CollectBestActions

Writing a customized hook is easy.

  1. Define your struct and, optionally, make it a subtype of AbstractHook.

  2. Write your customized runtime logic by overriding some of the following methods (see the usage sketch after this list). By default, they do nothing if your hook is a subtype of AbstractHook.

    • (h::YourHook)(::PreActStage, agent, env, action)

    • (h::YourHook)(::PostActStage, agent, env)

    • (h::YourHook)(::PreEpisodeStage, agent, env)

    • (h::YourHook)(::PostEpisodeStage, agent, env)
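
As a usage sketch, such a hook is passed to `run` together with a policy, an environment and a stop condition (the `true_values` field and the choice of a `RandomPolicy` are assumptions used only for illustration):

```julia
env = MultiArmBanditsEnv()
hook = CollectBestActions(best_action = argmax(env.true_values))

# Run a uniformly random policy for 1000 steps and check how often it
# happens to pick the optimal arm (≈ 10% in expectation).
run(RandomPolicy(action_space(env)), env, StopAfterStep(1_000), hook)
sum(hook.isbest) / length(hook.isbest)
```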

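`bandit_testbed` runs an agent (ε-greedy in the book's testbed) on the environment and collects the data behind the average-reward and optimal-action curves. As a package-independent illustration of the same idea (all names below are made up for this sketch, not the package API):

```julia
# Illustrative only: one ε-greedy run on a single 10-armed bandit,
# estimating action values with incremental sample averages.
function simple_bandit_run(; k = 10, ϵ = 0.1, n_steps = 1000)
    q_true = randn(k)                 # true value of each arm
    best   = argmax(q_true)
    Q      = zeros(k)                 # value estimates
    N      = zeros(Int, k)            # pull counts
    rewards, isbest = Float64[], Bool[]
    for _ in 1:n_steps
        a = rand() < ϵ ? rand(1:k) : argmax(Q)
        r = q_true[a] + randn()       # reward ~ Normal(q_true[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]     # incremental sample-average update
        push!(rewards, r)
        push!(isbest, a == best)
    end
    rewards, isbest
end
```
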
bandit_testbed (generic function with 1 method)

Similar to the `bandit_testbed` function, we'll create a new function to test the performance of the `GradientBanditLearner`.

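The `GradientBanditLearner` corresponds to the gradient bandit algorithm of the book (Section 2.8): it keeps a numerical preference per action, turns the preferences into a softmax policy, and moves them according to the difference between the received reward and a baseline. A plain-Julia sketch of that update (illustrative, not the package implementation):

```julia
# One gradient-bandit update: H holds the action preferences, a is the
# action just taken, r the received reward and α the step size.
function gradient_bandit_step!(H, a, r, baseline; α = 0.1)
    probs = exp.(H .- maximum(H))     # softmax probabilities (stabilized)
    probs ./= sum(probs)
    for i in eachindex(H)
        H[i] += α * (r - baseline) * ((i == a) - probs[i])
    end
    H
end
```
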
gb_bandit_testbed (generic function with 1 method)

Note that there's a keyword argument named `baseline` in the `GradientBanditLearner`. It can be either a number or a callable (`reward -> value`). One such function mentioned in the book computes the running average of the rewards seen so far.

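A sketch of such a running-average baseline as a small callable struct (the field names are assumptions):

```julia
# Hypothetical running average: each call feeds in one reward and
# returns the mean of all rewards seen so far.
Base.@kwdef mutable struct SampleAvg
    t::Int = 0
    avg::Float64 = 0.0
end

function (s::SampleAvg)(x)
    s.t += 1
    s.avg += (x - s.avg) / s.t
    s.avg
end
```

Passing `baseline = SampleAvg()` would then make the learner use the average reward observed so far as its baseline.
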
SampleAvg