Implement Multi-Agent Reinforcement Learning Algorithms in Julia

This is a technical report for the summer OSPP project Implement Multi-Agent Reinforcement Learning Algorithms in Julia. The report covers two parts: the first section is a basic introduction to the project, and the second section describes the implementation details of several multi-agent algorithms, followed by workable usage examples.


Table of Contents

  1. Project Information
     1.1 Schedule
     1.2 Accomplished Work
  2. Implementation and Usage
     2.1 An Introduction to Agent
     2.2 Neural Fictitious Self-Play (NFSP) Algorithm
         - Brief Introduction
         - Implementation
         - Usage
     2.3 Multi-Agent Deep Deterministic Policy Gradient (MADDPG) Algorithm
         - Brief Introduction
         - Implementation
         - Usage
     2.4 Exploitability Descent (ED) Algorithm
         - Brief Introduction
         - Implementation
         - Usage

  1. Project Information

Recent advances in reinforcement learning have led to many breakthroughs in artificial intelligence. Some of the latest deep reinforcement learning algorithms have been implemented in ReinforcementLearning.jl with Flux. Currently, only some CFR-related algorithms are implemented there. We'd like to have more implemented, including MADDPG, COMA, NFSP, and PSRO.

1.1 Schedule

Date          | Mission Content
07/01 – 07/14 | Refer to the paper and the existing implementation to get familiar with the NFSP algorithm.
07/15 – 07/29 | Add the NFSP algorithm to ReinforcementLearningZoo.jl and test it on the KuhnPokerEnv.
07/30 – 08/07 | Fix the existing bugs of NFSP and implement the MADDPG algorithm in ReinforcementLearningZoo.jl.
08/08 – 08/15 | Update the MADDPG algorithm and test it on the KuhnPokerEnv; also complete the mid-term report.
08/16 – 08/23 | Add support for environments of FULL_ACTION_SET in MADDPG and test it on more games, such as simple_speaker_listener.
08/24 – 08/30 | Fine-tune the experiment MADDPG_SpeakerListener and consider implementing the ED algorithm.
08/31 – 09/06 | Play games in 3rd-party OpenSpiel with the NFSP algorithm.
09/07 – 09/13 | Implement the ED algorithm and play "kuhn_poker" in OpenSpiel with ED.
09/14 – 09/20 | Fix the existing problems in the implemented ED algorithm and update the report.
09/22 – 09/30 | ...

1.2 Accomplished Work

From July 1st to now, I have implemented the Neural Fictitious Self-Play (NFSP) and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithms in ReinforcementLearningZoo.jl. Some workable experiments (see the Usage part of each algorithm's section) have also been added to the documentation. Besides, to test the performance of the MADDPG algorithm, I implemented SpeakerListenerEnv in ReinforcementLearningEnvironments.jl. Related commits are listed below:

  2. Implementation and Usage

In this section, I will first briefly review the Agent structure defined in ReinforcementLearningCore.jl. Then I'll explain how these multi-agent algorithms (NFSP, MADDPG, ...) are implemented, followed by short examples demonstrating how others can use them in their customized environments.

2.1 An Introduction to Agent

The Agent struct is an extended AbstractPolicy that includes a concrete policy and a trajectory. The trajectory is used to collect the necessary information to train the policy. In the existing code, the lifecycle of the interactions between agents and environments is split into several stages, including PreEpisodeStage, PreActStage, PostActStage and PostEpisodeStage.

function (agent::Agent)(stage::AbstractStage, env::AbstractEnv)
    update!(agent.trajectory, agent.policy, env, stage)
    update!(agent.policy, agent.trajectory, env, stage)
end

function (agent::Agent)(stage::PreActStage, env::AbstractEnv, action)
    update!(agent.trajectory, agent.policy, env, stage, action)
    update!(agent.policy, agent.trajectory, env, stage)
end

When running an experiment, the built-in run function triggers these callbacks at each stage, so the agent updates its policy and trajectory according to the behaviors we have defined. Thanks to multiple dispatch in Julia, the main focus when implementing a new algorithm is customizing how the training information is collected and how the policy is updated at each specific stage. For more details, you can refer to this blog.
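To make the stage-dispatch pattern concrete, here is a minimal, self-contained sketch (toy types of my own, not the actual library code) showing how multiple dispatch lets a run loop call one method per stage:

```julia
# Toy re-creation of the stage-dispatch pattern (hypothetical types;
# the real `run` in ReinforcementLearningCore.jl is more elaborate).
abstract type AbstractStage end
struct PreEpisodeStage  <: AbstractStage end
struct PreActStage      <: AbstractStage end
struct PostActStage     <: AbstractStage end
struct PostEpisodeStage <: AbstractStage end

mutable struct ToyAgent
    calls::Vector{Symbol}   # records which hooks fired, in order
end

# One method per stage -- this is where a new algorithm customizes behavior.
(agent::ToyAgent)(::PreEpisodeStage)  = push!(agent.calls, :pre_episode)
(agent::ToyAgent)(::PreActStage)      = push!(agent.calls, :pre_act)
(agent::ToyAgent)(::PostActStage)     = push!(agent.calls, :post_act)
(agent::ToyAgent)(::PostEpisodeStage) = push!(agent.calls, :post_episode)

# A stripped-down run loop calling the agent at each stage.
function toy_run!(agent::ToyAgent; steps = 2)
    agent(PreEpisodeStage())
    for _ in 1:steps
        agent(PreActStage())
        # ... the environment would step here ...
        agent(PostActStage())
    end
    agent(PostEpisodeStage())
    return agent.calls
end
```

Overriding any one of these methods changes the behavior at exactly that stage, which is how the multi-agent policies in the rest of this report hook into the loop.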

2.2 Neural Fictitious Self-play(NFSP) algorithm

Brief Introduction

The Neural Fictitious Self-Play (NFSP) algorithm is a useful multi-agent algorithm that works well on imperfect-information games. Each agent that applies the NFSP algorithm holds two inner agents: a Reinforcement Learning (RL) agent and a Supervised Learning (SL) agent. The RL agent learns a best response to the other players' policies from the self-play process, and the SL agent imitates the RL agent's past behavior to learn an average policy. More importantly, NFSP uses two technical innovations to ensure stability: reservoir sampling for the SL agent, and anticipatory dynamics during training, i.e., in each episode the agent acts with its best-response (RL) policy with probability η and with its average (SL) policy otherwise. The following figure (from the paper) shows the overall structure of NFSP (one agent).

The overall structure of NFSP(one agent).
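As a side note, reservoir sampling can be sketched in a few lines of plain Julia (my own illustration of the technique behind the SL agent's buffer, not the library's code): it maintains a uniform random sample of k items from a stream without storing the whole stream.

```julia
using Random

# Algorithm R: after seeing i items, each of them remains in the reservoir
# with probability k/i, so the sample stays uniform over the whole stream.
function reservoir_sample(stream, k; rng = Random.GLOBAL_RNG)
    reservoir = eltype(stream)[]
    for (i, x) in enumerate(stream)
        if i <= k
            push!(reservoir, x)            # fill the reservoir first
        else
            j = rand(rng, 1:i)             # keep `x` with probability k/i
            j <= k && (reservoir[j] = x)
        end
    end
    return reservoir
end
```

This is why the SL agent can learn a time-average of the RL agent's behavior even though the replay capacity is bounded.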

Implementation

In ReinforcementLearningZoo.jl, I implemented NFSPAgent, which defines the NFSPAgent struct and designs its behaviors according to the NFSP algorithm, including collecting the needed information and updating the policy. The NFSPAgentManager is a special multi-agent manager in which all agents apply the NFSP algorithm. Besides, in abstract_nfsp, I customized the run function for the NFSPAgentManager.

Since the core of the algorithm is to define the behavior of the NFSPAgent, I'll explain how that is done in the following:

mutable struct NFSPAgent <: AbstractPolicy
    rl_agent::Agent
    sl_agent::Agent
    η # anticipatory parameter
    rng
    update_freq::Int # update frequency
    update_step::Int # count the steps
    mode::Bool # `true` for best-response mode (RL agent's policy), `false` for average-policy mode (SL agent's policy). Only used in training.
end

Based on our discussion in section 2.1, the core of the NFSPAgent is to customize its behavior in different stages:

At the PreEpisodeStage, the NFSPAgent should be set to training mode according to the anticipatory dynamics. Besides, the terminal state and the dummy action of the last episode must be removed at the beginning of each episode (see the note):

function (π::NFSPAgent)(stage::PreEpisodeStage, env::AbstractEnv, ::Any)
    # delete the terminal state and dummy action.
    update!(π.rl_agent.trajectory, π.rl_agent.policy, env, stage)

    # set the training mode before the episode (anticipatory dynamics).
    π.mode = rand(π.rng) < π.η
end

At the PreActStage, the NFSPAgent should collect its personal state and action and add them into the RL agent's trajectory. If it is set to best-response mode, the SL agent's trajectory also needs to be updated. Besides, if the update condition is satisfied, the inner agents need to be updated as well. The code looks like the following:

function (π::NFSPAgent)(stage::PreActStage, env::AbstractEnv, action)
    rl = π.rl_agent
    sl = π.sl_agent
    # update trajectory
    if π.mode
        update!(sl.trajectory, sl.policy, env, stage, action)
        rl(stage, env, action)
    else
        update!(rl.trajectory, rl.policy, env, stage, action)
    end

    # update policy
    π.update_step += 1
    if π.update_step % π.update_freq == 0
        if π.mode
            update!(sl.policy, sl.trajectory)
        else
            rl_learn!(rl.policy, rl.trajectory)
            update!(sl.policy, sl.trajectory)
        end
    end
end

At the PostActStage, after executing the action, the NFSPAgent needs to add its personal reward and the is_terminated result of the current state into the RL agent's trajectory.

function (π::NFSPAgent)(::PostActStage, env::AbstractEnv, player::Any)
    push!(π.rl_agent.trajectory[:reward], reward(env, player))
    push!(π.rl_agent.trajectory[:terminal], is_terminated(env))
end

At the PostEpisodeStage, when the episode is terminated, the agent should push the terminal state and a dummy action (see also the note) into the RL agent's trajectory. Also, the reward and is_terminated results need to be corrected, to avoid collecting wrong samples when playing games of SEQUENTIAL style or TERMINAL_REWARD style.

function (π::NFSPAgent)(::PostEpisodeStage, env::AbstractEnv, player::Any)
    rl = π.rl_agent
    sl = π.sl_agent
    # update trajectory
    if !rl.trajectory[:terminal][end]
        rl.trajectory[:reward][end] = reward(env, player)
        rl.trajectory[:terminal][end] = is_terminated(env)
    end

    action = rand(action_space(env, player))
    push!(rl.trajectory[:state], state(env, player))
    push!(rl.trajectory[:action], action)
    if haskey(rl.trajectory, :legal_actions_mask)
        push!(rl.trajectory[:legal_actions_mask], legal_action_space_mask(env, player))
    end
    
    # update the policy    
    ...# here is the same as PreActStage `update the policy` part.
end

Usage

According to the paper, the RL agent defaults to a QBasedPolicy with a CircularArraySARTTrajectory, and the SL agent defaults to a BehaviorCloningPolicy with a ReservoirTrajectory. So you can customize the agents under these restrictions and test the algorithm on any multi-agent game you are interested in. Note that if the game's states can't be used directly as the network's input, you need to add a state-related wrapper to the environment before applying the algorithm.

Here is one experiment, JuliaRL_NFSP_KuhnPoker, as a usage example, which tests the algorithm on the Kuhn Poker game. Since the states in the existing KuhnPokerEnv are tuples of symbols, I simply encode them like the following:

env = KuhnPokerEnv()
wrapped_env = StateTransformedEnv(
    env;
    state_mapping = s -> [findfirst(==(s), state_space(env))],
    state_space_mapping = ss -> [[findfirst(==(s), state_space(env))] for s in state_space(env)]
    )

In this experiment, the RL agent uses a DQNLearner to learn the best response:

rl_agent = Agent(
    policy = QBasedPolicy(
        learner = DQNLearner(
            approximator = NeuralNetworkApproximator(
                model = Chain(
                    Dense(ns, 64, relu; init = glorot_normal(rng)),
                    Dense(64, na; init = glorot_normal(rng))
                ) |> cpu,
                optimizer = Descent(0.01),
            ),
            target_approximator = NeuralNetworkApproximator(
                model = Chain(
                    Dense(ns, 64, relu; init = glorot_normal(rng)),
                    Dense(64, na; init = glorot_normal(rng))
                ) |> cpu,
            ),
            γ = 1.0f0,
            loss_func = huber_loss,
            batch_size = 128,
            update_freq = 128,
            min_replay_history = 1000,
            target_update_freq = 1000,
            rng = rng,
        ),
        explorer = EpsilonGreedyExplorer(
            kind = :linear,
            ϵ_init = 0.06,
            ϵ_stable = 0.001,
            decay_steps = 1_000_000,
            rng = rng,
        ),
    ),
    trajectory = CircularArraySARTTrajectory(
        capacity = 200_000,
        state = Vector{Int} => (ns, ),
    ),
)

And the SL agent is defined as the following:

sl_agent = Agent(
    policy = BehaviorCloningPolicy(;
        approximator = NeuralNetworkApproximator(
            model = Chain(
                    Dense(ns, 64, relu; init = glorot_normal(rng)),
                    Dense(64, na; init = glorot_normal(rng))
                ) |> cpu,
            optimizer = Descent(0.01),
        ),
        explorer = WeightedSoftmaxExplorer(),
        batch_size = 128,
        min_reservoir_history = 1000,
        rng = rng,
    ),
    trajectory = ReservoirTrajectory(
        2_000_000;# reservoir capacity
        rng = rng,
        :state => Vector{Int},
        :action => Int,
    ),
)

Based on the defined inner agents, the NFSPAgentManager can be customized as the following:

nfsp = NFSPAgentManager(
    Dict(
        (player, NFSPAgent(
            deepcopy(rl_agent),
            deepcopy(sl_agent),
            0.1f0, # anticipatory parameter
            rng,
            128, # update_freq
            0, # initial update_step
            true, # initial NFSPAgent's training mode
        )) for player in players(wrapped_env) if player != chance_player(wrapped_env)
    )
)

Given the stop_condition and the designed hook in the experiment, you can simply call run(nfsp, wrapped_env, stop_condition, hook) to perform the experiment, and use Plots.plot to get the following result. (Here nash_conv is a common metric measuring how far the agents' joint policy is from a Nash equilibrium, and hence the performance of a multi-agent reinforcement learning algorithm.)

Play KuhnPoker with NFSP.
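To make the metric concrete, here is a toy computation of nash_conv for a two-player zero-sum normal-form game (my own illustration, not the library's implementation). NashConv sums, over the players, the gap between each player's best-response value and their current value; it is 0 exactly at a Nash equilibrium.

```julia
using LinearAlgebra

# A: payoff matrix of the row player (the column player receives -A).
# p, q: mixed strategies of the row and column players.
function nash_conv(A, p, q)
    v = p' * A * q                  # row player's current expected value
    br_row = maximum(A * q)         # row player's best-response value against q
    br_col = maximum(-(A' * p))     # column player's best-response value against p
    return (br_row - v) + (br_col + v)
end
```

For matching pennies with A = [1 -1; -1 1], the uniform profile gives nash_conv = 0 (it is the equilibrium), while always playing the first row against a uniform opponent gives nash_conv = 1.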

Besides, you can also use the NFSPAgentManager to play games implemented in the 3rd-party OpenSpiel (see the doc), such as "kuhn_poker":

env = OpenSpielEnv("kuhn_poker")
wrapped_env = ActionTransformedEnv(
    env,
    # action is `0-based` in OpenSpiel, while `1-based` in Julia.
    action_mapping = a -> RLBase.current_player(env) == chance_player(env) ? a : Int(a - 1),
    action_space_mapping = as -> RLBase.current_player(env) == chance_player(env) ? 
        as : Base.OneTo(num_distinct_actions(env.game)),
)
# `InformationSet{String}()` is not supported when training.
wrapped_env = DefaultStateStyleEnv{InformationSet{Array}()}(wrapped_env)

Apart from the above environment setting, most details are the same as in NFSP_KuhnPoker. The result is shown below. For more details, you can refer to the experiment NFSP_OpenSpiel.

Play "kuhn_poker" in OpenSpiel with NFSP.

2.3 Multi-agent Deep Deterministic Policy Gradient(MADDPG) algorithm

Brief Introduction

The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm extends Deep Deterministic Policy Gradient (DDPG) so that it also works well on multi-agent games. Based on DDPG, each agent's critic in MADDPG has access to all agents' policies according to the paper's hypothesis of centralized training, including their personal states and actions, which helps produce a more reasonable evaluation of the actor's policy. The following figure (from the paper) illustrates the framework of MADDPG.

The framework of MADDPG.

Implementation

Given that DDPGPolicy is already implemented in ReinforcementLearningZoo.jl, I implemented the MADDPGManager, a special multi-agent manager in which all agents apply a DDPGPolicy with an improved, centralized critic. The structure of MADDPGManager is as the following:

mutable struct MADDPGManager <: AbstractPolicy
    agents::Dict{<:Any, <:Agent}
    traces
    batch_size::Int
    update_freq::Int
    update_step::Int
    rng::AbstractRNG
end

Each agent in the MADDPGManager uses a DDPGPolicy with its own trajectory, which collects that agent's personal information. Note that the policy of each Agent should be wrapped with NamedPolicy. NamedPolicy is a useful substruct of AbstractPolicy for multi-agent games, which combines the player's name with the concrete policy, so that the Agent's default behaviors can be reused to collect the necessary information.

As for updating the policy, the process is mostly the same as in DDPGPolicy, except that each agent's critic assembles all agents' personal states and actions. For more details, you can refer to the code.
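The assembling step itself is just a concatenation. As a small illustration (toy data of my own, not the actual implementation), the centralized critic's input stacks every agent's state and action into one vector in a fixed player order:

```julia
# Toy state/action data for two players (hypothetical values).
states  = Dict(:Speaker => [0.1, 0.2, 0.3], :Listener => [0.4, 0.5])
actions = Dict(:Speaker => [0.7],           :Listener => [0.8, 0.9])

players = (:Speaker, :Listener)    # a fixed ordering is essential
critic_input = vcat((states[p] for p in players)...,
                    (actions[p] for p in players)...)

# The critic's input dimension is the sum of all state and action sizes.
critic_dim = sum(length(states[p]) + length(actions[p]) for p in players)
```

Because every critic sees the full joint state-action vector during training, the environment looks stationary to it even though the other agents' policies keep changing.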

Note that when calculating the loss of the actor's behavior network, a regularization term on the actor's raw output is added to improve the algorithm's performance, which differs from DDPG:

gs2 = gradient(Flux.params(A)) do
    v = C(vcat(s, mu_actions)) |> vec
    reg = mean(A(batches[player][:state]) .^ 2)
    -mean(v) +  reg * 1e-3 # loss
end

Usage

Here the MADDPGManager is designed for environments of SIMULTANEOUS style with continuous action spaces (see the blog Diagonal Gaussian Policies); alternatively, you can add action-related wrappers to the environment to make it work with the algorithm. There is one experiment, JuliaRL_MADDPG_KuhnPoker, as a usage example, which tests the algorithm on the Kuhn Poker game. Since Kuhn Poker is a SEQUENTIAL game with a discrete action space (see also the blog Diagonal Gaussian Policies), I wrap the environment like the following:

wrapped_env = ActionTransformedEnv(
        StateTransformedEnv(
            env;
            state_mapping = s -> [findfirst(==(s), state_space(env))],
            state_space_mapping = ss -> [[findfirst(==(s), state_space(env))] for s in state_space(env)]
            ),
        ## drop the dummy action of the other agent.
        action_mapping = x -> length(x) == 1 ? x : Int(ceil(x[current_player(env)]) + 1),
    )

And customize the actor's and critic's networks as follows:

rng = StableRNG(123)
ns, na = 1, 1 # dimension of the state and action.
n_players = 2 # the number of players

create_actor() = Chain(
        Dense(ns, 64, relu; init = glorot_uniform(rng)),
        Dense(64, 64, relu; init = glorot_uniform(rng)),
        Dense(64, na, tanh; init = glorot_uniform(rng)),
    )

create_critic() = Chain(
    Dense(n_players * ns + n_players * na, 64, relu; init = glorot_uniform(rng)),
    Dense(64, 64, relu; init = glorot_uniform(rng)),
    Dense(64, 1; init = glorot_uniform(rng)),
    )

Then the inner DDPGPolicy and trajectory can be designed like the following:

policy = DDPGPolicy(
    behavior_actor = NeuralNetworkApproximator(
        model = create_actor(),
        optimizer = ADAM(),
    ),
    behavior_critic = NeuralNetworkApproximator(
        model = create_critic(),
        optimizer = ADAM(),
    ),
    target_actor = NeuralNetworkApproximator(
        model = create_actor(),
        optimizer = ADAM(),
    ),
    target_critic = NeuralNetworkApproximator(
        model = create_critic(),
        optimizer = ADAM(),
    ),
    γ = 0.95f0,
    ρ = 0.99f0,
    na = na,
    start_steps = 1000,
    start_policy = RandomPolicy(-0.99..0.99; rng = rng),
    update_after = 1000,
    act_limit = 0.99,
    act_noise = 0.,
    rng = rng,
)
trajectory = CircularArraySARTTrajectory(
    capacity = 100_000, # replay buffer capacity
    state = Vector{Int} => (ns, ),
    action = Float32 => (na, ),
)

Based on the above policy and trajectory, the MADDPGManager can be defined as the following:

agents = MADDPGManager(
    Dict((player, Agent(
        policy = NamedPolicy(player, deepcopy(policy)),
        trajectory = deepcopy(trajectory),
    )) for player in players(env) if player != chance_player(env)),
    SARTS, # trace's type
    512, # batch_size
    100, # update_freq
    0, # initial update_step
    rng
)

Given the stop_condition and hook in the experiment, you can likewise run(agents, wrapped_env, stop_condition, hook) to perform the experiment, and use Plots.scatter to get the following result:

Play KuhnPoker with MADDPG.

However, Kuhn Poker is not a good choice for showing the performance of MADDPG. To better test the algorithm, I added SpeakerListenerEnv, a simple cooperative multi-agent game, to ReinforcementLearningEnvironments.jl.

Since the two players in SpeakerListenerEnv have different state and action dimensions, the policy and the trajectory are customized as below:

# initialize the game.
env = SpeakerListenerEnv(max_steps = 25)
# network's parameter initialization method.
init = glorot_uniform(rng)
# critic's input units, including both players' states and actions.
critic_dim = sum(length(state(env, p)) + length(action_space(env, p)) for p in (:Speaker, :Listener))
# actor and critic's network structure.
create_actor(player) = Chain(
    Dense(length(state(env, player)), 64, relu; init = init),
    Dense(64, 64, relu; init = init),
    Dense(64, length(action_space(env, player)); init = init)
    )
create_critic(critic_dim) = Chain(
    Dense(critic_dim, 64, relu; init = init),
    Dense(64, 64, relu; init = init),
    Dense(64, 1; init = init),
    )
# concrete DDPGPolicy of the player.
create_policy(player) = DDPGPolicy(
    behavior_actor = NeuralNetworkApproximator(
        model = create_actor(player),
        optimizer = Flux.Optimise.Optimiser(ClipNorm(0.5), ADAM(1e-2)),
    ),
    behavior_critic = NeuralNetworkApproximator(
        model = create_critic(critic_dim),
        optimizer = Flux.Optimise.Optimiser(ClipNorm(0.5), ADAM(1e-2)),
    ),
    target_actor = NeuralNetworkApproximator(
        model = create_actor(player),
    ),
    target_critic = NeuralNetworkApproximator(
        model = create_critic(critic_dim),
    ),
    γ = 0.95f0,
    ρ = 0.99f0,
    na = length(action_space(env, player)),
    start_steps = 0,
    start_policy = nothing,
    update_after = 512 * env.max_steps, # batch_size * env.max_steps
    act_limit = 1.0,
    act_noise = 0.,
    )
create_trajectory(player) = CircularArraySARTTrajectory(
    capacity = 1_000_000, # replay buffer capacity
    state = Vector{Float64} => (length(state(env, player)), ),
    action = Vector{Float64} => (length(action_space(env, player)), ),
    )

Based on the above policy and trajectory, we can design the corresponding MADDPGManager:

agents = MADDPGManager(
    Dict(
        player => Agent(
            policy = NamedPolicy(player, create_policy(player)),
            trajectory = create_trajectory(player),
        ) for player in (:Speaker, :Listener)
    ),
    SARTS, # trace's type
    512, # batch_size
    100, # update_freq
    0, # initial update_step
    rng
)

With the stop_condition and the designed hook, we can simply run(agents, env, stop_condition, hook) to perform the experiment and use Plots.plot to get the following result.

Play SpeakerListenerEnv with MADDPG.

2.4 Exploitability Descent(ED) algorithm

Brief Introduction

...

Implementation

...

Usage

...

Play "kuhn_poker" in OpenSpiel with ED.