Survey of Multiagent Reinforcement Learning
The survey paper can be accessed here: Busoniu et al. 2008. In this post I will try to condense the basic taxonomy and highlight the main lines along which work in multiagent reinforcement learning was conducted up to 2008. A later post will cover the work since then, particularly the last 2 years following the seminal Deep Q-Networks paper.
What about predefined behaviors? Why learn? While it is certainly possible to program multiple agents to act (and even coordinate) in a shared environment, and this has been deployed to great effect by companies like Disney for attractions and Amazon in their warehouses, it requires very controlled environments and a great deal of painstaking effort (in man-hours) to get those systems to work. Getting multiple agents to collaborate in any significant manner requires behavior that would be VERY difficult for humans to hand-design a priori. Moreover, in dynamic environments that keep changing, those hand-designed behaviors would soon become obsolete in the absence of any learning.
Background: Stochastic Games
 State transitions are the result of the joint action of all agents.
 The rewards and returns also depend upon the joint action.
 The Q-function of each agent depends on the joint action and is conditioned on the joint policy.
 If all agents’ reward functions are identical, they share the same goal and the SG is fully cooperative. If there are 2 agents and one’s reward is the negative of the other’s, the SG is fully competitive. Mixed games are stochastic games that are neither fully cooperative nor fully competitive.
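As a small illustration of these definitions (my own, not from the survey), a helper can classify a two-agent stage game directly from its reward tables:

```python
# Classify a two-agent stage game by its reward structure, per the
# definitions above. Rewards are dicts mapping joint actions to scalars.
# (Illustrative helper, not from the survey.)

def classify_game(r1, r2):
    """r1, r2: {(a1, a2): reward} for agents 1 and 2 over the same joint actions."""
    if all(r1[a] == r2[a] for a in r1):
        return "fully cooperative"      # identical reward functions
    if all(r1[a] == -r2[a] for a in r1):
        return "fully competitive"      # one's gain is exactly the other's loss
    return "mixed"

# Matching pennies: strictly opposed interests.
mp1 = {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}
mp2 = {a: -r for a, r in mp1.items()}
print(classify_game(mp1, mp2))  # fully competitive
```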
Benefits of MARL
 Exploiting the decentralized nature of the problem lends itself to easy parallelization and faster computation.
 Experience sharing through communication, teaching and imitation can help learn faster.
 Inherent robustness of the system, because multiple agents can take over from one another.
Issues with MARL
 What is a good learning goal for the agents, which will elicit the desired behavior? How do we incorporate both stability and adaptability?
 Keeping track of the other agents (who are also learning and improving)
 Learning agents render the environment non-stationary for any one agent, which invalidates the convergence guarantees of most single-agent RL algorithms such as Generalized Policy Iteration, Q-Learning, SARSA, etc.
 Scalability issues become even more pronounced with multiple agents: the curse of dimensionality rears its ugly head, with the dimensionality of the joint action space growing exponentially in the number of agents.
 The agents’ choice of actions must be mutually consistent, and if planning is done in the joint action space then coordination boils down to each agent consistently breaking ties between equally good strategies.
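The non-stationarity point can be made concrete with a toy simulation (my own illustration, not from the survey): agent 1 repeats the same action, yet the reward it observes drifts because agent 2's policy changes underneath it.

```python
import random

# Toy illustration of non-stationarity (not from the survey): agent 1 always
# plays action 0; its reward is 1 only when agent 2 also plays action 0.
# Agent 2's policy drifts over time, so the reward distribution agent 1
# observes for the SAME action changes under it.
random.seed(0)

T = 10_000
rewards = []
for t in range(T):
    p = 1.0 - t / T                     # agent 2 plays action 0 with drifting prob p
    a2 = 0 if random.random() < p else 1
    rewards.append(1.0 if a2 == 0 else 0.0)

early = sum(rewards[: T // 2]) / (T // 2)
late = sum(rewards[T // 2 :]) / (T // 2)
print(early, late)  # early mean is much higher: the "environment" has changed
```

From agent 1's perspective this looks like a changing MDP, which is exactly why single-agent convergence proofs no longer apply.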
Choosing a goal
 Stability: is important because stable agents reduce the non-stationarity in the learning process of the other agents, making the problem easier to solve. The following criteria have been used:
 Convergence of agents’ strategies to Nash equilibria (Nash Q-Learning)
 Convergence of agents’ strategies to an equilibrium strategy regardless of what the other agents are doing (Littman 2001)
 The agent’s capability to learn nearly accurate models of the other agents
 Adaptation: is needed because the environment dynamics keep evolving, and agents need to adapt to them. The following criteria have been used:
 Rationality: Each agent should converge to a best response when the other agents remain stationary (do not change their strategies)
 No-Regret: The agent should always achieve a return at least as good as that of any stationary strategy, for any set of strategies of the other agents. This also prevents the agent from being exploited by the other agents.
 Average Reward Bounds: Targeted optimality requires an average reward at least as good as that of a best response against a target class of opponents. Compatibility prescribes an average reward level in self-play, while safety prescribes a minimum level of average reward against all other agents.
 Opponent Awareness: Learn models of the other agents and formulate a best response to them
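To make the no-regret criterion concrete, here is a small sketch (my own illustration, not from the survey) computing the external regret of a played sequence against the best fixed action in hindsight; a no-regret learner drives this quantity to zero on average.

```python
# External regret of a played action sequence versus the best fixed action
# in hindsight. (Illustrative sketch, not from the survey.)

def external_regret(played, reward_table):
    """played: list of action indices; reward_table[t][a]: reward of action a at step t."""
    actual = sum(reward_table[t][a] for t, a in enumerate(played))
    best_fixed = max(
        sum(reward_table[t][a] for t in range(len(played)))
        for a in range(len(reward_table[0]))
    )
    return best_fixed - actual

# Two actions, three steps; the learner alternated while action 1 was best throughout.
table = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(external_regret([0, 1, 0], table))  # 2.0: best fixed action earned 3, we earned 1
```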
Some taxonomy for ease of reference
 Type of task targeted: Static (stateless) or Dynamic
 Degree of Awareness exhibited: Strongly linked to the learning goal. Algorithms focused on stability are relatively less aware of other agents. Algorithms focused on adaptation are obviously aware of others, and often use some form of modeling to keep track of the other agents’ policies.
 Inspiration: Fusion of concepts from game theory (fictitious play, AWESOME, Nash-Q, Minimax-Q), temporal-difference RL (Distributed-Q, FMQ), and direct policy search (IGA, WoLF-IGA)
 Homogeneity of the learning algorithms: Does the algorithm work only if all the agents use it (Team-Q, Nash-Q), or can the others use different learning algorithms (AWESOME, WoLF-PHC)?
 Prior Knowledge: A task model is available to the agent (AWESOME), or not (Team-Q, Nash-Q, WoLF-PHC)
 Agent Observations: An agent might need to observe the actions of the other agents (Team-Q, AWESOME), their actions and rewards (Nash-Q), or neither (WoLF-PHC)
Some MARL Algorithms
 Fully Cooperative Tasks: With decentralized decision making, learning the optimal joint-action values with Q-learning is easy, but coordination problems arise when multiple joint actions are optimal: if each agent breaks ties randomly, the result may be a suboptimal joint action. These algorithms assume perfect perception of the state and of the other agents’ actions. Most also suffer from the curse of dimensionality (except Distributed-Q and FMQ).
 Coordination-Free:
 Team-Q: Assumes a unique optimal joint action (hardly ever the case). Each agent learns the optimal joint-action values by Q-learning and uses them to select joint actions.
 Distributed-Q: Works only with deterministic policies. Each agent’s local Q-function (depending only on its own action) is updated only when the update increases the Q-value, and the local policy is updated only if it improves the Q-values. Provably converges to joint-optimal policies if rewards are positive and the Q-values are initialized to zero.
 Coordination-Based: Coordination graphs additively decompose the Q-function into local Q-functions, each depending on only a subset of the agents, so the smaller problems can be solved individually and their solutions aggregated.
 Indirect Coordination:
 Joint Action Learners: Static tasks: Each agent learns models of all the other agents by empirically counting how many times it has observed each of them taking a particular action.
 Frequency Max-Q (FMQ): Static tasks: Works only for deterministic tasks, where variance in the reward for an agent’s action can only come from the other agents’ actions. It empirically tracks the frequency with which each action has yielded good rewards in the past and increases the Q-values of the corresponding joint actions, so coordination develops.
 Optimal Adaptive Learning: Dynamic tasks: Uses virtual games in which optimal joint actions are rewarded with 1 and all others with 0. This biases the agents toward recently selected optimal actions, and it provably converges to optimal joint policies in fully cooperative tasks. However, its computational complexity is high.
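The optimistic update at the heart of Distributed-Q can be sketched in the single-state case (my simplification; the common-reward matrix below is made up). Because the update only ever raises a value, each agent's local q converges to the best joint reward achievable with its own action, without observing the other agent.

```python
# Stateless sketch of Distributed-Q's optimistic update (my single-state
# simplification; the reward matrix is illustrative). Each agent keeps a
# local q over its OWN actions, raised only when a higher reward is seen,
# assuming positive rewards and zero initialization as in Distributed-Q.
R = [[11.0, 3.0, 0.5],   # R[a1][a2]: common reward, optimum at (0, 0)
     [3.0, 7.0, 6.0],
     [5.0, 1.0, 2.0]]

q1 = [0.0] * 3
q2 = [0.0] * 3
for a1 in range(3):               # exhaustive exploration of joint actions
    for a2 in range(3):
        r = R[a1][a2]
        q1[a1] = max(q1[a1], r)   # update only if the value increases
        q2[a2] = max(q2[a2], r)

best = (q1.index(max(q1)), q2.index(max(q2)))
print(best)  # (0, 0): both agents independently pick their optimal component
```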
 Fully Competitive Tasks: The minimax principle applies: maximize the minimum expected reward. Minimax-Q is opponent-independent; it uses the minimax principle to compute strategies and values, and a TD rule to propagate those values. Opponent modeling can do better than the minimax return if the opponent is suboptimal (i.e., does not always take the most damaging action). The model is built by empirically observing the frequency with which the opponent takes each action in a particular state.
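The per-state computation in Minimax-Q solves a zero-sum stage game over mixed strategies. For a 2x2 game without a pure saddle point there is a closed-form solution, sketched below (my own illustration; larger games require a linear program).

```python
# Closed-form maximin for a 2x2 zero-sum game (row player's payoffs A).
# This is the kind of stage-game computation Minimax-Q performs per state;
# games larger than 2x2 need a linear program. (Illustrative sketch.)

def solve_2x2(A):
    (a, b), (c, d) = A
    # Pure saddle point: maximin over rows equals minimax over columns.
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        return None, maximin           # optimal play is pure; value = saddle value
    denom = a - b - c + d
    p = (d - c) / denom                # probability of playing row 0
    value = (a * d - b * c) / denom
    return p, value

# Matching pennies: mix 50/50, game value 0.
p, v = solve_2x2([[1.0, -1.0], [-1.0, 1.0]])
print(p, v)  # 0.5 0.0
```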
 Mixed Tasks: Most of the following algorithms are for static games, which have limited applicability in real-world problems; WoLF-PHC and NSCP are exceptions. Static-game algorithms also assume that an exact task model is available, which is usually not the case. Understanding the conditions under which single-agent RL works in multiagent settings would be an important step, since theoretical guarantees are available for single-agent RL algorithms.
 Single-agent Q-Learning: Tuyls et al. 2006 applied evolutionary analysis to the behavior of Q-learning with Boltzmann policies in repeated games, and found limited convergence in some games.
 Agent-Independent methods: Policies and state values are computed with game-theoretic solvers for each stage game; for example, Nash Q-Learning computes a Nash equilibrium of each stage game. However, if this equilibrium is not unique, the equilibrium selection problem arises: each agent might break ties differently and end up with different updates.
 Agent-Tracking methods: Estimate models of the other agents’ strategies and formulate a best response to them. Usually not stable. Observability of the other agents’ actions is assumed. Examples: fictitious play and Non-Stationary Converging Policies (NSCP).
 AgentAware methods: Target stability as well as adaptability.
 AWESOME: Adapts to the others’ strategies when they appear stationary (using fictitious play), but retreats to a centrally precomputed Nash equilibrium strategy when they appear non-stationary. It is rational, converges to a Nash equilibrium in self-play, and is valid for any number of actions and players, while relying only on observing the other agents’ actions, not their strategies.
 Infinitesimal Gradient Ascent (IGA), Win-or-Learn-Fast IGA (WoLF-IGA) and Generalized IGA (GIGA) are direct policy search methods that use gradient update rules which guarantee convergence in specific static games. IGA works for 2-agent, 2-action games and uses constant gradient steps; the average rewards converge to the Nash rewards for an infinitesimally small step size. WoLF-IGA uses a small learning rate when it is winning and a large learning rate when it is losing.
 WoLF Policy Hill-Climbing (WoLF-PHC) updates the policy with a WoLF step size and the Q-function with the Q-learning rule.
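The policy-hill-climbing core of WoLF-PHC can be sketched in a stateless setting (my simplification: a single fixed policy step size rather than the win/lose pair, one state, and a stationary opponent, which is exactly the setting where rationality demands convergence to a best response):

```python
import random

# Stateless policy hill-climbing (the PHC core of WoLF-PHC), simplified to a
# single state and one fixed step size (my illustration; WoLF-PHC switches
# between two step sizes). Against a stationary opponent who always plays
# column 0 in matching pennies, a rational learner should converge to the
# best response, row action 0.
random.seed(1)

A = [[1.0, -1.0], [-1.0, 1.0]]   # row player's payoffs
Q = [0.0, 0.0]
pi = [0.5, 0.5]
alpha, delta = 0.1, 0.01          # Q-learning rate and policy step size

for _ in range(5000):
    a = 0 if random.random() < pi[0] else 1
    r = A[a][0]                                   # opponent fixed on column 0
    Q[a] += alpha * (r - Q[a])                    # stateless Q-learning update
    best = 0 if Q[0] >= Q[1] else 1               # hill-climb toward greedy action
    pi[best] = min(1.0, pi[best] + delta)
    pi[1 - best] = 1.0 - pi[best]

print(round(pi[0], 2))  # -> 1.0: the best response has been learned
```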
 Explicit Coordination Mechanisms: Can work in any of the above three settings. The coordination problem usually amounts to all agents breaking ties in the same way, which can be achieved using social roles, conventions, and communication. Social roles restrict the set of actions available to an agent and can prevent ties. Social conventions encode preferences toward certain joint actions and, if properly designed, can eliminate ties entirely. Communication can obviously be used to share action choices, either alone or together with conventions and roles. If communication is available, only an ordering of the agents has to be known to break ties deterministically.
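A minimal sketch of a social convention (my own illustration): if all agents share an ordering over joint actions, here lexicographic in agent order, each one independently applies it to the tied optima and they all agree without communicating.

```python
# Tie-breaking via a social convention (illustrative sketch). All agents
# share an ordering over joint actions (lexicographic in agent order); each
# applies it independently to the tied optima, so their choices coincide.

def pick_by_convention(joint_values):
    """joint_values: {(a1, a2, ...): value}. Returns the lexicographically
    smallest joint action among those with maximal value."""
    best = max(joint_values.values())
    return min(a for a, v in joint_values.items() if v == best)

# Two equally good joint actions; agents breaking ties randomly could land
# on the poor off-diagonal combinations, the convention prevents that.
values = {(0, 0): 10, (0, 1): 0, (1, 0): 0, (1, 1): 10}
print(pick_by_convention(values))  # (0, 0): every agent computes the same choice
```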
Conclusions
 Scalability is the big challenge for MARL. Most of the methods presented above require a tabular representation of Q-values, which is no good for any real application with continuous state and action spaces. The work on function approximation in single-agent RL needs to be extended to multiple agents.
 Exploiting the decentralized, modular structure of the multiagent task can distinguish MARL methods completely from single-agent RL methods and provide a quantum leap in performance.
 Learning can be made easier by providing some domain knowledge: choosing a suitable model for the approximator, shaping rewards to encourage promising behaviour, decomposing the task into subtasks, and using hierarchical RL.