Breakthroughs in technology are typically attributed to a lone genius, but research led by DeepMind scientist Thore Graepel suggests the full power of AI will be unleashed through a collective, multi-agent approach.
The UCL machine learning professor helped create AlphaGo, which pursued an individual strategy called competitive self-play to become the first computer program to defeat a human professional Go player in 2015.
He's since turned his focus from competition to cooperation, using deep reinforcement learning to understand how teamwork develops among self-interested agents, whether they're computer programmes or humans facing social dilemmas.
"We believe that this kind of model is a powerful baseline to study these kinds of social dilemmas in more detail," said Graepel at the AI for Social Good symposium at the Turing Institute.
His work forms part of DeepMind's ambitious mission "to solve intelligence". This formidable objective sensibly has no deadline, but pursuing it requires a clear definition of intelligence.
Graepel refers to the definition developed by DeepMind cofounder and chief scientist Shane Legg in his PhD with researcher Marcus Hutter:
"Intelligence measures an agent's ability to achieve goals in a wide range of environments."
This agent is typically thought of as a single individual, but may be better described as a collection of collaborators, each with its own perceptions, goals and actions.
Workplaces and families, governments and markets are all multi-agent systems. The individuals in them compete, cooperate, negotiate and predict to reach their goals. AI should capture all of these actions.
"Intelligence didn't evolve in isolation," explained Graepel. "It required interactions and coevolution with other agents, and I think that's an important principle that we need to pay attention to when we develop artificially intelligent agents."
The artificial mind
The idea that the artificial mind should be built of many components is not a new one. AI pioneer Marvin Minsky developed the concept into a theory he called "The Society of Mind".
In his 1986 book of the same name, Minsky describes the mind as a "society" of individual agents whose interactions form a unified mind.
These multi-agent designs are also typically robust. Agents can leave and be replaced, and the system can scale, move or develop as required.
There are also big challenges in multi-agent systems. An individual agent often only has a local view of what's happening, while the larger organisation might be pursuing a longer-term global goal.
Agents are always learning, which changes the environment and moves the targets for every other agent. Incentives are necessary to ensure they behave in a way that benefits the team.
Early humans would compete with other tribes and animals for resources and cooperate with each other to acquire and share food, shelter and knowledge. This multi-agent collaboration developed the languages and societies of today.
The behaviour of agents in these environments sits somewhere between the poles of competition and cooperation.
Graepel extends this idea to AI.
"The claim here is that it will not be possible to develop an intelligent agent if we just view it in isolation," he said.
Learning through competition to become the world's best Go player
Go has been played for thousands of years and has 40 million players today. It's also incredibly complex, which is what made it a compelling challenge for computer scientists.
In chess, there are on average about 30 possible moves from every position. In Go, that number is closer to 300. The resulting search space is enormous, which makes it hard to evaluate the benefit of each move.
The AlphaGo team trained a neural network on hundreds of thousands of games of historical data about positions and moves made by Go champions. In May 2017, the programme defeated Ke Jie, the world's number one player.
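A little arithmetic shows how quickly that gap compounds. The sketch below assumes a uniform branching factor (the 30 and 300 quoted above) and a hypothetical look-ahead depth of four moves; the function name and the depth are illustrative choices, not figures from Graepel's talk.

```python
def sequences(branching_factor: int, depth: int) -> int:
    """Count distinct move sequences of the given length,
    assuming every position offers the same number of moves."""
    return branching_factor ** depth

DEPTH = 4  # hypothetical look-ahead, chosen only for illustration

chess = sequences(30, DEPTH)
go = sequences(300, DEPTH)
ratio = go // chess

print(f"chess: {chess:,} sequences")
print(f"go:    {go:,} sequences")
print(f"go has {ratio:,} times as many")
```

A tenfold difference in moves per position becomes a ten-thousandfold difference after only four moves, and the gap keeps widening with every extra move of look-ahead.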
That version relied on hundreds of thousands of games of human data to develop its understanding. Its successor, AlphaGo Zero, would learn without this human input.
To do this, the AlphaGo team devised a way for it to autonomously build a curriculum that supported progressive learning, through a system known as self-play. This involved setting it up as a multi-agent problem in which AlphaGo Zero could play against itself to sharpen its skills.
It had no Go-specific knowledge beyond an understanding of the topography of the board and physical dynamics of the game, and built its knowledge through experimentation.
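The idea of self-play can be sketched on a far smaller game. The example below is not AlphaGo Zero's method, which combined neural networks with tree search; it is a minimal tabular sketch on a hypothetical stick-taking game (take one to three sticks, whoever takes the last stick wins), in which a single value table plays both sides. All names and parameters are assumptions made for illustration.

```python
import random

def legal_moves(sticks):
    """A player may take 1, 2 or 3 sticks, but no more than remain."""
    return range(1, min(3, sticks) + 1)

def greedy_move(q, sticks):
    """Pick the move with the highest learned value (unseen moves count as 0)."""
    return max(legal_moves(sticks), key=lambda m: q.get((sticks, m), 0.0))

def self_play_nim(start=10, episodes=5000, epsilon=0.2, lr=0.1, seed=0):
    """Learn move values by letting one table play both sides of the game."""
    rng = random.Random(seed)
    q = {}  # (sticks, move) -> value for the player making that move
    for _ in range(episodes):
        sticks, history = start, []
        while sticks > 0:
            if rng.random() < epsilon:          # explore occasionally
                move = rng.choice(list(legal_moves(sticks)))
            else:                               # otherwise play the best known move
                move = greedy_move(q, sticks)
            history.append((sticks, move))
            sticks -= move
        # The player who made the final move won; walk backwards through
        # the game, flipping the sign because the players alternate.
        reward = 1.0
        for state_move in reversed(history):
            old = q.get(state_move, 0.0)
            q[state_move] = old + lr * (reward - old)
            reward = -reward
    return q

q = self_play_nim()
for sticks in range(1, 11):
    print(sticks, "sticks -> take", greedy_move(q, sticks))
```

The table starts with no knowledge of good play, only the rules, and every game it plays against itself supplies training data for both the winner and the loser.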
At the beginning, AlphaGo Zero played randomly. After three days, it had surpassed the programme that beat world champion Lee Sedol in 2016. After 21 days, it had surpassed AlphaGo Master, the version that beat 60 Go masters online in 2017. After 40 days, it was likely the strongest Go player in the world.
The system had become a Go master through competitive self-play, but a more general AI could have even greater power if it learned through cooperation.
Learning through cooperation to address social dilemmas
Because AlphaGo competed in a game played by two individuals, it had no need to consider the concerns of a collective and no risk of encountering a social dilemma.
These situations arise when an individual can benefit from self-serving behaviour, but to do so would harm the group as a whole.
This creates tension between collective and individual rationality.
The selfish choice is tempting but could lead to the demise of the collective, through resource depletion, pollution and poverty.
So goes the tragedy of the commons, the theory that when no one owns a common resource, individuals will act in their own self-interest and contrary to the common good by exploiting the resource to its extinction.
Consider common pool resources such as fisheries, grazing pastures and irrigation systems. They're non-excludable and accessible to a large group of people, unlike private goods owned by an individual, and they're subtractable from the collective’s resources.
In 2009, Elinor Ostrom became the first woman to win the Nobel Prize in Economics for her analysis of these social dilemmas.
The previous academic consensus was that government regulation or privatisation was needed to distribute these resources fairly. Ostrom posited that, under certain conditions, communities could solve these problems themselves.
She drew inspiration from a trip to the Swiss alpine village of Törbel in the 1980s.
In the 15th century, the community of around 600 residents developed a system to regulate the use of their scarce land that is still in place today.
To limit the number of cows on the land, no citizen could send more cows than they could feed over the winter. Local officials can fine people who violate this rule, and receive half of the sum they collect as compensation for their efforts.
Graepel believes this model can be adapted for learning agents in AI systems. He and his colleagues devised a "commons game" to model this situation.
The commons game
In a digital orchard on a grid that bears a faint resemblance to Pac-Man, different agents search for apples represented as green dots and receive a reward whenever they eat one.
Apple growth is density dependent, which means that the more apples there are, the quicker they regrow. They do not grow at all if all of them have been eaten.
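The effect of density-dependent regrowth can be sketched with a toy model. This is not DeepMind's environment: the capacity, regrowth percentage and harvest rates below are hypothetical numbers chosen so the arithmetic is easy to follow, and the orchard is reduced to a single apple count rather than a grid.

```python
def regrow(apples: int, capacity: int = 100, rate_percent: int = 20) -> int:
    """One regrowth step: new apples are proportional to the apples
    still present, so an empty orchard never recovers."""
    new_apples = apples * rate_percent // 100
    return min(capacity, apples + new_apples)

def simulate(harvest_per_step: int, steps: int = 50, start: int = 50):
    """Harvest a fixed amount each step; return (total eaten, apples left)."""
    apples, total = start, 0
    for _ in range(steps):
        eaten = min(harvest_per_step, apples)
        apples -= eaten
        total += eaten
        apples = regrow(apples)
    return total, apples

greedy = simulate(harvest_per_step=50)   # strip the orchard immediately
moderate = simulate(harvest_per_step=8)  # leave enough apples to regrow

print("greedy:  ", greedy)
print("moderate:", moderate)
```

In this toy setting the greedy harvester eats 50 apples and then starves, while the moderate one eats 400 over the same period and leaves the orchard as full as it found it.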
The researchers added to this mix an equivalent to the fines in Törbel, by letting agents zap each other to temporarily suspend them from the game.
The agents were trained with similar algorithms to those previously used by DeepMind to teach AI agents to play Atari games. They were each added to the same environment but had to learn independently.
In the early incarnation of the game, a single agent would gorge on the apples until there were none left, leading to another tragedy of the commons.
When the agent figured out that it was best to eat apples in moderation so they could regrow, it discovered a sustainable alternative.
Establishing the optimal harvesting rate for an individual is however far easier than it is for the collective, as the researchers learnt when they extended the game to 12 agents.
They also added social metrics: efficiency, based on the number of apples harvested; equality, based on the disparity in what was collected; sustainability, based on regrowth; and peacefulness, assessed by the prevalence of zapping.
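One way such metrics could be computed from an episode log is sketched below. The exact formulas are assumptions made for illustration, not the researchers' definitions: efficiency is taken as apples per agent, equality as one minus the Gini coefficient of the harvests, sustainability as the average time step at which apples were eaten (later means the orchard lasted longer), and peacefulness as the share of agent time not spent suspended. The sample data is invented.

```python
def gini(values):
    """Gini coefficient: 0 means perfectly equal, values near 1 very unequal."""
    total = sum(values)
    if total == 0:
        return 0.0
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * len(values) * total)

def social_metrics(harvest_times, zapped_steps, episode_length):
    """harvest_times: per-agent lists of the steps at which an apple was eaten.
    zapped_steps: per-agent count of steps spent suspended."""
    counts = [len(times) for times in harvest_times]
    all_times = [t for times in harvest_times for t in times]
    agent_steps = len(harvest_times) * episode_length
    return {
        "efficiency": sum(counts) / len(counts),
        "equality": 1.0 - gini(counts),
        "sustainability": sum(all_times) / len(all_times) if all_times else 0.0,
        "peacefulness": (agent_steps - sum(zapped_steps)) / agent_steps,
    }

# Invented log for four agents over a 50-step episode.
metrics = social_metrics(
    harvest_times=[[10, 20, 30, 40], [10, 30], [20, 40], []],
    zapped_steps=[0, 5, 5, 10],
    episode_length=50,
)
print(metrics)
```

Tracking the four numbers side by side makes trade-offs visible, such as a strategy that raises efficiency while lowering peacefulness.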
The inefficiency of the agents' early efforts made the harvest somewhat sustainable through their incompetence. By episode 110 of the game, they had become good at harvesting to the detriment of the society, as they would quickly farm the apples to their extinction.
By episode 3900, the agents had developed a sustainable strategy through violence by zapping one another to reduce the demand.
Their measures of efficiency and equality had increased, but the peacefulness had declined.
The team added another element to the experiment to analyse territoriality, by equipping only one agent with the zapping ability.
This agent stuck to a corner of the map and zapped the other agents in this area, keeping it locally sustainable by scaring the others away. They would retreat to other areas that were outside its jurisdiction, leading to overpopulation and another tragedy of the commons.
This model shows how agents can interact with each other to learn how to succeed in a multi-agent world.
"We have a baseline multi-agent reinforcement learning model for common pool resource appropriation, and the model shows the emergence of a mechanism of exclusion," said Graepel.
Competition or collaboration?
The competitive model helped the AI agents in AlphaGo Zero to develop their abilities by playing against each other, but cooperation can be a more efficient use of resources and generate better collective results.
The learning algorithm takes the agents close to a Nash equilibrium, a stable state in which no individual gains by changing strategy if the strategies of the others remain unchanged.
"In some sense, you hope that each agent plays a best response to the pool of other agents, and that that collection of best responses leads to a Nash equilibrium type of solution where then deviating from the learned collective of behaviour would be detrimental to everything," explained Graepel.
He added that a Nash equilibrium often isn't a desirable objective, because another outcome may offer a better payoff for the collective even though no individual can gain by changing strategy alone.
Researchers need ways to nudge the system or the learners towards better solutions, through interventions such as building walls to enable agents to act in locally sustainable ways. Further nudges such as introducing communication channels could lead to even better results.
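The gap between a Nash equilibrium and the best collective outcome can be seen in a two-player social dilemma. The payoff table below is a hypothetical prisoner's-dilemma-style example, not data from the commons game: each player either restrains their harvest (action 0) or over-harvests (action 1).

```python
from itertools import product

# Hypothetical payoffs as (row player, column player); higher is better.
# Action 0 = restrain harvest, action 1 = over-harvest.
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}
ACTIONS = (0, 1)

def is_nash(profile):
    """True if neither player can gain by changing only their own action."""
    a, b = profile
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

equilibria = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
best_collective = max(PAYOFFS, key=lambda p: sum(PAYOFFS[p]))

print("Nash equilibria:", equilibria)
print("Best collective outcome:", best_collective)
```

With these payoffs the only equilibrium is mutual over-harvesting, while mutual restraint gives the highest combined payoff, which is why an equilibrium alone is not necessarily the outcome a designer wants.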
AlphaGo Zero shows that cooperation among self-interested agents can extend to collaboration between humans and machines.
Humans have developed their Go tactics over millennia, passing them from masters to students through books, lessons and games.
AlphaGo Zero reviewed these lessons on the opening sequences of the game. Those moves it valued were adopted and those it deemed inferior were discarded.
"Human knowledge gets rediscovered but as appropriate it also gets discarded and replaced with better ideas about the game," said Graepel.
There will still be a role for the human among the multiple agents in AI learning environments, but it may shift from a teacher to a collaborator and then to a student.