Here's the URL mentioned in the caption:
https://openai.com/blog/emergent-tool-use/
It also has many videos and dynamic data visualizations; worth checking out!
So, here's the big picture here. A major requirement for creating "general AI" that can exist in the physical world is that the AI agents are able to automatically LEARN to interact with physical objects and the environment, and use them to their advantage.
The authors here are basically showing that ONE GREAT WAY of achieving that (i.e. getting agents to learn to interact with their surrounding objects appropriately) is to create a *competitive environment* for agents, where they are somewhat forced to learn/adapt in order to win/survive.
So they created a game of Hide-and-Seek, and put physical objects (blocks, etc) in the environment, such that both hiders and seekers can use them to their advantage if they find a way to.
Then the question arises: how do we evaluate how "capable" the agent is becoming? (Not just qualitatively, but quantitatively.) We want to know if the agent is actually learning general skills (like holding, counting, etc.) and becoming more capable overall, rather than just learning how to interact with a specific object in a specific manner.
To evaluate that, they propose that we measure the performance of the hide-n-seek trained agents on other tasks — how quickly they learn to generalize, etc.
For anyone attempting to read this paper, here's the prerequisite knowledge you'll need (and not need) for it. Most of the more technical terms are covered in the discussion threads, but in general:
...
1. Don't worry about terms like "autocurricula" etc. They're explained in the paper fairly well.
2. A basic understanding of policy gradients is helpful. Here's a lecture by Andrej Karpathy that would help build intuition around them: https://www.youtube.com/watch?v=tqrcjHuNdmQ
PPO and GAE are also very helpful to know, but are somewhat more advanced concepts. They're also covered in the Dota2 paper. (There's also a small policy-gradient sketch right after this list.)
3. The concept of "intrinsic motivation" in Reinforcement Learning is quite important, because this paper's approach is presented as a superior alternative to that. A superficial understanding would be helpful.
4. Understanding "value functions" is also helpful — for a very beginner-friendly explanation, consider one of my own writings here: https://aman-agarwal.com/2018/03/10/explained-simply-how-an-ai-program-mastered-the-ancient-game-of-go/
5. There are a lot of other smaller technical concepts referred to throughout the paper (masked residual self-attention, entity embeddings, etc.) that would be helpful to know about; if you don't see a helpful, BRIEF explanation in the discussion thread, consider being a good Samaritan and contributing one!
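To make item 2 a bit more concrete, here's a minimal REINFORCE-style policy-gradient sketch. This is purely my own illustration for building intuition, NOT the paper's PPO + GAE setup; it assumes gymnasium and PyTorch are installed and uses CartPole as a toy environment:

```python
# Minimal REINFORCE-style policy gradient (toy sketch, for intuition only).
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return for each timestep (gamma = 0.99), then normalized.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Core of the policy gradient: make actions that led to high returns more likely.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The whole idea lives in the last few lines: nudge the policy so that actions which led to high returns become more probable. PPO and GAE refine this recipe (clipped updates, better advantage estimates) but keep the same basic intuition.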
For anyone who still doesn't quite get what "Autocurricula" means: it's short for "autonomous curriculum learning," which basically involves creating an adaptive and self-evolving curriculum for training an agent (such as a robot or an AI system).
With autocurricula, you allow the machine to continuously assess its own performance and adjust the difficulty and focus of its learning tasks accordingly. It's like having a personalized learning plan that adapts to the machine's current abilities and challenges.
By dynamically adjusting/creating the curriculum, the machine can optimize its learning process and focus on areas that are challenging but within its reach. This helps the machine learn more effectively and efficiently.
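To make that "adaptive difficulty" idea concrete, here's a toy sketch of a hand-written curriculum controller (purely my illustration, not something from the paper; train_episode is a hypothetical placeholder for one training episode). The paper's point, of course, is that in multi-agent competition the opposing team plays this controller's role automatically:

```python
# Toy autocurriculum loop (illustrative only).
# train_episode(difficulty) is a hypothetical stand-in for one RL training episode;
# it returns True if the agent succeeded at the current difficulty level.
from collections import deque
import random

def train_episode(difficulty: int) -> bool:
    return random.random() > 0.1 * difficulty  # placeholder: harder levels succeed less often

difficulty = 1
recent = deque(maxlen=100)  # rolling window of recent outcomes

for episode in range(10_000):
    recent.append(train_episode(difficulty))
    if len(recent) < recent.maxlen:
        continue
    success_rate = sum(recent) / len(recent)
    if success_rate > 0.8:        # mastered this level: make the task harder
        difficulty += 1
        recent.clear()
    elif success_rate < 0.2 and difficulty > 1:  # struggling badly: back off
        difficulty -= 1
        recent.clear()
```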
The authors convey here that training AI agents using self-play / competition isn't new; what's new in this paper is doing the same in PHYSICAL settings, not just video games.
RL techniques *have* proven to work well in the real world, in some research by OpenAI themselves. :)
Take a look (we may curate some of these papers in the near future if there is demand):
1. Solving Rubik's Cube with a robot hand — https://openai.com/research/solving-rubiks-cube
2. Learning dexterity — https://openai.com/research/learning-dexterity
Hmmm, not sure to what extent I agree with the last statement — that in single-agent RL settings, there's an "inherent bound" caused by the task description and that there's little room to improve.
The only difference between those settings and this paper's self-play / competition is that there the environment keeps changing and getting gradually more "difficult" to solve the task at hand, which is similar to the role played by the competing agents in this setting. And even here (as you'll see later), they had to engineer the reward functions too (such as giving a -10 penalty if the agents leave the core playing area).
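For reference, that kind of reward shaping might look roughly like this (a hedged sketch based only on the paper's verbal description of the penalty; the circular arena geometry and the parameter names here are my own assumptions):

```python
# Hedged sketch of the "-10 if you leave the play area" style of reward shaping.
# Geometry and names are illustrative assumptions, not the paper's actual code.
def shaped_reward(team_reward: float, agent_xy, arena_radius: float = 10.0) -> float:
    x, y = agent_xy
    outside = (x * x + y * y) ** 0.5 > arena_radius
    return team_reward - 10.0 if outside else team_reward
```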
So as the agents start learning to play the game from scratch, they figure out new strategies one by one. Each team is trying to outdo the other, which leads to the other team developing counter-strategies.
This image basically shows the different strategies and counter-strategies the agents come up with.
A, C, and E represent strategies used by SEEKERS to win. B, D, and F represent counter-strategies developed by HIDERS. It's fascinating to watch!
Quick note here on "object permanence": when you show the child a toy and then hide it, does the child think it has ceased to exist, or that it's just... hidden?
The latter is due to object permanence, which is a learned behaviour. It's the ability to know that an object continues to exist even if it can't be seen or sensed anymore. :)
So their main claim here, again, is that instead of directly telling the agents to interact with tools (through incentives), they just let the objects exist in the environment, and have the agents autonomously explore whether the tools could be advantageous in helping them win the game.
THIS is one of the main steps forward represented by this paper.
So, the last two sentences point to one of the big "what's next" ideas of this paper: create ML techniques by which agents learn to "reuse" skills they've learned in one environment in another.
I guess the question here is, how do we build AI models such that the skills being learned are "disentangled" from the task at hand? I.e., if the agent gets a reward for pouring wine into a glass without spilling, it should do so by developing independent skills of grabbing and picking up the bottle, holding its mouth at the right height above the glass, noticing when the glass is full, etc.
Such a cool ML problem.
So, now they're trying to see if learning to play hide and seek made the agents "more capable" at other tasks.
It's somewhat analogous to asking: if a kid plays soccer and becomes good at it, does that help improve their memory, cognition, planning, and so on? After all, the goal here is not just to make the agents better at playing hide-n-seek, but to use the game as a means to an end.
(P.S. Reminds me of that documentary where they tested Cristiano Ronaldo's mental skills: https://youtu.be/t03LHpeWnpA)
This paragraph is a bit confusing to me. When they say they're trying to "evaluate the learned behaviours," it's a little vague as to what they mean.
The 4th line from the end seems to carry a hint: "...whether improved performance stems from new adaptations or improving previously learned skills." I guess they're just talking about evaluation in general, or trying to understand the REASON WHY certain behaviours are being learned, and to find quantitative patterns that reflect these improvements.
So (I think) they're saying that the strategies developed by the agents are a direct result of finding new ways of using the tools.
Not sure if there's something insightful here that I'm missing, but to demonstrate that correlation, they observe that the agents "play" with objects a lot more (i.e. moving them around more) while trying to develop new strategies.
Perhaps someone could explain if there's anything more to this?
Let me explain the concept of "Intrinsic Motivation" a little.
It's essentially the opposite of competition-based motivation.
Let's say you have a robot inside a maze. The robot gets a reward of +1 if it makes it to the other end.
Naturally, to solve the maze, the robot's algorithm has to make it "explore" the maze. And "exploring" by default means going to places that the robot hasn't been to before.
Since the robot doesn't get a real reward merely for exploring, you have to give it some "intrinsic" incentive for doing so, which has nothing to do with the rules of the maze, but is programmed into the robot's algorithm.
This is the basis of "intrinsic motivation" methods — and there are a variety of ways you can implement it.
A couple of them are mentioned here: count-based exploration, density estimators, etc.
The problem with intrinsic methods, which is discussed later in the paper, is that they require quite a bit of domain knowledge of the game in order to properly program the incentives for "exploring" it. (Let me know if this point isn't very clear, and I'll add a longer explanation.)
Count-based exploration is a method to encourage an agent to explore its environment by keeping track of how many times it has encountered different states or actions.
Here's how it works:
1. Counting visits: The agent keeps track of how many times it has visited each state or taken each action. It maintains a count or record of these visits.
2. Exploration bonus: When the agent explores a new state or takes a new action, it receives an extra "bonus" or reward based on how unfamiliar or rare that state or action is. This encourages the agent to try out new things, increasing its exploration.
3. Decision-making: When it's time to decide what action to take, the agent considers both the expected rewards from previous experiences and the exploration bonus based on the counts. The exploration bonus helps guide the agent towards less-frequently visited states or actions, promoting exploration.
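Here's a minimal tabular sketch of that bonus (my own illustration; the 1/sqrt(count) form and the bonus weight are common choices, not necessarily the exact ones used in the cited work):

```python
# Count-based exploration bonus (tabular sketch, illustrative only).
from collections import defaultdict
import math

visit_counts = defaultdict(int)
BETA = 0.1  # weight of the exploration bonus (hypothetical value)

def exploration_bonus(state) -> float:
    """Bonus that shrinks as a state becomes familiar."""
    visit_counts[state] += 1
    return BETA / math.sqrt(visit_counts[state])

def augmented_reward(env_reward: float, state) -> float:
    # The agent optimizes the environment reward PLUS the intrinsic bonus,
    # so rarely-visited states temporarily look more attractive.
    return env_reward + exploration_bonus(state)
```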
"Causal Influence" here refers to the impact that one agent's actions or behaviors have on other agents within the system. It falls under the umbrella of intrinsic motivation methods.
(It's also called "Social influence" in this research by DeepMind that's cited here as "Jaques et al.": https://arxiv.org/abs/1810.08647)
Basically, an agent is independently rewarded (or penalized) for taking actions that are likely to have a bigger impact/influence on the behaviour of OTHER agents in the environment.
By forcing agents to learn/predict the impacts of their actions on others, they learn a model of the other agents. And this apparently works as well as if the agents were mutually sharing information and coordinating as a centralized team.
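As a rough sketch of how such an influence reward can be computed (my simplification of the Jaques et al. idea, assuming we somehow have access to agent j's action distribution conditioned on each of agent k's possible actions):

```python
# Simplified "social influence" reward: how much does agent k's chosen action
# shift agent j's policy, compared to a counterfactual marginal over k's actions?
import numpy as np

def influence_reward(p_j_given_ak: np.ndarray, p_ak: np.ndarray) -> float:
    """
    p_j_given_ak: [n_actions_k, n_actions_j]; row a_k is agent j's action
                  distribution given that agent k took action a_k.
    p_ak:         [n_actions_k]; agent k's own action distribution.
    Returns the expected KL divergence between conditional and marginal.
    """
    eps = 1e-12
    marginal = p_ak @ p_j_given_ak  # counterfactual: j's behaviour averaged over k's actions
    kl_per_action = np.sum(
        p_j_given_ak * np.log((p_j_given_ak + eps) / (marginal + eps)), axis=1
    )
    return float(p_ak @ kl_per_action)

# If j's behaviour strongly depends on what k does, the reward is large;
# if j ignores k entirely, it is ~0.
p_j_given_ak = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
p_ak = np.array([0.5, 0.5])
print(influence_reward(p_j_given_ak, p_ak))  # ~0.28
```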
Forward Dynamics Model: In transition-based methods, the agent learns a forward dynamics model. This model predicts what will happen next in the environment based on the current state and the action taken. It tries to understand how the environment changes from one state to another.
Inverse Dynamics Model: The agent also learns an inverse dynamics model. This model predicts the action taken by the agent based on the current and next states. It tries to understand the relationship between states and the actions required to reach those states.
Prediction Error: During training, the agent compares the predictions made by the forward or inverse dynamics models with the actual outcomes or actions. The difference between the predicted and actual values is called the prediction error.
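Putting those three pieces together, here's a simplified curiosity-style sketch where the forward model's prediction error itself serves as the intrinsic reward (assumes PyTorch, continuous state/action vectors, and hypothetical dimensions):

```python
# Forward-dynamics prediction error as an intrinsic reward (simplified sketch).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2  # hypothetical dimensions

# Forward model: predicts the next state from (state, action).
forward_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
    nn.Linear(64, STATE_DIM),
)
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward_and_update(state, action, next_state) -> float:
    """Use the prediction error as a curiosity bonus, then train the model on it."""
    pred_next = forward_model(torch.cat([state, action], dim=-1))
    error = ((pred_next - next_state) ** 2).mean()  # the prediction error
    optimizer.zero_grad()
    error.backward()   # the model improves on familiar transitions over time,
    optimizer.step()   # so their bonus shrinks; surprising transitions stay rewarding
    return error.item()

# Usage with dummy tensors:
s, a, s2 = torch.randn(STATE_DIM), torch.randn(ACTION_DIM), torch.randn(STATE_DIM)
bonus = intrinsic_reward_and_update(s, a, s2)
```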