Quick note: a "behavioral prior" refers to prior knowledge or a bias incorporated into a learning algorithm that guides the model towards a desired behavior or task.
It is used to shape the model's learning process and influence its decision-making based on certain predefined preferences or constraints.
By incorporating prior knowledge or biases, the learning algorithm can benefit from existing expert knowledge or desired behaviors, leading to more efficient or reliable learning.
Behavioral priors are often used to address challenges such as data scarcity, exploration-exploitation trade-offs, or safety considerations.
In this paper, the authors show (later) that simply training the model from scratch (using Reinforcement Learning) to play Minecraft doesn't work. But if you start RL with some knowledge of the game already in the model, it does surprisingly well! So that's why they call it a "behavioural prior."
"Zero-shot capability" refers to the ability of a model or system to perform a task or generalize its knowledge to new examples or domains without any specific training or prior exposure to those exam...ples or domains.
It means that the model can accomplish a task even if it has never seen or been trained on similar examples before.
The term "zero-shot" indicates that the model is capable of achieving this without requiring any additional training or fine-tuning.
To develop more intuition about this, read the first half of this excellent article by "Ekin Tiu" from Stanford: https://towardsdatascience.com/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab
Alrighty, looks like an overview here would be helpful. :)
There are a few key takeaways in this paragraph:
1. How semi-supervised learning methods have *typically* been used, and the problems with them
2. The key hypothesis of the authors (how they propose to solve the problems and do better)
3. The reason WHY they believe this hypothesis is sound, and will likely work.
4. The statement that it DID work :)
---
Now let's dissect them one by one. The main technical concepts here are:
1) Imitation Learning
2) Inverse Dynamics Models (IDM)
3) Behavioural Cloning (BC)
Imitation learning is actually a family of methods that involves learning from expert demonstrations, i.e. imitating the expert. It's a somewhat broad term that includes both supervised and unsupervised methods.
"Behavioural cloning" is a very specific form of imitation learning, where a model is trained to *replicate* an expert's behavior — given a state of the game, the model's goal is to take the same action that the expert would have.
For BC, the training inputs are: 1) snapshots from the game/environment, and 2) the exact action that the expert took in that state. Then you simply train the model in a good-old supervised fashion. Over time, given enough data, the model learns to replicate/clone the expert's patterns.
The challenge with BC is in actually *having* enough labeled data to train the models.
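To make the BC setup concrete, here's a minimal sketch of behavioural cloning as plain supervised classification. Everything here (the tiny network, the frame size, the discrete action count) is a placeholder of my own, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

# Toy setup: observations are small RGB frames, actions are one of NUM_ACTIONS discrete choices.
NUM_ACTIONS = 10

policy = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),  # logits over the action space
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # plain supervised classification loss

def bc_training_step(frames, expert_actions):
    """frames: (B, 3, 64, 64) game snapshots; expert_actions: (B,) actions the expert took."""
    logits = policy(frames)
    loss = loss_fn(logits, expert_actions)  # "predict the action the expert took in this state"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake batch, just to show the shapes involved.
print(bc_training_step(torch.randn(8, 3, 64, 64), torch.randint(0, NUM_ACTIONS, (8,))))
```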
So we come to the "Inverse Dynamics Model" — it's like learning the "physics" and "rules" of the game/environment. It captures information like, "if you do X in the game, Y happens."
IDM captures the following info: given any transition in the game (eg: player moved forward), it tells you which action was taken, that caused the transition (i.e. the gamer probably pressed the "forward" key).
These mechanics are fairly consistent in a game, regardless of whether the game is being played by an expert or a noob. And once you understand these controls and how they affect the game, it opens up the ability to learn from watching unlabelled video game footage online!
So this paper uses IDM models as a clever workaround to get over the data limitation of BC. Here's how.
Once you have a model that knows the game's internal dynamics and controls, it can be used to LABEL millions of hours of UNLABELED gameplay footage on Youtube (i.e. seeing a transition, it guesses which controls the players used).
This then becomes a beautiful labeled dataset that you can train a BC model on.
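In other words, the pipeline described above looks roughly like this (a pseudocode sketch; all the function and variable names are mine, with every helper stubbed out):

```python
# Pseudocode sketch of the IDM -> pseudo-labeling -> BC pipeline described above.

def train_idm(labeled_clips):
    """Train the inverse dynamics model on clips where the pressed keys/mouse actions are known."""
    ...

def label_with_idm(idm, raw_videos):
    """Run the IDM over unlabeled footage: for each frame transition, guess the action behind it."""
    return []

def train_bc(pseudo_labeled_videos):
    """Ordinary behavioural cloning on the now (pseudo-)labeled footage."""
    ...

small_labeled_dataset = []   # the (relatively small) labeled portion
youtube_footage = []         # the huge pile of unlabeled gameplay videos

idm = train_idm(small_labeled_dataset)
pseudo_labels = label_with_idm(idm, youtube_footage)   # IDM turns unlabeled video into labeled data
bc_model = train_bc(pseudo_labels)                     # BC model trains on it as if it were expert data
```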
Training the IDM is a "non-causal" task — to simply see that the player moved forward, you have to see both his initial (past) and final (future) position.
In contrast, BC training is "causal" — because there, your goal is to predict WHAT a human player would do if they were in that initial state. They may decide to move forward, or turn right or left, or jump, or whatever. Here you're trying to learn the "intent" of the expert player.
Hope it made sense?
So, this is going into the weeds of how the IDM and BC models work.
Recall that the BC model works like this:
Input: a given game STATE
Output: the ACTION that the expert (whose behaviour is being cloned) would have taken
However, the IDM works like this:
Input: A given game state, and the NEXT game state
Output: the action that *was probably taken* such that the game moved from the first state to the next
In an ideal case, the IDM, when given a series of game transitions, can tell you *exactly* the controls/actions used that caused those transitions to happen.
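In code terms, the only difference is what each model is allowed to look at. These are purely illustrative stubs (the names are mine, not the paper's):

```python
# Illustrative stubs only; "frames" are game observations.

def bc_policy(frames_up_to_now):
    """Behavioural cloning model (causal): sees only the past and present,
    and must guess what the expert would do NEXT."""
    ...

def inverse_dynamics_model(frames_before, frames_after):
    """IDM (non-causal): also sees what happened AFTERWARDS, and only has to
    infer which action connects the 'before' and 'after' states."""
    ...
```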
Now I'll try to explain what Torabi et al.'s work attempted to do, what its limitation was, and how OpenAI tried to address it:
They took a randomly initialized BC model (not trained), and let it play around in the game on its own. At each state, it threw out a random action, and saw the result of that action.
They used these observations to train an IDM model to understand the mechanics/rules of the game.
Then, they used this IDM model to label human gameplay data (without labels) — the model watched the game transitions and learned the likely actions taken to make them happen.
Then they trained the BC model further on that data — and put it back in the environment to collect even more experience, to further improve the IDM's understanding of the game's control.
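As a rough pseudocode sketch of that loop (my own reconstruction of the description above, with every helper stubbed out; not Torabi et al.'s actual code):

```python
def init_random_policy(): ...
def collect_rollouts(env, policy): return []       # (state, action, next_state) from the policy's own play
def train_idm(rollouts): ...
def label_with_idm(idm, videos): return []          # human demos with guessed actions attached
def train_bc(policy, labeled_demos): return policy

env, human_demo_videos = None, []

policy = init_random_policy()
for _ in range(3):                                       # iterate a few rounds
    rollouts = collect_rollouts(env, policy)             # 1. explore with the current (initially random) policy
    idm = train_idm(rollouts)                            # 2. learn the game's mechanics from that self-labeled play
    pseudo_demos = label_with_idm(idm, human_demo_videos)  # 3. guess the actions behind unlabeled human demos
    policy = train_bc(policy, pseudo_demos)              # 4. clone the humans, then loop to gather better data
```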
--
OpenAI says that the challenge with this is that, let's say the IDM isn't good enough yet to accurately predict how a certain human player executed a particular move in the game — it guesses the wrong moves.
Now, how do you fix that during training? That's really hard, because the BC model (whose exploration creates data for the IDM) has to attempt to execute that same move on its own, so that the IDM can learn that it doesn't work! (i.e. the guessed actions are wrong)
Does this make sense so far?
To solve this, they simply use human contractors to label the initial data (to train a really good IDM), instead of starting with a random BC model like they did above.
But I'd be curious if there was a way to achieve this without throwing money at getting data labeled by human contractors. My gut says that we could change something (like the BC model's exploitation-exploration function etc) to do that, but I'm not 100% sure of course.
I shall attempt to explain, in simple words, the IDM notation P(a.t | o.t, o.t+1) in the 3rd line:
The goal of the "inverse dynamics model" (IDM) is that given an *observation of what happens* in the game, it predicts *which action must have been taken* to cause it.
The "observation of what happens" includes a two snapshots: at time t, and t+1.
So for example, let's say a game character moves 1 step forward. For the *observation*, we record 2 states: before the step and after the step, i.e. o.t and o.t+1
The action to be predicted was taken at time t — a.t
So, we're predicting a.t *given* o.t, o.t+1, which is denoted as:
a.t | o.t, o.t+1
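Written as a training objective (this is a simplified two-frame form matching the notation above; the paper's actual IDM looks at a longer window of past and future frames):

```latex
% Maximize the probability of the action actually taken, given the "before" and "after" frames.
\max_{\phi} \;\; \sum_{t} \log p_{\phi}\left(a_t \mid o_t,\, o_{t+1}\right)
```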
A beautiful illustration of the entire training process!
As we see in the IDM training (pink), both the past and future states of the game go as inputs into training, which makes it "non-causal."
I've never played Minecraft, so it's not obvious to me why the model showed new behaviours like "crafting wooden tools" when fine-tuned on videos of people building houses, but not in the early stages of the game when it's just crafting tables etc.
Hope someone could shed light on this with context?
Appendix I (although not snipped here, it is available in the full paper linked to at the top of this page) is a really worthwhile read. Since the current models couldn't be "asked" to do certain things (like crafting a table), and only did things on their own, as is expected in Behavioural Cloning, they tried to train the model with some additional language input to see if the agent would learn to follow instructions.
They basically took a subset of the gameplay video data, in which the human players were declaring their intentions etc (filtering it down from 70k hours to 17k hours), and then fed that as an additional input.
The results are not very concrete/convincing (yet), so they buried them deep in the appendix, but very cool and worth looking at!
"Semi-supervised" is a slightly confusing term. I will explain.
In "supervised" learning, you have labels for every single data sample. X –> Y.
In "semi-supervised" learning, you have a lot of da...ta samples with NO labels, but you also have some samples which DO have labels.
You use a combination of the two to train your models. Of course, it would be ideal if we had labels for everything, but often that's not available (or too expensive).
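A toy illustration of what the two settings look like, with made-up filenames and labels:

```python
# Supervised: every sample has a label (X -> Y).
supervised_data = [
    ("frame_0001.png", "press_W"),
    ("frame_0002.png", "click_left"),
]

# Semi-supervised: a small labeled set plus a much larger unlabeled one.
labeled_data = [
    ("frame_0001.png", "press_W"),
]
unlabeled_data = [
    "frame_0002.png",
    "frame_0003.png",
    # ...and millions more frames with no action labels attached
]
```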
The rest of the paper walks you through how THEY used this "semi-supervised" learning.
"Voxel-based" basically means that instead of polygon-based graphics (which allow you to create very detailed shapes and curves), voxels are more like "blocks."
The last line caught my attention: 720 V100 GPUs? I checked the price of a V100 on Amazon, and it's around $4,685.
This means they had *at least* $3.2 million worth of GPUs in-house for a single experiment! (I don't believe they use cloud infrastructure)
Add to that the electricity bill of running those GPUs for 9 days. Deep Learning is indeed an expensive hobby. :)
Here's the difference between using the native human GUI and the "crafting and attacking macros" used by prior work in Minecraft.
The former is self-explanatory: the AI has to learn to play by outputting exact keyboard and mouse commands at 20 fps, like humans.
The latter refers to "simplifying" the game interface into a set of higher-level commands. So for instance, instead of using the mouse and keyboard to chop wood in a game, you simply give the MACRO command "chop wood" and it proceeds to use the controller in the required way.
When you train a model to choose such high-level actions, it's obviously much simpler than training it to actually play the game like a human!
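To make the contrast concrete, here's an invented example of what one timestep's action could look like in each interface (the specific keys and macro names are mine, not from the paper):

```python
# Native human interface: at 20 steps per second, the agent emits raw keyboard + mouse events.
native_action = {
    "keys": {"W": True, "A": False, "S": False, "D": False, "space": False},
    "mouse": {"dx": 12, "dy": -3, "left_click": False, "right_click": True},
}

# Macro interface used by much prior work: one high-level command stands in for the
# whole sequence of low-level inputs needed to accomplish it.
macro_action = "chop_wood"
```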
I explained the intuition behind "imitation learning" and "causal" policies earlier (in the 3rd paragraph), but here's the authors giving a more formal description of what it is.
Alright, let me explain the "Kullback-Leibler" (KL) divergence loss in simple words.
The problem being faced here is that during reinforcement learning, the model starts to "forget" what it learned early on.
So in a way, the model continues to "diverge" from its original self while being fine-tuned. Our goal is to reduce this divergence.
How do we do that? Well, if we had a way to *quantify* how much the model has diverged, we could try to minimize that quantity during training.
Kullback-Leibler is, for our purposes, one way to quantify this divergence. Given two probability distributions, it tells you how much they differ from each other. It comes from Information Theory, a technical field that I personally don't know much about (even though I majored in communications engineering in college; one of the many mistakes of my life).
So while training, we keep calculating the KL divergence between the model's most current version and its initial version, and try to minimize that. Of course, this isn't the primary quantity being minimized! (In that case, the model won't learn anything new at all.) That's why it's mentioned as an "auxiliary" loss function.
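Here's a minimal sketch of what such an auxiliary KL term can look like in code. This is the general idea only, assuming a discrete action space and a frozen copy of the pretrained policy; it's not the paper's exact implementation or coefficients:

```python
import torch.nn.functional as F

def loss_with_kl_anchor(rl_loss, current_logits, pretrained_logits, kl_coef=0.1):
    """Add a penalty for drifting away from the frozen, pretrained policy.

    rl_loss:            the usual RL objective being minimized (e.g. a policy-gradient loss)
    current_logits:     action logits from the policy being fine-tuned
    pretrained_logits:  action logits from a frozen copy of the pretrained (BC) policy
    kl_coef:            how strongly to pull the policy back toward its initial self
    """
    current_log_probs = F.log_softmax(current_logits, dim=-1)
    pretrained_probs = F.softmax(pretrained_logits, dim=-1)
    # KL(pretrained || current): how far the fine-tuned policy has diverged from where it started.
    kl = F.kl_div(current_log_probs, pretrained_probs, reduction="batchmean")
    return rl_loss + kl_coef * kl  # the KL term is "auxiliary": added on top of the main loss
```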
It's a delicate dance of wanting the model to improve, but not grow into another direction and forget its initial learnings completely.
This kind of problem is more serious in very open-ended exploration-based games like MineCraft. If the game environment was very simple with not much to explore, I doubt this would be a major problem.
Hope it helps! :)
Here's a key point (in the first line). I haven't played Minecraft so not sure how rewards/points typically work in the game? Do you get points for building things like tables or wooden sticks?
Often in real RL experiments, we mainly reward the agent for "winning or losing," or for whatever the game's inherent points system measures. But that sort of training usually takes a loooong time (especially for complex games), so to make it easier, you can give the agent more frequent rewards to "guide" the training process. In this case, for instance, while crafting a "diamond pickaxe," the agent gets rewarded for every item obtained in the sequence.
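As a toy sketch of that kind of reward shaping (the milestone items and reward values below are invented for illustration, not the paper's actual schedule):

```python
# Reward the agent the first time it obtains each intermediate item on the way to the
# final goal, instead of only rewarding the final goal itself.
MILESTONE_REWARDS = {
    "log": 1.0,
    "planks": 2.0,
    "crafting_table": 4.0,
    "wooden_pickaxe": 8.0,
    "diamond_pickaxe": 1024.0,
}

def shaped_reward(items_obtained_this_step, already_rewarded):
    """Return a reward for newly obtained milestone items, each counted only once."""
    reward = 0.0
    for item in items_obtained_this_step:
        if item in MILESTONE_REWARDS and item not in already_rewarded:
            reward += MILESTONE_REWARDS[item]
            already_rewarded.add(item)
    return reward

# Example: the agent just collected its first log and some dirt.
print(shaped_reward(["log", "dirt"], already_rewarded=set()))  # -> 1.0
```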
Overall it feels like they've made quite a few tweaks to make the RL portion of this experiment simpler.
But I believe that's okay, because the purpose of this research paper was not to "prove how they used RL to solve Minecraft," but rather to focus on the IDM to "unlock" the bulk of free gameplay data available on the internet. The RL fine-tuning is presented more like an afterthought.
So, this is interesting. Again, a little knowledge of Minecraft would probably be helpful here, but it's strange that, even with lots of ways to get a reward (as mentioned previously, the agent gets points for every item it crafts on the way to a diamond pickaxe, not just for the final product), the agent doesn't achieve *anything* with plain RL from scratch!
I'm wondering what would happen if they put a LOT more resources into solving it with RL, and trained the model like they did in their Hide-and-Seek paper (https://openai.com/research/emergent-tool-use)
That paper is also on DenseLayers btw: https://denselayers.com/paper/emergent-tool-use-from-multi-agent-autocurricula
A quick explanation of "negative log-likelihood" for those not familiar with the math:
You probably understand by now that, in training the IDM model, we want it to learn to predict the action label that corresponds to a given game transition.
Eg: if the character moves forward, the action was probably the keyboard command for moving forward.
In this example, we want to *maximize* the probability that the model predicts "move forward."
In the math, we capture this probability as a "log-likelihood," which refers to a logarithmic function (you don't quite need to know that to understand this paper, but feel free to look it up).
So, we're trying to maximize log-likelihood. But since in ML training algorithms, we can't actually maximize things but only minimize them, we just take the *negative* of the value we're trying to maximize, and minimize that!
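Here's that idea with made-up numbers, just to show the mechanics:

```python
import math

# Suppose the model assigns these probabilities to the possible actions for one transition:
predicted = {"move_forward": 0.7, "turn_left": 0.2, "jump": 0.1}
true_action = "move_forward"

likelihood = predicted[true_action]        # 0.7  -- we want this to be as high as possible
log_likelihood = math.log(likelihood)      # ~ -0.36  (higher is better)
negative_log_likelihood = -log_likelihood  # ~  0.36  (lower is better, so we minimize this)
print(negative_log_likelihood)
```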
Simple? :)
Regarding the "cleaning" of data, that filters down the dataset from 270K hours to only 70K (meaning 75% of the data was discarded!), I wonder if it would be more efficient to train a model like this ...first, and then randomly sample the internet videos to make sure only clean ones got downloaded in the first place!
This would probably get them much more data than doing it sequentially in two steps.
Quick math point!
First, don't be scared by this expression. :) It's fairly straightforward. If you spend some time with it, it will make a lot of sense.
One thing to note/observe is that in the first half of the expression (a.t | o.1, ..., o.t), the last "t" is lowercase, but at the END of that line, the "T" is uppercase.
That's because the a.t is taken from the IDM, which is trained on both past (t=1), present (t=t) and future (t=T) states, while the "Foundation Model" (basically the behavioural cloning model) is only trained on the past and present states. :)
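Written out roughly in that notation (my reconstruction of the expression's shape, not a verbatim quote from the paper):

```latex
% Foundation/BC model term (causal): conditions only on observations up to time t.
% IDM pseudo-label (non-causal): conditions on the whole clip, o_1 through o_T.
\min_{\theta} \; \sum_{t} -\log p_{\theta}\left(\hat{a}_t \mid o_1, \dots, o_t\right),
\qquad \hat{a}_t \sim p_{\mathrm{IDM}}\left(a_t \mid o_1, \dots, o_T\right)
```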
This just goes to underscore how much easier it is to train an IDM (i.e. capture the game's basic underlying mechanics and rules), than to learn a "policy" for playing it well from scratch.
Wow! The key information here is the TWO ORDERS OF MAGNITUDE gain in data efficiency when you use an IDM to learn the game first, and then train the BC model on that.
This is the evidence for their main hypothesis in this paper, that IDMs are more data efficient.
Oh, so the VPT model, after being trained on ~70,000 hours of gameplay video, obviously doesn't match up to the performance of amateur human players. (Technically, the humans craft 28 times the number of tables that the model does.)
I wonder if that's simply because behavioural cloning by itself is not very promising in terms of achieving amazing performance (except for domains where you can't use RL no matter what, such as autonomous vehicles etc).
But then RL takes a lot more training time, so there's the trade-off.
These results somewhat remind me of DeepMind's original AlphaGo paper, where they too used a form of imitation learning as their foundational policy model, and then improved it with self-play / RL.
But the interesting thing is that in their next research stage, for AlphaGoZero, they skipped the imitation learning altogether and went with only RL/self-play, and managed to achieve even better performance with that!
Again, I haven't played Minecraft, but it would probably be helpful in understanding why these results look like this. :)
That it took multiple researchers over a YEAR of FULL-TIME work to put this paper together is unsurprising, given the depth and breadth of the experiment!