Max Schwarzer
TalkRL podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at talkRL podcast. I'm your host, Robin Chauhan. Max Schwarzer is a PhD student at Mila with Aaron Courville and Marc Bellemare. He's interested in RL scaling, representation learning for RL, and RL for science.
Robin:Max has spent the last year and a half at Google Brain and is now at Apple MLR. Thanks for joining us today, Max.
Max:Thanks for having me, Robin. I'm really excited to be on the podcast.
Robin:You had state of the art results with your BBF agent that I understand is gonna be at ICML 2023. Is that right?
Max:Yep. Exactly. Presenting it in, July, this year.
Robin:So I understand that with that agent you got state-of-the-art results in terms of performance on Atari 100k, which is the small-sample Atari benchmark, probably with implications for all sorts of small-sample environments. We're gonna focus on that in depth in just a couple minutes. But we're gonna start with a little bit more philosophy from your website. Something I saw on your website: you mentioned your research philosophy, which I found interesting. You said, I believe the best way to advance our understanding of complex systems like large language models and reinforcement learning is through rigorous experimentation.
Robin:RL has long struggled with the combination of sloppy, underpowered empirical practices and an ideological attachment to theoretically tractable methods that don't perform competitively outside of small regimes. Ouch. And then I see you were a co-author on the 2021 paper which talks about this issue in more depth. That's Deep Reinforcement Learning at the Edge of the Statistical Precipice.
Max:Yep. Exactly.
Robin:By Agarwal et al., with yourself as a co-author. I definitely remember that paper and hearing about it at the conference. How has that been received? Do you feel that our research has grown up a bit since that paper came out, with changes in how people present evaluations and things like that?
Max:I definitely think it's made a dent. Yeah. There was a lot of stuff you saw back in 2018, 2019, 2020 that we, in a lot of cases, specifically pointed at in the Statistical Precipice paper. You know, people doing experiments with minuscule numbers of seeds, not reporting any confidence intervals or doing any testing, just a two-seed point estimate; things like designing experiments in ways where you snuck a max over seeds into your experimental protocol. I really haven't seen that as much lately. So I do feel like we've made a significant improvement.
Max:You know, we released an evaluation library with that paper called rliable, r-l-i-a-b-l-e, which I see in about a third to a half of RL papers, which is honestly really, really good penetration compared to some of the other efforts to clean RL up, like Deep RL that Matters, where everybody read the paper and they were like, yeah, we should probably fix this, and then didn't do anything for 3 more years. So I am pretty happy about how the field has matured. I also feel like our research has grown up a little in that, and I think we'll probably get to this a little bit later on in the podcast too, but back in sort of 2019, 2020, it felt like people were often fighting over, frankly, really small performance improvements because we had a much narrower sense of what things worked.
Max:We weren't as interested in scaling, and a lot of ideas that have turned out to be incredibly impactful for performance, like looking at plasticity loss, network resets, etcetera, just didn't exist back then. So simultaneously, I feel like effect sizes have gotten larger, which makes it easier to do good science because you're less likely to get an incorrect result, and we've started doing our science better from a pure methods standpoint too. So I'm happy with how the field has evolved.
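For reference, typical usage of the rliable library looks something like the sketch below: compute an aggregate metric such as the interquartile mean (IQM) over runs and games, with bootstrapped confidence intervals rather than point estimates. The agent names and score arrays here are placeholders, not real results.

    import numpy as np
    from rliable import library as rly
    from rliable import metrics

    # One entry per algorithm: array of shape (num_runs, num_games)
    # of human-normalized scores.
    score_dict = {
        "AgentA": np.random.rand(10, 26),
        "AgentB": np.random.rand(10, 26),
    }

    # Aggregate with IQM and report bootstrapped confidence intervals.
    iqm = lambda scores: np.array([metrics.aggregate_iqm(scores)])
    point_estimates, interval_estimates = rly.get_interval_estimates(
        score_dict, iqm, reps=2000)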
Robin:Awesome. Yeah. That is incredible market penetration with that library. I'll have to check it out, and we'll have the link to it on the episode page.
Max:Awesome.
Robin:With your BBF agent
Max:Bigger, better, faster.
Robin:You got really exciting results on this Atari 100k benchmark. And 100k is a tiny number of samples, especially compared to the original DQN back in the day, which was way up in the millions.
Max:Yep. Exactly.
Robin:And with some agents, like the curiosity agents, going up to billions of samples. So I understand there's about 2 hours of play on each game, which is somehow comparable to what a human would take?
Max:Exactly. So you'll see in a lot of Atari 100k papers the claim that 2 hours is the same amount of time as a human, and what they're really referring to is the human scores that were released by those early DeepMind DQN papers, the original DQN and double DQN, which, as far as I know, essentially sat a couple of people down in a room, had them play an Atari game for a couple of hours, and then recorded their scores at the end. Now, my co-adviser, Marc Bellemare, was around at DeepMind in that era and has told me that the experimental protocol was not necessarily exactly that clean. But at least on paper, we can say that if we beat the human score from those papers with 2 hours or less of gameplay, it's pretty reasonable to say that we're learning the game as fast as or faster than humans did.
Robin:That is remarkable. So was the research philosophy you mentioned earlier important in developing your BBF agent?
Max:Absolutely. Yeah. A lot of the stuff in the Statistical Precipice paper came out of research practices that I had been developing in papers that came before BBF. Things like running lots of random seeds and using bootstrapping to estimate confidence intervals: I was doing that for my own hyperparameter tuning before we put it in the paper.
Max:And it turns out it's extremely important to do that sort of thing. If you don't, you can end up reaching an incorrect conclusion and spending 3 months chasing down a rabbit hole because you got lucky with 2 random seeds once and thought that your DQN should be upside down or something like that. The variance is so high that if you're not running your experimentation carefully and using good statistical practices, you can very, very easily reach wrong conclusions. When that happens in a published paper, it's a catastrophe for science. When that happens in your own research process, it's a catastrophe for you personally, but it's still bad.
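The seed-level bootstrapping Max describes can be as simple as the sketch below, assuming you have one final score per random seed for each configuration (the scores shown are placeholders):

    import numpy as np

    def bootstrap_ci(scores, reps=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the mean score over seeds."""
        rng = np.random.default_rng(seed)
        scores = np.asarray(scores)
        means = np.array([
            rng.choice(scores, size=len(scores), replace=True).mean()
            for _ in range(reps)
        ])
        return np.quantile(means, [alpha / 2, 1 - alpha / 2])

    config_a = [0.31, 0.28, 0.35, 0.22, 0.30]   # placeholder per-seed scores
    config_b = [0.33, 0.41, 0.19, 0.27, 0.38]
    print(bootstrap_ci(config_a), bootstrap_ci(config_b))  # overlapping intervals: don't conclude much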
Robin:I understand that BBF is the latest in a long line of work built on top of Rainbow DQN, the 2017 agent descended from the original DQN, or Deep Q-Network. And Rainbow was a specific collection of improvements on top of DQN. Is BBF kind of similar, in that it's a collection of improvements on top of the previous generation?
Max:I think that's a good way of looking at it. Yeah. I don't think there was anything truly fundamentally new in the Rainbow paper; my recollection is everything in there had been published before. And in BBF, we do have a couple of bits that are novel to that paper.
Max:But for the most part, it's similar. We're introducing ideas that had been standard in the rest of reinforcement learning, non-sample-efficient RL, for years, and that just turned out to be harmful when you had small neural networks and very beneficial when you had bigger networks. So once you cross over that threshold of using bigger networks that do function approximation better, all of a sudden you can do essentially a second pass at Rainbow where you include another half decade's worth of innovations. So, yeah, I'd say it's fairly reasonable to compare it to Rainbow.
Robin:So we'll talk a little bit more about the details of what's in that BBF agent. But to set the stage and get there, can we very briefly talk about how we got from there to here, and what kind of innovations we see at the different steps?
Max:Absolutely.
Robin:From DQN to Rainbow, as you mentioned, the collection of published improvements that seemed to be important between those times.
Max:Absolutely.
Robin:And then we're moving on to Data-Efficient Rainbow and onwards. Can you give us a whirlwind tour through these levels? Obviously, there's a ton of material here, but just the general idea, if you had to summarize in a couple lines what these agents are doing differently at each level.
Max:Yeah. So we can start off at Rainbow, which you can think of as essentially the standard DQN even today. It's the original DQN but with a bunch of improvements that individually might be relatively small, but when you pile them all together are really significant. There was a paper after that, whose title has nothing to do with the algorithm that came out of it, called When to Use Parametric Models in Reinforcement Learning, that just happened to introduce this model called Data-Efficient Rainbow. You'll see it in sample-efficient RL papers labeled as DER.
Max:And that algorithm basically came out of the idea of, well, Rainbow doesn't do well in small data regimes, but what if we were to just tweak Rainbow's hyperparameters a little bit: train it harder, use a smaller neural network, a larger n-step return, that sort of thing. Could we get competitive performance in these very small data regimes, which really meant Atari 100k? Atari 100k had just been proposed by a model-based RL paper a few months before this one came out. And the answer was yes.
Max:Just taking Rainbow and making these changes gave you performance that was better than the model-based agent at the time.
Robin:OK. I remember SimPLe, and SimPLe kind of embarrassed Rainbow by having that graph that showed it did way better than Rainbow. So what you're saying is people went back to the drawing board?
Max:Exactly. So people went back to the drawing board, looked at Rainbow's hyperparameters, really just identified a few key pain points where Rainbow was not operating well in the low-data regime, and tweaked those. And it turned out that once you did, it performed quite a bit better than SimPLe did. And unlike SimPLe, which, they had a footnote in an early version of the paper, I think took them 3 weeks to do one training run.
Max:DER runs almost on CPU.
Robin:Yeah, it's an ironic name, I think, given the complexity of the model they had.
Max:Exactly. Yeah, it was insane. So DER, you know, the performance was terrible by modern standards.
Max:I think it gets an IQM score of about 0.2, which is one fifth of what BBF does. But for 2019, it was a real innovation. Then there was a flurry of papers that came out after DER, when people realized that sample efficiency was an open playing field and pretty much everyone knew we could do much better than what DER had shown. There was a family of self-supervised representation learning papers that came out. My paper SPR, which learns a latent-space dynamics model, was one of them.
Max:There were also some papers that introduced data augmentation in RL, and that ended up being very impactful as well. Very, very simple stuff: you just jitter your images by 4 pixels in each direction and your performance magically doubles. The reference for that would be Image Augmentation Is All You Need, which introduced the agent called DrQ.
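The augmentation Max is describing amounts to padding the frame and taking a random crop, a random shift of up to 4 pixels in each direction. A rough single-image NumPy version might look like this (DrQ itself operates on batched image tensors inside the training loop):

    import numpy as np

    def random_shift(frame, pad=4, rng=np.random.default_rng()):
        """frame: (H, W, C) uint8 image. Returns a copy shifted by up to +/- pad pixels."""
        h, w, _ = frame.shape
        padded = np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        return padded[top:top + h, left:left + w]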
Max:So that was sort of where the field stood in 2020, 2021. I took some detours in between there: I worked on pretraining for a while and then came back in 2022 with a paper called The Primacy Bias in Deep Reinforcement Learning, where we introduced the idea of just resetting the neural network's parameters every once in a while. So you completely, or mostly, throw away the learning progress your network has made and just restart learning from your replay buffer. And it turned out that doing this improved performance really significantly on Atari 100k and on DeepMind Control, from both images and states.
Robin:So what's going on there? We're getting better exploration because we're trying some new exploration after not getting burned in? I mean, what's happening?
Max:We thought that might have been the cause initially, but we were able to rule that out for the most part later on. Turns out it's really about function approximation. Which is funny to say, because everything in deep RL is function approximation. But as you train your network, especially your critic, your value network, for longer, it starts to suffer what people call plasticity loss. The network really loses its ability to adapt to new objectives.
Max:And the thing is, your learning in RL is always non-stationary. Right? Not only are you gathering new data, but every time you do a target network update, the actual function you're trying to match, the thing you're optimizing towards, changes. And what we saw was that when you took one of these networks and just trained it for a while, it actually lost the ability to fit new target networks. And you can induce this incredibly straightforwardly.
Max:Like, just take a SAC agent or something, gather maybe 10,000 samples into your replay buffer, train it for maybe 100,000 minibatches, and it will essentially be unable to recover from that no matter what you do. You could leave it running for a million steps or something and the performance won't improve. But if you just reset its parameters, all of a sudden it'll almost magically solve the task, for most of DeepMind Control. So it's really not about having better data. It's about having a clean neural network that's able to rapidly approximate new functions, which really means new target networks here.
Max:And then we had another paper called Breaking the Replay Ratio Barrier, which was at ICLR this year, where we introduced the idea of scaling by resetting. The real trick there is that if you set up your resets so that you do one reset for every fixed number of gradient steps, you can increase your replay ratio, how many gradient steps you do per environment step, pretty much arbitrarily. So we were able to go up to replay ratio 8 with SPR for Atari 100k, and absurd replay ratios for SAC, where it's much cheaper. I think we hit 256. Those are respectively 8 and 256 times higher than the standard, and we saw really clear performance improvements as we did that.
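Schematically, the recipe is a fixed replay ratio plus a reset every fixed number of gradient steps, along the lines of the sketch below. The agent and buffer interfaces are assumed placeholders, not the actual SR-SPR or BBF code, and the reset interval is illustrative.

    def train(agent, buffer, env_steps=100_000, replay_ratio=8, reset_every=20_000):
        """Schematic scaling-by-resetting loop (interfaces are assumed, not real APIs)."""
        grad_steps = 0
        for _ in range(env_steps):                 # Atari 100k interaction budget
            buffer.add(agent.act_and_step())       # collect one environment transition
            for _ in range(replay_ratio):          # replay ratio = gradient steps per env step
                agent.update(buffer.sample())      # one gradient step on replayed data
                grad_steps += 1
                if grad_steps % reset_every == 0:
                    agent.reset_parameters()       # re-init (most of) the network; keep the buffer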
Max:So, essentially, there's this whole new axis of scaling that we unlocked just by adding this reset mechanism and standardizing it. And then BBF came along after that. The trick in BBF is to take all of the improvements up to and including the Scaling by Resetting paper and then add on network scaling as well: bigger neural networks, and all of the changes that you have to make to your hyperparameters, etcetera, to get those big networks to really live up to their full potential.
Robin:So what kind of networks are you talking about at this point with BBF? What kind of structures? The original DQN network was pretty simple, a CNN plus fully connected layers.
Max:Exactly. So the original DQN network was pretty simple. It was 3 convolutional layers and then 2 MLP layers on top. The whole thing, depending on how you did your padding, how many actions you had, that sort of thing, could end up with maybe 4 or 5 million parameters, but it was almost entirely in the MLP layers.
Max:And pretty much every paper after that used almost exactly the same network for Atari 100k. The only exception was, actually, Data-Efficient Rainbow, which used something even smaller, a 2-layer CNN. And then for BBF, we go up to something that's still tiny, frankly, by modern standards, but is dramatically higher capacity. It's a ResNet based on the paper IMPALA.
Max:It's based on their ResNet, which we've scaled up quite a bit, just made much, much wider. That network has about 7 million parameters, which frankly is not that much more than the original network had. But because it's now distributed across something like 15 convolutional layers instead of 3, the overall capacity you get is dramatically higher. So we're not using anything like a transformer. It's pretty much, like I said, just a ResNet.
Max:And I think for transformers in particular, and this is on the transformer side more than anything else, we're not yet at the stage, I think, where it's possible to stably train those with the sorts of data sizes and batch sizes that you see in sample-efficient RL.
Robin:Mhmm.
Max:But there are quite a few papers out there that have tried pretraining those on really diverse sets of data. I think there's one that came out from FAIR, from Meta AI, recently, called Where Are We in the Search for an Artificial Visual Cortex?, that just trains a masked autoencoder on a massive number of different domains. And so I suspect that if you took that and plugged it into BBF, you could use a transformer if you really wanted to. But for now, it's just ResNets.
Robin:Let's talk about some of the improvements that, were used in BBF.
Max:Really, I think the way to look at our process for BBF was that there were 2 somewhat orthogonal avenues of development. One of them was to think about what you can do to help big networks recover from resets faster. Because we knew from our prior paper, Scaling by Resetting, that you would have to do resets to get good performance in the sample-efficient setting; the overfitting and plasticity loss that you see are just too extreme without them. And what we came up with there was what we call annealing in BBF, where you change the update horizon and the discount gradually over time after each reset.
Max:So you go from a regime where you converge very quickly, but to a function that doesn't really reflect your actual objective very well, to one where it's slow to learn but it's actually what you care about: very short update horizons and high discount factors. The other avenue was really to look at all of these papers out there that work on large-sample RL and just think about what ideas are there that we can use. Some of this involved looking at, say, EfficientZero and MuZero. Frankly, it's a lot of DeepMind papers for the most part, because they're the ones who train really, really large RL systems a lot of the time.
Max:And what you end up with there is things like much higher discount factors, weight decay, and shorter n-step returns, which show up in essentially every large-scale RL paper, and we found that those were very beneficial. Actually using a target network: again, very beneficial. It didn't matter much for standard Atari 100k agents, but it matters a lot when you have a big network. And then we pretty much stuck all of that together and saw the performance results that we saw.
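As a rough sketch of the annealing idea: right after each reset, use a long update horizon and a lower discount, then decay toward a short horizon and a higher discount over the first several thousand gradient steps. The endpoint values and schedule length below are illustrative, roughly in the spirit of what the paper describes, and the n-step target shown is the standard one.

    def annealed(steps_since_reset, start, end, duration=10_000):
        """Exponentially interpolate from start to end over `duration` gradient steps."""
        t = min(steps_since_reset / duration, 1.0)
        return start * (end / start) ** t

    def n_step_target(rewards, q_bootstrap, gamma, n):
        """Standard n-step return: discounted sum of rewards plus a bootstrapped tail."""
        target = q_bootstrap                  # e.g. max_a Q_target(s_{t+n}, a)
        for r in reversed(rewards[:n]):
            target = r + gamma * target
        return target

    steps_since_reset = 2_500                                       # example value
    n = round(annealed(steps_since_reset, start=10, end=3))         # update horizon, 10 -> 3
    gamma = 1 - annealed(steps_since_reset, start=0.03, end=0.003)  # discount, 0.97 -> 0.997
    target = n_step_target(rewards=[1.0, 0.0, 0.5], q_bootstrap=2.0, gamma=0.99, n=3)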
Robin:Were there any dead ends in the research process, or other things you considered that didn't make it to the final paper?
Max:Yeah. There were a couple of things that I was pretty excited about that didn't end up mattering, and they were all based on varying things over the course of training. One big thing we considered was varying the reset schedule over training: maybe you wanna do a lot of resets early on and then fewer resets later in training. We were never able to see any benefit from that, not in a statistically significant way.
Max:And since it makes the algorithm a lot more complicated, we just decided to drop it. The other thing we looked at that we didn't end up including was varying the replay ratio over training. By the same intuition, maybe you'd want to train really, really hard with a very high replay ratio early on and then lower it over time. Again, maybe it helps a little, but it's not significant, and it makes the algorithm much, much messier. So we dropped it.
Max:I'd say the only other dead end is that there are limits to how big you can make these networks. We tried this: you can't just take a ResNet-50 or something, put it in BBF, and expect to get good performance. So there's a lot of room, even before we start talking about transformers, to try to push that horizon out to larger and larger networks. That's probably going to mean different tricks, more SSL, something along those lines.
Robin:Were there big surprises, as you went?
Max:Yeah. I think network scaling ended up being dramatically more powerful than I expected. You can run BBF at an extremely low replay ratio, even with some of the tricks turned off, and it would still have been the state-of-the-art model-free agent in, you know, March 2023. And I think the real key for that was using the shorter update horizon, n equals 3 instead of n equals 10 or n equals 20. The reason I have that intuition is that I think with a long update horizon, the function you're trying to approximate is actually too simple.
Max:Like, you're not really able to capture the impact of any individual action you took. All you have is this general vibe of "one of the actions I took in these 10 time steps was good" or "one of the actions I took in these 10 time steps was bad." For a small network that's not able to actually learn anything interesting, that's actually great. For a big network, that's absolutely fatal. And the reason I think this is that I had played a lot with larger networks in one of my earlier papers, including training from scratch.
Max:I had this paper called Pretraining Representations for Data-Efficient RL back in 2021, where we pretrained some big networks on lots of Atari data and showed that you could fine-tune those really effectively to get good performance on Atari 100k. And I consistently found that training those bigger networks from scratch was just absolutely useless. It seemed to have no potential whatsoever. And looking back, the only hyperparameter difference was that I was using n equals 10 instead of n equals 3. So sometimes there's some massively important thing just hidden in your config file that you don't even think about.
Robin:Seems counterintuitive to me. I would kind of just guess that a larger network would be able to handle a larger number of time steps, in terms of having higher fidelity to model the complexity that that entails?
Max:It is better at modeling higher-complexity things. The important thing is how many gradient steps it takes for value information to propagate. If you think about the number of optimization steps required to capture what happens in one episode in the environment, a higher update horizon actually reduces that. Like, I could set n equals 100, and maybe my entire episode is captured, values propagated all the way from the end to the start, in 2 gradient steps. Whereas for n equals 3, it might take 100 gradient steps to do the same thing.
Max:So you have to be really correct at every single one of those gradient steps, at every single point in time, for the value information from the end of the episode to make it back to the start of the episode. And that's where I think big networks really shine and small networks really struggle. Whereas if you have a very long update horizon, say 100, 200, 1000, you're almost doing Monte Carlo learning at that point, and it's a pretty easy learning task. It's just not good enough for control. One other thing that I thought was really surprising, actually, is the relationship between network scaling and replay ratio scaling, because remember, replay ratio scaling is what I worked on right before this.
Max:Turns out they're almost exactly orthogonal, in the sense that I can increase my replay ratio from 1 to 8, and for the standard SPR agent maybe that increases performance by an IQM of 0.3. And then for BBF, with a dramatically larger network, I can increase the replay ratio from 1 to 8 and it also increases my performance by exactly 0.3, which really feels like something that just should not be true at some deep, deep level. There should be some sort of diminishing returns, or maybe there should be some positive interaction. I feel like it can't possibly hold outside of the regimes we're testing in, but so far, it seems pretty consistent.
Max:Like, you can train with a bigger network, and that'll help your performance by some fixed amount. You can train longer or harder, and that'll help your performance by some fixed amount. And it really doesn't matter what else you're doing. There are just these two fundamental axes of variation that help or hurt.
Max:Yeah. I still don't have an explanation for that, and I really hope somebody can dig into that and figure out what on earth is going on there.
Robin:Do you think that part of what we're seeing actually has to do with the nature of the ALE, the Atari environment, in terms of its specific regime of horizon, of episode length, of distribution of rewards, that type of thing?
Max:That would be my hunch. But what's weird about this is that BBF is in a very different regime than SPR was. SPR doesn't do better than humans on most games. BBF does better than humans on about half of games, and dramatically better than humans on quite a few, to the point where a lot of these games are solved. And yet we still see this pattern.
Max:So I don't think there are that many other discrete control environments that have the depth and diversity of Atari, but I really would love to see if this generalizes on something that isn't Atari. I just don't know where to go for it.
Robin:So back in episode 38, we talked to John Schulman, who of course invented PPO, a common policy-based algorithm, and we asked him if PPO for RLHF, that's reinforcement learning from human feedback, and this was back in 2022, was the same PPO as back in 2017, and he said it was basically the same PPO, the same algorithm. And then with these value-based approaches, starting even before Rainbow, there was this sense of a growing bag of tricks that were needed to really squeeze out performance from the value-based methods. So my question here is: do you think there's something fundamentally different between policy-based and value-based methods? Like, do value-based methods have to be more complex in the end? Or do you think that's not true?
Max:You know, it's a good question. I think there is something to that. If you were trying to take a policy-based method and make it compete in the same regime that the value-based methods shine in, which is really this extremely off-policy, sample-efficient setting where you have limited data and you're training as hard as you can on what little data you have, I think maybe you'd have to introduce a bunch of tricks. But in practice, I'd say the policy gradient methods, especially PPO, have grown specialized for essentially the infinite-data regime, where you have a simulator, or something that lets you query the environment as much as you like.
Max:And in that regime, it really seems like it's almost impossible to do better than them. So they're trying to do different things. For a policy gradient algorithm, you're just trying to find a steady improvement step that will get your performance higher. You don't necessarily care that much about how big that step is; you just wanna make sure it's always going in the right direction.
Max:Whereas for value-based, you're trying to push as hard as you can on improving performance as fast as you can, and it's not monotonic. PPO is famously almost monotonic: your performance just goes up and up and up over time. And I see this.
Max:I use PPO in my own RL-for-science work, where you do, in fact, have a simulator. And we see the same thing. It just works. I would never advise anyone who had access to a full simulator to use a value-based method. The flip side of that is, if you've got 2 hours of data, PPO is barely even going to start improving.
Robin:Let's talk about model-free versus model-based. Are you surprised that these types of model-free methods like BBF can get us this far with so little data? I guess I've always intuitively thought that model-based methods are gonna be more efficient.
Max:Yeah, I think I had the same intuition going into things. And actually, that first SSL-for-RL paper I did, SPR, got started because I wanted to take MuZero and add self-supervision to it, with this latent-space world model learning, and then show that that improved your planning. And it turned out at the end of the day, for that paper, that I could turn the planning off and my performance was the same or better. So we published it as a pure model-free paper.
Max:So I have a long history in my career of being surprised by this.
Robin:And why is that, just very briefly? Is it because you've captured everything so well in the value function that there's no point trying to plan on top of that? Or...
Max:I think that's kind of it. Yeah. You should think of a replay buffer as essentially a perfect nonparametric model of your environment, in the sense that if I just wanna know what a transition looks like in this environment, I can just look into my replay buffer and pull a transition out. And that is exactly right.
Max:It's sampled according to the distribution that I interacted with the environment under, but it's correct. And as a result of that, if you have this huge replay buffer of offline data and you're training really, really hard on that replay buffer, that actually starts to look a lot like Dyna. And there was a paper, actually the paper that introduced Data-Efficient Rainbow, that really laid out this duality clearly. It was called When to Use Parametric Models in Reinforcement Learning, and their point was, you know, hey, guys:
Max:A replay buffer is actually a pretty good model, so you should just keep your replay buffer around. Now that said, I'm not necessarily surprised that model-free methods can match model-based methods. I'm a little surprised by BBF in particular, because for BBF you can take away all of its self-supervision and auxiliary losses. I think it's called BBF minus SPR in the ablation in the paper.
Max:So it's purely doing reinforcement learning at that point, and actually you see that it's still really, really high performing. So the thought that you could train a big network with just a value estimation loss, with nothing else required for representation learning, that's definitely a surprise to me. I don't think anybody would necessarily have seen that coming 5 years ago.
Robin:Yeah. Back to the no-improvement-from-planning issue. I guess some environments, in a sense, maybe don't need planning. AlphaZero had this whole planning component because the game of Go is so complex that every move shifts the value function so much that, I think, straightforward evaluation of a value function wouldn't capture all the nooks and crannies of that value function.
Robin:And so they had to do this complicated thing looking ahead. But if we're trying to do Atari, the landscape is maybe much simpler, and the environment just doesn't require that. Maybe that's why planning wasn't needed.
Max:It's a lot flatter, to the point where there was a paper recently that set out to examine this in Atari and found that you could take the wrong action quite a few times and it wouldn't actually change your value function at all, because you would always be able to recover from it. Like, I can do whatever I want with my paddle in Pong, but as long as I get it where it needs to be in the last half second before the ball shows up on my side of the court, it didn't really matter at all. That's the exact opposite of Go or chess, where if I make 2 pointless moves I've essentially thrown the game at that point. Right? The MuZero paper showed this really clearly, where they had an experiment testing how important decision-time planning was. So do you do a planning round before you select an action in the environment, or do you just choose naively from your prior policy?
Max:For Go and chess, that extra planning is incredibly important. For Atari, my recollection is that it barely made a dent. So from that perspective, I think it makes sense that in this style of control, planning is not critical. That doesn't mean that model learning couldn't be beneficial, but you don't necessarily have to design a planning-heavy algorithm to get good performance.
Robin:Do you feel like these self-supervised losses are somewhat of a middle ground between model-based and model-free? Or do you prefer a more strict definition of what model-based RL is?
Max:So I do think they're intermediate. And that's sort of what I was getting at with "model learning can be beneficial": I think you can learn a model and not use it and still benefit from that. And that's basically what the SPR self-supervised objective in BBF does. You're predicting what states you're going to end up in in the future, conditioned on your actions, in the agent's own latent space.
Max:But we never use it for planning. It's simply that doing this objective helps you learn useful representations of the world. And I think we would expect to see this in general in a lot of cases. Right? Take even the simplest thing you could think of, like reconstructing the next frame.
Max:If you've got a lot of data and a big network, I absolutely expect that to get you better representations than purely doing reinforcement learning. And then after that, the question of how much you actually use that model for value learning, I think, comes down more to what environment you're in. Does it look more like a min-max search problem, or is it a more forgiving, almost robotics-style control problem, where you've got a lot of flexibility, a lot of possible screw-ups, before you've really lost the episode? Because if you can avoid planning, you always wanna avoid planning. Planning's expensive.
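A rough sketch of an SPR-style latent self-prediction loss is below, assuming PyTorch-style modules: an online encoder, an action-conditioned latent transition model, a predictor head, and an EMA target encoder. The module interfaces are placeholders; only the loss structure is the point.

    import torch
    import torch.nn.functional as F

    def spr_loss(encoder, dynamics, predictor, target_encoder, obs_seq, actions):
        """obs_seq: (K+1, B, ...) observations; actions: (K, B) integer actions."""
        z = encoder(obs_seq[0])                          # online latent for the first frame
        loss = 0.0
        for k in range(actions.shape[0]):
            z = dynamics(z, actions[k])                  # roll the latent model forward one step
            with torch.no_grad():
                target = target_encoder(obs_seq[k + 1])  # EMA-encoded future frame
            # negative cosine similarity between predicted and target latents
            loss = loss - F.cosine_similarity(predictor(z), target, dim=-1).mean()
        return loss / actions.shape[0]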
Robin:So I guess we saw some of that going back to the UNREAL agent a number of years ago, where there were some unsupervised losses, and that did seem to really help. But in that case they were predicting concrete things about the environment, or concrete value functions about the environment, whereas in your case you're doing something that maybe seems much less intuitive, like predicting the agent's own internal states, which seems a bit removed from the environment. Is it obvious to everyone that that should be a useful thing to do, even though it's kind of divorced from the actual observations?
Max:That's a very fair question. What I saw when I worked on my pretraining paper, that Pretraining Representations for Data-Efficient RL paper I was talking about before, is that pure latent-space self-prediction isn't enough if you're not also doing reinforcement learning. It's really the interaction of latent-space self-prediction and some other loss that grounds your representations in something useful, which for BBF, SPR, etcetera, is value prediction. That interplay is incredibly powerful. And you can sort of make up for that by introducing another objective, say goal-conditioned reinforcement learning, which you can do even if you don't have a reward you're predicting.
Max:Inverse dynamics modeling works extremely well for this. And the interplay with self-supervised representation losses is, I think, something where, if you go look at papers in the space and really look at them, you'll see all sorts of signs of this being very beneficial. But I think it hasn't really seeped out into practice in RL yet. And we're guilty of this even in BBF: we've only got the one SPR objective.
Max:And, to be fair, some of it's because it's annoying to implement a lot of objectives, but it can be worth it if you really care about performance.
Robin:So do you feel that either model-free or model-based has any inherent advantage in general, kind of asymptotically or otherwise?
Max:I'd say that as you're going toward the asymptote, I would expect model-free to generally have an advantage in computational efficiency, just because, like I said, planning is expensive. A lot of the time when you're doing model-based RL, you're learning something you don't necessarily otherwise need, like a decoder in the case of the Dreamer style of algorithm, for example, which again is just expensive. But I think model-based will hopefully make up for that with an ability to better handle multitask and nonstationary problems, where your model is the same but your reward function has changed or your reward function is somehow parametrized. Because in that case, you can potentially do all of the relearning for the new task purely inside the model, without actually having to gather new data.
Max:That said, I do sort of think that model-based and model-free algorithms are actually moving towards a position where they're surprisingly equivalent, which is just to say that as we work out the kinks in each class of algorithm, you get closer and closer to optimal performance. And what optimal performance in the statistical sense really means is that, ideally, in the limit, you're always making exactly the correct Bayesian conclusion about how you should update your value function from each interaction you see in the environment. Obviously, we never actually get there, but I think there are some interesting signs that we should expect to get closer and closer to that. There's a lot of work lately on how efficient in-context learning is, showing that things like large transformers, if you train them on enough data, and actually there was a paper that just came out very recently showing this for decision transformers, which are essentially doing imitation learning, that if you train them on enough data for long enough, they will essentially always make the correct Bayesian decision about what the optimal next prediction is, given what they've been conditioned on.
Max:And I think there's good reason to think the reason they do that is just because they're converging toward optimal performance. Right? And we should expect to see something relatively similar in reinforcement learning, just with a potentially messier path to get there, because RL is harder.
Robin:So back in episode 14, we featured Aravind Srinivas, who authored CPC and CURL, which are other approaches for representations in reinforcement learning. And I understand what you're doing is different here. Can you comment on our understanding of what makes for a good self-predictive representation for RL? Has that understanding grown a lot over these generations of agents?
Max:Yeah, absolutely. And, you know, it's funny you bring up CURL, because that was sort of the focus of the Deep RL at the Edge of the Statistical Precipice paper that we talked about earlier. Basically, the claims made in the CURL paper were somewhere in between wrong and fraudulent, depending on whether you attribute things to the authors just making mistakes or actively lying or trying to hide things. But basically, that CURL objective actually harmed performance on all the domains they were testing on.
Max:For continuous control, funnily enough, the machine learning Reddit figured this out maybe all of 6 weeks after they put the paper on arXiv.
Robin:Oh, wow.
Max:For Atari, it was a lot harder to track down. We only ended up figuring it out in the process of preparing that Statistical Precipice paper, when we reran what we thought was their experimental procedure for 100 seeds per game, got wildly different results, got in touch with the authors, and learned that their actual procedure was very different: they'd introduced essentially this max operation over a bunch of different evaluations of the same agent, which nobody else on the benchmark had, and which they didn't really talk about at all in the paper. And it turned out that if you applied that same max-operation experimental procedure to their baseline, Data-Efficient Rainbow, it actually beat CURL by quite a bit.
Max:So, yeah, I think we've gotten better at learning what good self-supervised representation learning objectives are, partially because RL has just grown up as a field. We don't really seem to do stuff like that anymore, at least not nearly as often, which is great. I don't think I can say this enough: I'm really, really happy with how the field has matured over the last few years. And in particular, though there's been a little bit of movement back toward this, we've seen a shift away from classical computer-vision-style self-supervised learning.
Max:I think CURL was essentially, with some caveats, applying MoCo, momentum contrastive representation learning. The shift has been towards things that are more RL-tuned. So SPR is really a latent-space dynamics modeling paper, which is RL-specific. There are a bunch of objectives out there, some from Mila also, like proto-value networks, where you're learning a bunch of different value functions at once, that I think capture the structure of reinforcement learning a little bit better. I will say, though, that I have seen papers recently using what are closer to classical computer-vision-style SSL objectives as a way to ingest extra non-reinforcement-learning data, to reduce the domain generalization gap that you might otherwise run into if you're trying to deal with sim-to-real or multiple environments or something like that.
Max:And that does seem like a good case for a masked autoencoder or a SimCLR-style loss. But in general, when you have reinforcement-learning-style data, sequences of observations with actions and rewards, there's a lot of fun stuff you can do there with SSL objectives that work really well in practice.
Robin:So speaking of latent-space models, we featured Danijar Hafner, of course, of the Dreamer series of agents. And Dreamer version 3 has been state of the art, and it's model-based, of course. Can you talk about the performance of BBF versus model-based, latent-model agents like Dreamer v3?
Max:Yeah. So BBF does quite a bit better than Dreamer v3 does on Atari 100k. I will say, to be fair to Dreamer, I don't think Danijar particularly tuned or thought about Atari when he was designing Dreamer v3. It seems like his real goal there was solving Minecraft. So maybe with a few tweaks, performance would be closer.
Max:As it is, BBF gets above a human-normalized score of 1. For Dreamer v3, I think I dug their raw scores out of their appendix for the BBF paper, and it was about 0.6. So quite a bit lower, more comparable to the generation of agents before BBF, from last year. Still respectable, but quite a bit smaller.
Robin:Yeah. But like you said, I think the emphasis there with Danijar's agent was also to make it very general, a plug-and-play algorithm that could basically work for a wide variety of environments, as opposed to being tailored for Atari 100k.
Max:Absolutely. And I think he did a great job of that, and I'm really impressed by the Dreamer family of algorithms overall and Dreamer v3 in particular. That generality is something I'm sort of working on right now. But it's for value-based methods, where there's a little bit more thought involved in getting generality, is what I would say. I'm pretty confident that BBF would work pretty well on most discrete control domains. But if you wanted to use it in a continuous action space, where Dreamer v3 works fine out of the box, you would really have to change how you do your value learning.
Robin:Mhmm.
Max:Whereas for something that looks more policy-gradient-like, and Dreamer v3 looks a little bit more like PPO, for example, in how the actual policy optimization works, the two settings are only slightly different. So it's fine either way.
Robin:How does exploration work in BBF? Is there any innovation in that department since the last few generations?
Max:You know, funny you should ask, because the exploration innovation in BBF is that we got rid of the main exploration module people have been using in Atari for years, called noisy networks. And it turned out that getting rid of it didn't change things a bit.
Robin:Oh, what's going on there?
Max:Yeah. So there's an interesting line of work that's come out over the last year or so on a phenomenon called policy churn, which basically says that every time you do a gradient step in a value-based method, you change your policy dramatically more than you would think. For the classic DQN, after one gradient step, my policy has changed on 10% of states. And remember, you could be choosing between as few as, like, 6 actions. But on 10% of states, you're choosing a different action after that gradient step.
Max:And as a result of that, especially when you're training at really high replay ratios, what's actually happening is your policy looks stochastic, because every time you're acting in the environment, it's different. So the upside of that is you don't need epsilon-greedy, you don't need noisy networks. You just need to train, and that's enough.
Robin:So you're getting some kind of exploration-exploitation mix just by...
Max:Exactly. And the great thing about policy churn is, I don't know if anybody has proved this yet, but you could sort of naturally expect that it's going to look like a softmax policy on your Q-values.
Robin:Mhmm.
Max:Just because, for me to think that action 2 is the new best action when previously action 1 had been best, I have to increase action 2's value above action 1's in that gradient step. Right? And how likely that is to happen is obviously some function of how much worse action 2 was rated before the gradient step. So for actions that are really close together, you're gonna alternate between those all the time.
Max:Actions that seem extremely suboptimal, you'll just prune those and almost never try them. I hesitate to say that it's the ideal exploration strategy; I really don't think it is. But for settings that are not sparse-reward tasks, where you pretty much always have some reward signal leading you in the direction of higher performance, where you're getting at least a few rewards an episode anyway, it seems to be sufficient in practice, which is definitely a surprise. For me, after SPR, I thought that exploration was the thing we would have to crack to get human-level performance on Atari 100k, and that turned out to not be true at all.
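One way to make policy churn concrete is to measure the fraction of held-out states whose greedy action changes across a single gradient step, roughly as in the sketch below. The q_network, optimizer, td_loss, batch, and held_out_states names are assumed placeholders for an ordinary value-based training setup.

    import torch

    @torch.no_grad()
    def greedy_actions(q_network, states):
        return q_network(states).argmax(dim=-1)

    before = greedy_actions(q_network, held_out_states)
    optimizer.zero_grad()
    td_loss(q_network, batch).backward()      # one ordinary value-learning update
    optimizer.step()
    after = greedy_actions(q_network, held_out_states)

    churn = (before != after).float().mean()  # roughly 0.1 for classic DQN, per the policy churn work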
Max:That is really interesting. It's a surprise.
Robin:And so as these agents are getting better performance through these generations, are the Q-values also getting more accurate, or is that not relevant?
Max:That's an interesting question. Oh, man. I honestly don't know. My hunch would be that they are getting more accurate.
Robin:It's obviously not the focus, right? Unless you're using it for evaluation.
Max:And I have not actually gone and looked at the Q-values it's predicting. That said, my strong hunch would be that they're dramatically more predictive now of your actual long-term return than they would have been with, like, the original DQN.
Robin:Mhmm.
Max:I don't wanna say that they're good per se. I doubt they would be anywhere near as accurate as if you just trained regression against Monte Carlo returns. But I think it's probably not nearly as dysfunctional as the original DQN was, just because we've fixed so many things between 2015 and now that it seems almost impossible to me that it could have not gotten much better. It certainly couldn't have gotten worse.
Robin:We featured some guests who focus on meta-RL and RL-squared and things like that, where the agent is really learning more of the mechanism. So, for example, Jeff Clune was saying a lot of our algorithms are these certain tricks that we discover. I'm calling them a bag of tricks, but they're innovations on certain components that are combined, and that we just discover through research. But he was saying, what are the chances that we're gonna really discover all these tricks manually? He put that at a very low probability, and so we need to find ways to discover these things. What do you think about this, in terms of the trajectory going forward of manually coming up with more and more of these kinds of small components or small tricks that cumulatively are improving our agents?
Robin:Do you think that search is gonna become automatic at some point? I mean, in some circles people are talking about AGI and how ChatGPT is going to change everything and provide us with AGI. But here on the RL side, we're really doing a lot of manual exploration, it would seem. Where do you think that balance is going forward?
Max:It's a really interesting question. I have some confidence in meta-RL, but what I've always feared with a lot of meta-RL is that even if you can come up with a trick, some meta-optimized way to do something that helps performance, how do you transfer that? It's really a knowledge transmission problem, almost an AI culture problem: culture for the machine learning agents themselves. How do you transfer knowledge from one generation of agents to another? Doing this human exploration, where I go off and train 300,000 DQNs, see what works and what doesn't, and then write that up in a paper, is a way that hopefully guarantees that all of these ideas survive and get implemented later on. For meta-RL to be really impactful, I think you have to come up with something that has the same property.
Max:Like, it gets distilled down in some way that lets other people who don't have your agent, your weights, still benefit from it later on. We've talked about some of this, how to do this hyperparameter-agnostic transmission, I wanna call it, in our reincarnating reinforcement learning work, where the idea is really that you take one agent and you just try to distill all the knowledge in that agent into another agent, so you can keep the benefits of what someone before you did without being tied to their decisions. But I do think there's a lot to be said for just carefully experimenting with stuff, finding things that work, and putting them down in writing so that other people who come later can just look at it and directly implement it. Basically, natural language is just a really, really easy way to represent ideas.
Max:That's all it comes down to.
Robin:That makes sense. I guess if there's some innovation buried in some function approximator somewhere, what are the chances we could ever get that out?
Max:Exactly. Like, I I absolutely believe that some RL squared or MetaRL algorithm could come up with a dramatically more efficient learning algorithm than BBF if you trained it for as long as I trained agents trying to come up with BBF. But how do I give that to someone in a way that they can use later on?
Robin:And can you talk about what kind of domains, agents like BBF are are really best suited for? Obviously, we're talking about discrete action spaces and and maybe relatively small discrete action spaces and smaller sample sizes.
Max:Yeah. So for domains that BBF would be best suited to, you alluded to the discrete action space idea. The other thing is that BBF is really not designed for ultra-hard exploration problems. If you hand me Montezuma's Revenge, and I know this because I ran it, BBF gets a score of 0 on Montezuma's Revenge.
Max:Now, admittedly, that's only with 2 hours of data, as opposed to, you know, 40 days or whatever the standard agents would get. But still, it's not going to do magically efficient exploration for you.
Robin:What about something like Deep Sea? Because, I mean, 2 hours on Montezuma is maybe
Max:not bad, yeah. 2 hours on Montezuma, I mean, okay, to be fair: I personally played Montezuma for 2 hours, and I was not able to get anywhere near the recorded human score.
Max:So I didn't know where that number came from. I think I was barely getting out of the first room. Yes, Deep Sea is another good example. BBF is not likely to do systematic exploration.
Max:I do think you need another module for that. Not to say that you couldn't take the BBF code and slap that module in there. And if you did that, I'm sure it would work better than slapping that module on top of Rainbow.
Max:I would think of BBF as essentially a bigger, better, and faster version of Rainbow. So any place where you could use Rainbow, you could use BBF, and it will probably perform quite a bit better. But that means probably no continuous control, unless you figure out action space discretization for your problem, which, to be fair, might not be that hard; action space discretization seems to be looking surprisingly powerful lately. But then, yeah, also no hard exploration.
Robin:Sorry, just to go back into the weeds for a second: I noticed in one of the network diagrams you're using this exponential moving average.
Max:This is an interesting one. Yeah. All of the work in Atari up until very recently used either no target network or a hard target network, meaning you have a set of target network parameters you leave in a box, and every 4,000 or 8,000 steps you take your current parameters, overwrite the target parameters, and keep going. So you're training against this stationary version of your past self.
Max:What BBF does is more inspired by work from, I would say, the self-supervised learning literature. The first place I would have taken this from in SSL is probably the paper Mean Teachers Are Better Role Models, which is an excellent paper, by the way; it predated a lot of the modern work on SSL. And then also from Bootstrap Your Own Latent, and from what people do in continuous control: SAC, everybody out there has an exponential moving average target network. And what we found is that when you're trying to train a big network efficiently, that's actually really important.
Max:As you scale things up, having the right target network design, meaning EMA, makes the difference between your algorithm completely collapsing and, you know, being state of the art.
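For readers who haven't seen the two target-network styles Max contrasts, here is a minimal PyTorch sketch. The tau value and update frequencies are illustrative defaults, not BBF's exact settings.

```python
import copy
import torch

online_net = torch.nn.Linear(4, 2)      # stand-in for the Q-network
target_net = copy.deepcopy(online_net)  # target starts as a copy

def hard_update(online, target):
    """Classic DQN-style target: overwrite the target every N steps."""
    target.load_state_dict(online.state_dict())

@torch.no_grad()
def ema_update(online, target, tau=0.005):
    """EMA target (as in SAC- or BYOL-style targets): blend a little of the
    online network into the target at every step."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)

# Training loop sketch:
# for step in range(num_steps):
#     ...compute TD targets with target_net, update online_net...
#     ema_update(online_net, target_net)            # every step
#     # vs. hard_update(online_net, target_net) every few thousand steps
```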
Robin:It's interesting to me that you're getting to roughly human-level learning times, and exceeding that in many cases. But, of course, humans come into a game with a lot of prior knowledge about things like what objects are, what actions mean, how geometry and physics work, the idea of motives, and I'm sure that gives us an advantage with these games. And, of course, your agents don't have any of this background knowledge. So I wonder if you have any comments on how far you think the sample efficiency can be taken with no prior knowledge, versus when we get to the point where we're going to need to know that a skull is probably not a good thing to touch, or that a key is something we generally do like. Are we near the point where we're going to need prior knowledge to get any further?
Max:Absolutely. Yeah. That's a great question. You know, one of my friends told me about a bet he had made with a colleague at DeepMind: will anybody get human-level performance on Montezuma's Revenge in the Atari 100k setting, you know, 100,000 steps, within the next few years (I forget the exact timeframe), without prior knowledge? And my side of that bet, I don't want to bet against myself essentially, but my side of that bet is essentially no.
Max:I really do think knowing what a game is starts to be important, especially for sparse reward tasks like Montezuma's Revenge, of course. Now, what I would say about prior knowledge that's interesting is that for a lot of these environments, I think humans are actually held back a little, not so much by our prior knowledge as by our embodiment. A human interacts with Atari, if you're old school, by physically reaching out with their hand and moving a joystick. Right? Or, if you're playing it on an emulator nowadays, with a keyboard.
Max:And an agent doesn't have any of that. It's not moving appendages. You've really just got this space of 16 or 18 actions, implicitly zero latency, and you choose one at 15 Hertz.
Max:And that's something that humans actually can't really do. So, you know, I think if you were going to try to solve Atari in the setting that a human has to solve it in, where you're moving a physical controller around to manipulate the environment, at the control frequencies and latencies implied by the size of network a human has, that's actually going to be really, really hard without prior knowledge. So, yeah, it goes both ways. Because of this, our agents are solving a much easier version of Atari than what humans do.
Robin:So even the sticky actions don't really even the score that much?
Max:I mean, sticky actions is what, a 25% chance of repeating the previous action, which means an action carries on for an extra fifteenth of a second? Okay, I admit I am not good at Atari. Like, my agents achieved super-Max performance a long, long time ago.
Max:But even so, for me, I cannot do fifteenth-of-a-second precision. Not even close. So I really don't know. Probably the folks who set the true human records for these can.
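For context, here is a minimal sketch of the sticky-actions mechanism Max describes (the ALE evaluation protocol recommended by Machado et al.): with probability 0.25 the environment repeats the previous action instead of the one just chosen. The wrapper is generic and only assumes the wrapped environment exposes reset() and step().

```python
import random

class StickyActions:
    def __init__(self, env, repeat_prob=0.25):
        self.env = env
        self.repeat_prob = repeat_prob
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        # With probability repeat_prob, ignore the new action and replay the old one.
        if self.last_action is not None and random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```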
Robin:Unless it was, like, turn based. Right? Turn based Atari. Can you imagine?
Max:I mean, that's well, when you think about it, that's what our agents see. Right?
Robin:Yeah. Exactly.
Max:It's a much, much easier setting. If you gave me turn-based Atari, I would probably knock it out of the park. Now, I would also take 8 hours to play one game of Pong, but I'd knock it out of the park.
Robin:Yeah. It wouldn't have been good for business.
Max:So the other thing I wanted to get at with this, though, is that it's been surprising how far we can get without prior knowledge. Like I said, I worked on pretraining back in the day on the intuition that you had to give these networks useful representations before they'd be able to interact in a sample-efficient way. BBF blows the results I had back then out of the water. And I think what we really didn't understand in, you know, 2020, 2021 was just how many individual little details were wrong in our networks, from plasticity loss to not having enough regularization to just really bad hyperparameters. And, you know, individually, each one of those is a trick that has some small quantitative effect but not much qualitative effect.
Max:But when you really pile together all of these changes, it starts to have this huge qualitative shift, where things that are small and irrelevant on their own suddenly, in the plural, are extremely impactful. And I think that's something people maybe didn't really realize back then, and we probably should have seen it coming if we were thinking about it more.
Robin:Aside from the prior knowledge issue, is the generalization ability of our function approximators holding us back? Do you have any sense of the balance between how much of the remaining regret is due to the weak function approximation we have versus the rest of the algorithm?
Max:Alright. So if I have to lay down a bet, I would say I think we can get to about human-normalized score 2, so, you know, twice as good as BBF on average, just by fixing up residual problems with function approximation. That could mean just sticking together more tricks. It could mean bigger networks or a different family of network. You know, maybe a nicely regularized transformer could do much better than the ResNets we have.
Max:So, yeah, I think all of those together could get you to about that point, and maybe that could also include some exploration bonuses as well. All of these things that aren't really prior knowledge but are just algorithmic improvements can probably about double BBF's performance. If you really want to go beyond that, though, yeah, you're going to need prior knowledge. So realistically, given how the field is looking right now, that's probably going to mean figuring out how to use LLMs for this.
Robin:Okay. I mean, I guess things like augmentation are kind of injecting some prior knowledge in a way. It's saying, well, if you shift things to the left and right, it doesn't mean anything different, which is something that we know but the raw network wouldn't.
Max:Exactly. Almost everything we do in our design is prior knowledge. I mean, even just exploration. The network has no reason to think there's a positive reward anywhere if it hasn't seen any. Like, my agents probably live and die on Montezuma's Revenge thinking that the reward function is uniformly 0.
Max:But when we put in an exploration objective, we're really saying, no, maybe it's not. You should go look. I bet it's not. So, yeah, there's prior knowledge everywhere, of course.
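As an illustration of the shift-invariance prior Robin brings up, here is a minimal sketch of DrQ-style random-shift image augmentation, the family of augmentation commonly used in sample-efficient pixel-based RL. The 4-pixel pad is a common choice but is illustrative here, not a claim about BBF's exact settings.

```python
import numpy as np

def random_shift(frames, pad=4, rng=np.random):
    """frames: (H, W, C) observation. Pad by replicating edge pixels, then
    crop a random H x W window, i.e. shift the image by up to `pad` pixels
    in each direction."""
    h, w, c = frames.shape
    padded = np.pad(frames, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.randint(0, 2 * pad + 1)
    left = rng.randint(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

# Usage: augment each sampled observation before computing the TD loss.
obs = np.zeros((84, 84, 4), dtype=np.uint8)   # Atari-style stacked frames
aug_obs = random_shift(obs)
assert aug_obs.shape == obs.shape
```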
Robin:Can you say anything about what, what you're working on at the moment?
Max:Yeah. I probably am not allowed to go into too fine a detail, but I am interested in prior knowledge, because I do think that's probably the lowest hanging fruit for us right now. And I am, of course, interested in doing a BBF v2, sort of following in Danijar's footsteps and pushing things a little bit further. So, yeah, there's hopefully a lot of fun stuff to come.
Robin:Awesome. We look forward to that. And besides your own work, do you want to mention any other things happening in RL lately that you find pretty interesting?
Max:Yeah. That's a good question. I've been really impressed, obviously, by Dreamer v3. There are a few details that I think MuZero did better, and I really wish that Danijar could use his status at DeepMind to get those MuZero details released for the rest of the world in a Dreamer v4, but I suspect that's not going to be allowed. Other stuff that's been cool: there's a really interesting paper that came out recently on designing reward functions for RL agents with LLMs, I believe by the Google Robotics gang. Essentially it all comes down to GPT-4 being magic, but they found you can get a reward function that, with model predictive control, will lead you to the right policy pretty much just by asking for what you want the agent to do through GPT-4, which I thought was astonishing.
Max:And I think there's a lot of stuff that could potentially come out of that in the future. Because if you just ask GPT-4, for example, how do I solve Montezuma's Revenge? It's wrong, but it's not that wrong. And I bet if you showed it a picture of Montezuma's Revenge, it might actually get it. And then can it, you know, write a reward function that'll steer your agent to go solve it?
Max:I bet it could. So, yeah, a lot of exciting stuff to come in the future, I think. I really do feel like even though RL has sort of lost the limelight of the AGI community, or the AGI eye of Sauron out on Twitter, it feels like a very, very good place to be as an RL researcher right now. A lot of fun stuff happening.
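To make the idea Max describes concrete, here is a hedged sketch of asking an LLM to write a dense reward function from a plain-language task description and plugging it into a simple random-shooting model-predictive controller. `ask_llm` is a placeholder for whatever chat API you use, the `reward` below is only an example of the kind of code a model might return, and `dynamics` is an assumed learned or simulated model; none of this is the exact recipe from the paper Max mentions.

```python
import numpy as np

PROMPT = """You control a robot described by state s = (x, y, gripper_closed).
Write a Python function reward(state) that is high when the gripper is at the
target (1.0, 1.0) and closed."""

# def ask_llm(prompt: str) -> str: ...   # placeholder: call your LLM of choice

# Example of the kind of function an LLM might write back:
def reward(state):
    x, y, gripper_closed = state
    dist = np.hypot(x - 1.0, y - 1.0)
    return -dist + (1.0 if gripper_closed else 0.0)

def plan(dynamics, state, horizon=10, n_samples=256, rng=np.random):
    """Random-shooting MPC: sample action sequences, score them with the
    LLM-written reward, and return the first action of the best sequence."""
    best_score, best_action = -np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        s, score = state, 0.0
        for a in actions:
            s = dynamics(s, a)          # assumed learned or simulated model
            score += reward(s)
        if score > best_score:
            best_score, best_action = score, actions[0]
    return best_action
```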
Robin:On that topic, do you think much about AGI?
Max:Yeah, I totally do think about AGI. You know, there are all these papers out there claiming that GPT-4 is already AGI. I think there was the Sparks of AGI paper from Microsoft, and that extremely controversial one recently claiming it could pass all the MIT classes, which I think people thought was bogus, if I recall correctly, but I didn't look into it in detail.
Max:But, you know, the core message I'm seeing here is that we're doing extremely well for purely textual domains. And I suspect that with the next generation of models, we'll do pretty well for any domain that is textual plus visual. It may not be great at generating videos yet, but it'll be great at summarizing a video. And I think where we need RL is to make sure that embodiment and interaction don't get left behind. You know, I don't want a world where every job that today is done by people sitting in an office gets replaced by GPT-4.
Max:And thus everyone has to go and work in a farm field because that's the only thing AGI can't do. I really think we should be trying to automate away the worst jobs, which are, you know, farming, plumbing, etcetera, as fast as we can. And RL is really where that comes in. Right? So, yeah, I do think about AGI a lot in the long term.
Robin:So in terms of RL being used for robotic control and actuators and things like that. But I guess, of course, RL was a key component of ChatGPT itself, and led to a lot higher performance than without it.
Max:Absolutely. Without RLHF, I mean. There are definitely some folks who think that you could get ChatGPT-level performance without RL nowadays. I think a lot of the various LLaMA-plus-plus-type models are mostly not RL, is my understanding. But I do think RL is extremely valuable the moment you start to have interaction.
Max:Tool use, for example, seems like a great example of a case where RL would be good.
Robin:Yeah. I remember John Schulman saying that imitation learning was kind of doing the wrong thing in LLMs, and RL was the only way to do the actual right thing. So maybe RL really is a key component to making these instruction-tuned LLMs really, really high performance.
Max:I do suspect that's true. Yeah. I really wish I had a deeper sense of the exact comparison between the two, and I think, unfortunately, a lot of that information just hasn't been released in detail; probably everybody who needs to know this at OpenAI, DeepMind, Anthropic, etcetera, does know it. Because, yeah, there are folks out there claiming they can take, you know, a 1,000-example instruction set, do supervised fine-tuning on it, and get great performance. And, you know, that seems a little questionable somehow.
Max:We know that they did RLHF and that it was important. So what's the trick?
Robin:I guess the question is great performance on what? Like, there's such a wide variety of difficulty of tasks, and I think there's some kind of glossing over sometimes of, like, what what exactly the difficulty level we're dealing with.
Max:That's my hunch too, that the evaluation is just not there yet. We're seeing good performance on the easy stuff and not on the hard stuff.
Robin:Max, is there anything else you wanna share with our audience, today while you're here?
Max:Yeah. I really want to encourage everybody who's in RL to not despair. I know a few of my friends have been concerned by the shift in hype towards LLMs. And I understand that; I felt that a little bit too.
Max:But I think today is actually an amazing time to be an RL researcher. Like we were just saying, RL is very useful for these LLMs themselves. And I think once people get a little bit bored with textual domains, and that's probably going to happen pretty soon, there's going to be a lot of attention out in the world on how we get this stuff interacting with the real world, with embodied agents, with robots, even how you put GPT-4 inside of Fortnite, things like that. And that's where RL starts to be really valuable again. So I'd say now is a great time to be an RL researcher.
Max:And we should all just keep at it, keep the faith, keep pushing for a little while longer until things start feeling, at least on Twitter, like they're going more our way again.
Robin:Max Schwarzer, thank you so much for doing this interview and for sharing your insight with the TalkRL audience tonight. Thank you, Max Schwarzer.
Max:Thank you so much, Robin. It's been a pleasure.