NeurIPS 2024 - Posters and Hallways 2
Hi. I'm Jonathan Cook, a PhD student at the University of Oxford, and I'm presenting my work "Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning," where we show that you can overcome primacy bias and premature convergence by making your RL training process generational. So we have multiple generations of social learning: I train an agent in the environment, freeze the agent after it's converged, keep that agent acting in the environment, initialize a new agent, train it in the shared environment, and find that it does better than the previous generation.
Speaker 1:And then I can repeat this process, and we find that this is able to overcome the premature convergence of the first generation, which has plateaued at some suboptimal performance.
Speaker 2:And are the agents motivated to imitate the previous generation? Or how does that work?
Speaker 1:Yeah. The agents are not explicitly motivated to imitate the previous generation at all. The only motivation is the fact that they are trained with the same goal. So once the previous generation is frozen, it's essentially part of the environment, and it provides signal about what the new agent should be doing in that environment, because it was trained with the same reward function. So there's an agent in the environment that, if the new agent were to blindly follow it, it would be doing the task to some degree, because that agent was trained to do so.
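To make the setup concrete, here is a minimal sketch of what such a generational loop could look like. The interfaces (make_env, make_agent, train, freeze) are hypothetical placeholders for illustration, not the paper's actual code; the key point is that each new learner trains on the ordinary task reward while the frozen predecessor simply keeps acting in the shared environment.

# Hypothetical sketch of a generational training loop (not the paper's code).
# Each generation trains with the same task reward; the frozen previous agent
# is just part of the environment, so imitation is never an explicit objective.

def train_generations(make_env, make_agent, num_generations, steps_per_gen):
    frozen_predecessor = None
    trained_agents = []
    for gen in range(num_generations):
        # The frozen predecessor (if any) keeps acting in the shared environment.
        env = make_env(other_agent=frozen_predecessor)
        agent = make_agent(env)
        agent.train(env, num_steps=steps_per_gen)  # standard RL on the task reward
        agent.freeze()                             # stop further gradient updates
        frozen_predecessor = agent
        trained_agents.append(agent)
    return trained_agents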
Speaker 3:Hi. I'm Yifei from UC Berkeley, and today I'm presenting our work called DigiRL: Training In-the-Wild Device-Control Agents with Autonomous Reinforcement Learning. The setting of this paper is a general Android emulator, and we want to train an agent to do whatever we would do with that emulator, like buying stuff from websites or seeking information from Google Search. Existing approaches, such as prompting GPT-4o or doing supervised fine-tuning on human demonstrations, can be suboptimal for this task. Our proposal is to use autonomous RL, where we have an autonomous evaluator that basically serves as a success detector, looking only at the last three time steps to see whether the task has been completed or not.
Speaker 3:And we use this as a reward signal to train our RL agent. In our experiments, we found that this autonomous RL approach can boost performance from around 20% to 60% with around 1,000 trajectories, much better than using existing proprietary models like GPT-4o or just doing supervised fine-tuning on static human demonstrations.
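As a rough illustration of that reward signal, here is a hedged sketch of an autonomous-evaluator reward. The evaluator object and its judge method are hypothetical stand-ins for whatever model performs success detection; this is not DigiRL's actual code.

import dataclasses
from typing import Any, List

@dataclasses.dataclass
class Trajectory:
    observations: List[Any]   # e.g. screenshots from the Android emulator

def autonomous_reward(task_instruction, trajectory, evaluator, k=3):
    # Show the evaluator only the last k observations, as described above,
    # and convert its yes/no judgment into a sparse binary reward.
    recent_obs = trajectory.observations[-k:]
    success = evaluator.judge(instruction=task_instruction, observations=recent_obs)
    return 1.0 if success else 0.0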
Speaker 2:Cool. And can you tell us more about the two levels of value functions that you have here?
Speaker 3:Yes. We validated the best and simplest design choices for using RL in this device-control setting, where we have two levels of value functions. One is an instruction-level value function, designed to find tasks of the right difficulty for the agent to learn from at its current stage, for fast learning. And we also have a step-level value function to find the best actions, the ones that lead to the most promising states.
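To give a flavor of how those two levels might be used, here is a hedged sketch. The instruction_value and step_value functions, and the intermediate-difficulty heuristic, are illustrative assumptions, not the paper's actual curriculum or advantage estimator.

import random

def pick_task(tasks, instruction_value, temperature=1.0):
    # One way to read "right difficulty": prefer tasks whose estimated success
    # probability is intermediate, neither trivial nor currently hopeless.
    # (Illustrative heuristic only.)
    def difficulty_weight(task):
        v = instruction_value(task)      # estimated probability of success
        return max(v * (1.0 - v), 1e-6)  # peaks at v = 0.5
    weights = [difficulty_weight(t) ** (1.0 / temperature) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

def pick_action(state, candidate_actions, step_value):
    # Step-level value: score candidate actions by how promising the
    # resulting states look, and take the best one.
    return max(candidate_actions, key=lambda a: step_value(state, a))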
Speaker 4:So I'm Rory Young from the University of Glasgow, and I'm presenting my work Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach. In this work, we're looking at the stability of trained reinforcement learning policies subject to local state perturbations. We start off by showing that if there's a small change in the initial state, the long-term state trajectories tend to diverge quite significantly, at exponential rates as characterized by the maximal Lyapunov exponent. So we show that a number of state-of-the-art deep RL policies in continuous control domains produce chaotic dynamics. To address this, we propose MLE regularization, which minimizes the variance between predicted state trajectories for Dreamer V3.
Speaker 4:We find this has similar performance to standard Dreamer V3, but it reduces the maximal Lyapunov exponent quite significantly.
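For context, here is a hedged sketch of how a maximal Lyapunov exponent can be estimated numerically from trajectory divergence. The rollout_fn interface is a hypothetical placeholder (for example, a learned world model or a simulator rolled out under a fixed policy); this is not the paper's Dreamer V3 implementation.

import numpy as np

def estimate_mle(rollout_fn, state, eps=1e-6, horizon=100, rng=None):
    # Finite-difference estimate of the maximal Lyapunov exponent: perturb the
    # initial state slightly, roll out both trajectories under the same policy,
    # and measure the exponential rate at which they separate.
    rng = np.random.default_rng() if rng is None else rng
    perturbed = state + eps * rng.standard_normal(state.shape)
    traj_a = rollout_fn(state, horizon)       # array of shape (horizon, state_dim)
    traj_b = rollout_fn(perturbed, horizon)
    d0 = np.linalg.norm(perturbed - state)
    dT = np.linalg.norm(traj_b[-1] - traj_a[-1])
    return np.log(dT / d0) / horizon          # > 0 suggests chaotic dynamics

A regularizer in the spirit described above would then penalize this kind of divergence between trajectories predicted from nearby states.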
Speaker 5:Yeah. Good to meet you again, Robin. I'm Glenn Berseth. I'm a professor of RL and robotics at Mila. And this research is basically figuring out how we can improve convergence and stable training for deep RL algorithms.
Speaker 5:And mostly what we found in a couple of recent papers is that we've been looking at this property called churn, where, across the state space, after every single gradient update about 10% of all the greedy actions will actually change for something like DQN. So we've been looking at how much this affects training and convergence, and we've developed a regularization method to reduce this churn significantly. It results in much better performance gains for almost all the algorithms we've been looking at recently, things like PPO and DQN and DDPG. It also results in better scaling potential for a lot of algorithms, including PPO, which people are really excited about because it's been one of the more challenging ones to scale lately. And the idea is, when we're training deep reinforcement learning policies, this churn also happens a lot in actor-critic methods: your policy updates, 10% of your actions change, that causes more churn and divergence in your value function, and then your value function diverges and churns a lot as well.
Speaker 5:So what we've developed is: whenever you train on a batch for your policy, also evaluate a held-out batch of data that isn't used for that particular policy or critic update, and use it basically as a KL regularizer. It says, okay, don't change these actions; only the actions in your current value function or policy update should change. And we can see, at least in the graphs right now, that this helps control a lot of the churn across different policies and results in basically better performance, using DQN, using PPO on your typical MuJoCo benchmarks, and using TD3 and things like DDQN across many Atari environments. It also helps when we scale our network up to about eight or sixteen times its normal size: we still get much better convergence performance than other algorithms without it.
Speaker 5:And at the moment, we're looking at some other updates on this and even being able to apply this to RLHF to see how much it improves training for those algorithms as well.
Speaker 2:And this is both for discrete actions and continuous actions?
Speaker 5:Yes. Definitely. Because this has actually been a really easy and simple method you can add to almost any RL algorithm, after some original debugging and working this out in research, we were able to adapt a lot of the CleanRL algorithms just by changing about two lines of code. We pull one more batch of data from the dataset and use a regularizer that constrains out-of-batch changes to the actual policy distribution. So, yeah, it works for TD3 and SAC, which are continuous, and so is PPO, and also for DQN, which is a discrete-action, value-based algorithm.
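To make that concrete, here is a hedged sketch of what such a regularizer could look like in PyTorch-style code. The interfaces, the KL form, and where the reference snapshot is taken are illustrative assumptions, not the authors' implementation or the actual CleanRL diff.

import torch
import torch.nn.functional as F

def update_with_churn_regularizer(policy, optimizer, batches, heldout_states,
                                  policy_loss_fn, reg_coef=1.0):
    # Cache the policy's action distribution on a held-out batch before this
    # round of updates; these actions are not meant to change.
    with torch.no_grad():
        ref_log_probs = F.log_softmax(policy(heldout_states), dim=-1)

    for batch in batches:
        loss = policy_loss_fn(policy, batch)  # the usual PPO/DQN/SAC objective

        # KL(reference || current) on the held-out states penalizes churn:
        # changes to actions the current update was never meant to touch.
        cur_log_probs = F.log_softmax(policy(heldout_states), dim=-1)
        churn_kl = F.kl_div(cur_log_probs, ref_log_probs,
                            log_target=True, reduction="batchmean")

        optimizer.zero_grad()
        (loss + reg_coef * churn_kl).backward()
        optimizer.step()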
Speaker 6:Awesome. Okay. Sure. Thanks. Hi.
Speaker 6:I'm Alex. I'm from the University of Oxford and, representing JAXmile here at NeurIPS. JAXmile is a combination of popular mile benchmarks and algorithms, but implemented in JAX. So we have, nine model benchmark environments and I think eight, algorithms. This allows you to compare either your new method, on a wide range of environments and against a lot of algorithms very easily, very quickly, massively lowering the amount of computation required.
Speaker 6:So for one example, if you wanted to train on the popular SMAC, or StarCraft, benchmark, with JaxMARL it's now 31 times faster; or with MPE, another popular benchmark, it's 14 times faster. So, yeah, we'd love for you to check out our work. We also have SMAX, which is an updated version of the SMAC benchmark that corrects a couple of issues with the heuristic AI, and it doesn't require you to run the StarCraft game itself, hence the speedup. Yeah.
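For reference, here is a minimal usage sketch along the lines of the JaxMARL README, showing how an environment is created and stepped; the exact environment name and API details should be checked against the repository.

import jax
from jaxmarl import make

key = jax.random.PRNGKey(0)
key, key_reset, key_act, key_step = jax.random.split(key, 4)

env = make("MPE_simple_spread_v3")        # one of the MPE environments
obs, state = env.reset(key_reset)

# Sample a random action for each agent.
act_keys = jax.random.split(key_act, env.num_agents)
actions = {agent: env.action_space(agent).sample(act_keys[i])
           for i, agent in enumerate(env.agents)}

obs, state, reward, done, infos = env.step(key_step, state, actions)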
Speaker 6:Perfect. Thank you very much.