NeurIPS 2024 - Posters and Hallways 1

Speaker 1:

So my name is Jiaheng Hu. I'm from the University of Texas at Austin, and our work is called Disentangled Unsupervised Skill Discovery for Efficient Hierarchical Reinforcement Learning. Unsupervised skill discovery considers the problem where an agent can do reward-free interaction with the environment, and hopefully, through that interaction, the agent learns a set of useful skills that help it solve downstream tasks more efficiently. The key motivation of our work is that previous unsupervised skill discovery methods can learn very useful skills, but when they try to repurpose those skills to solve downstream tasks, they usually run into difficulties: the high-level policy has a very hard time learning which skill to select in a particular state.

Speaker 1:

The way our paper tackles this problem is that we manually inject structure into the learned skill space, so that each part of the skill space only affects a part of the state space. Consider the example of driving a car. The skill space our method learns is one where the first dimension of the skill only affects the velocity of the car, and the second dimension only affects the orientation of the car. Importantly, what we find is that by injecting this structure into the skill space, downstream task learning proceeds much more smoothly and achieves much better performance than when the skill space is entangled.
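
A minimal sketch of how this kind of per-dimension structure can be encouraged in code, in the spirit of mutual-information-based skill discovery (e.g. DIAYN) with one discriminator per state factor. The objective, shapes, and names below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: give each skill dimension its own intrinsic reward,
# computed from a discriminator that predicts that skill component from its
# assigned state factor alone. Names, shapes, and the exact objective are
# assumptions for illustration, not the paper's formulation.
import torch
import torch.nn as nn

NUM_FACTORS = 2          # e.g. factor 0 = velocity, factor 1 = orientation
SKILLS_PER_FACTOR = 8    # discrete skill values per skill dimension

# One small discriminator per factor: q_i(z_i | s_i)
discriminators = nn.ModuleList(
    nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, SKILLS_PER_FACTOR))
    for _ in range(NUM_FACTORS)
)

def intrinsic_rewards(state_factors, skill):
    """state_factors: (batch, NUM_FACTORS) floats; skill: (batch, NUM_FACTORS) int indices."""
    rewards = []
    for i, disc in enumerate(discriminators):
        logits = disc(state_factors[:, i:i + 1])
        log_q = torch.log_softmax(logits, dim=-1)
        # Reward skill dimension i only for being recoverable from factor i.
        rewards.append(log_q.gather(1, skill[:, i:i + 1]))
    return torch.cat(rewards, dim=1)  # (batch, NUM_FACTORS)
```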

Speaker 2:

How do you tell your system about the different dimensions that you care about? Like, how do you inject that knowledge?

Speaker 1:

Right. So an important assumption we're making in this work is that we're given a factored state space. Think of this as a dictionary observation, where, let's say, the first entry is the velocity of my car and the second entry is the lighting condition around my car. You don't need to know what each of the factors actually corresponds to, but you do need this kind of chunked state space, which our method builds upon.
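
For concreteness, here is a minimal sketch of the kind of factored, dictionary-style observation being assumed, written as a Gymnasium Dict space. The keys and ranges are made up for the car example; only the factorization matters, not the semantics.

```python
# Hypothetical factored observation for the car example.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "velocity": spaces.Box(low=0.0, high=60.0, shape=(1,), dtype=np.float32),
    "orientation": spaces.Box(low=-np.pi, high=np.pi, shape=(1,), dtype=np.float32),
    "lighting": spaces.Discrete(3),  # e.g. day / dusk / night
})

obs = observation_space.sample()  # {"velocity": ..., "orientation": ..., "lighting": ...}
```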

Speaker 3:

Hello. I'm Skander, a PhD student at EPFL in the CLAIRE lab. Today I'm presenting No Representation, No Trust. Our work shows that PPO suffers from deteriorating representations that break its trust region.

Speaker 3:

It's known that non-stationarity causes plasticity loss in networks. In this work, we generalize that to PPO, and we ask a very important question: why does PPO, despite its trust region, still collapse? We show that non-stationarity causes representations to deteriorate and eventually collapse, and this makes the trust region useless, because the trust region is applied to the output. When the representations collapse, the output no longer distinguishes between states.

Speaker 3:

What we show is that if you intervene to regularize the representations, for example with PFO, a method we present that adds a trust region on the representations, you can actually maintain good representations. That in turn makes the trust region on the policy more effective and mitigates performance collapse.
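
A rough sketch of the general idea: keep PPO's clipped surrogate (the trust region on the policy output) and additionally penalize how far the current representation drifts from the one recorded at data-collection time. The specific penalty and the `policy.encoder` / `policy.head` attributes below are illustrative assumptions, not necessarily PFO's exact loss.

```python
# Illustrative only: PPO's clipped surrogate plus a penalty on representation
# drift. `policy.encoder` and `policy.head` are hypothetical attributes.
import torch

def ppo_plus_feature_loss(policy, obs, actions, advantages, old_log_probs,
                          old_features, clip_eps=0.2, feature_coef=1.0):
    features = policy.encoder(obs)                        # current representation
    new_log_probs = policy.head(features).log_prob(actions)

    # Standard PPO clipped surrogate on the policy output.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Additional "trust region"-like penalty on the representation itself.
    feature_loss = (features - old_features).pow(2).mean()

    return policy_loss + feature_coef * feature_loss
```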

Speaker 4:

Hi. My name is Adil Zouitine, and our work is Time-Constrained Robust MDPs. The goal of time-constrained robust MDPs is to address the over-conservatism of robust MDPs, and the idea is to add a new constraint: a time constraint.

Speaker 4:

So the variation of the dynamics between two consecutive time steps is bounded. We propose adding this time constraint on top of robust reinforcement learning algorithms, and in our evaluation we showed that we keep the same static worst-case performance while the average-case performance is way, way better. So we get the best of both: a robust worst-case solution and better average performance. The paper also comes with theoretical properties.
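
One way to write the constraint being described, with illustrative notation: the adversarial dynamics still live in an uncertainty set, but they can only change by a bounded amount between consecutive time steps.

```latex
% Illustrative notation, not the paper's exact formulation: the dynamics chosen
% at each step stay in the uncertainty set \mathcal{P}, and their variation
% between consecutive time steps is bounded by \epsilon under some metric d.
P_t \in \mathcal{P} \quad \forall t,
\qquad d\!\left(P_{t+1}, P_t\right) \le \epsilon \quad \forall t
```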

Speaker 4:

The solution found for the time-constrained robust MDP problem still captures the optimal solution of the robust reinforcement learning problem.

Speaker 5:

I'm Soumyendu Sarkar from Hewlett Packard Labs at Hewlett Packard Enterprise. We are the largest supercomputing company. What we're trying to solve is that, with AI and the huge amount of compute, power consumption has gone through the roof. Our paper is on SustainDC.

Speaker 5:

It's a benchmark for sustainable data center control, focused on large data centers. We are optimizing the cooling, the scheduling of loads, and the energy storage so that supercomputing consumes less energy and has a smaller carbon footprint, lower cost, and lower water usage. What we're doing here is open-sourcing a platform so that others can bring in their own models and evaluate their own algorithms, for the betterment of the world. The seed paper for this won the best paper award at the Climate Change AI workshop last year in Europe.

Speaker 5:

This is the benchmarking paper that democratizes this access for everyone and also helps others build digital twins, like the one for Frontier, which until recently was the fastest supercomputer in the world. We are building the Frontier digital twin: Oak Ridge is leading the work, and we are helping them in the process. Similarly, we are bringing sustainability to the entire supercomputing domain with this open platform.

Speaker 2:

Can you say more about what level of fidelity you are simulating the data center at? What kind of assumptions are there? Or is that modular? How does that work?

Speaker 5:

It's a fantastic question. This is a capacity model. What the capacity model does is work at the higher-level control, which brings in higher-level optimizations. But, as you can see, we have an extendable architecture where we can go down into the schedulers and work at the scheduler level to optimize the workloads. So it's a multilayer architecture.
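
A purely hypothetical sketch of what such a multilayer loop can look like: a high-level capacity controller chooses cooling, storage, and load-shifting targets, and a lower-level scheduler places individual jobs within them. None of the class or method names below come from SustainDC's actual API.

```python
# Hypothetical two-layer control loop; all names are invented for illustration.
class CapacityController:
    def act(self, telemetry):
        # High level: choose a cooling setpoint and how much load to defer.
        return {"cooling_setpoint_c": 24.0, "deferred_load_frac": 0.2}

class WorkloadScheduler:
    def schedule(self, jobs, targets):
        # Lower level: order jobs so aggregate power stays within the targets.
        return sorted(jobs, key=lambda j: j["deadline"])

def control_step(datacenter, controller, scheduler):
    telemetry = datacenter.read_telemetry()
    targets = controller.act(telemetry)
    plan = scheduler.schedule(datacenter.pending_jobs(), targets)
    datacenter.apply(targets, plan)
```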

Speaker 6:

My name is Matteo Bettini. I'm a PhD student at the University of Cambridge in the Prorok Lab, and today I'm presenting a work on benchmarking multi-agent reinforcement learning. This work was done in collaboration with Meta and PyTorch, as part of an internship I did on TorchRL. The goal of this work is basically to remove some of the fragmentation in the multi-agent reinforcement learning domain, where there existed many libraries with different focuses on specific problems.

Speaker 6:

Our goal is to unify all these architectures, algorithms, and tasks into one single library. We want to enable users to mix and match different tasks, models, and algorithms according to their choices, with as much flexibility as they want. We have a standardized configuration interface and standardized plotting, which is based on a NeurIPS paper. This should allow users to choose any component they like and have the results reported automatically. As part of this effort, we're also publishing some of the benchmarks on Weights & Biases, so that you can compare your own research to our results.
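
To illustrate the mix-and-match idea only: tasks, algorithms, and models kept in independent registries, with an experiment being one choice from each under a standardized configuration. All names below are invented for illustration and are not the library's actual interface.

```python
# Hypothetical registries and config; not the library's real API.
TASKS = {"balance": "multi-robot balance task"}
ALGORITHMS = {"mappo": "multi-agent PPO"}
MODELS = {"mlp": "shared MLP policy", "gnn": "graph neural network policy"}

config = {"task": "balance", "algorithm": "mappo", "model": "gnn", "seed": 0}

def describe_experiment(cfg):
    # In a real runner this would build and train the chosen combination.
    return (f"Running {ALGORITHMS[cfg['algorithm']]} with a "
            f"{MODELS[cfg['model']]} on the {TASKS[cfg['task']]} "
            f"(seed {cfg['seed']}).")

print(describe_experiment(config))
```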

Speaker 7:

Hi. I'm Michael Bowling. I'm a professor at the University of Alberta. I'm presenting work today on Beyond Optimism: Exploration with Partially Observable Rewards. This work is built on a framework called monitored MDPs, which builds on the MDPs used in reinforcement learning.

Speaker 7:

But we're particularly looking at situations where the reward is not always available to the agent. It still matters, though: that reward still matters, and the agent still needs to accumulate that unobserved reward. That's its objective.

Speaker 7:

But it doesn't always get to see it. So the real challenge here is: how do we actually learn in these scenarios? And more importantly, what this paper is about is how you actually explore in these scenarios. The typical way of doing exploration in reinforcement learning is to use optimism. The idea is that you take actions that you are more uncertain about, so that you either find out that they're really good or you discover how the world actually works under those actions.

Speaker 7:

And that can cause you to explore. The problem when you apply that inside monitored MDPs, our framework, is that you might never learn anything about an action even if it's not the right thing to do, because you might never see the rewards when you take it. So we present a different exploration algorithm, one not based on optimism, since optimism simply didn't work in these scenarios where the rewards are not always observable.
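
A minimal sketch of the setting, assuming a Gymnasium-style environment: the true reward always accrues toward the objective, but the agent only observes it part of the time (here via an arbitrary made-up rule). The wrapper is illustrative, not the paper's formalism.

```python
# Illustrative reward-hiding wrapper; the observation rule is made up.
import random
import gymnasium as gym

class HiddenRewardWrapper(gym.Wrapper):
    def __init__(self, env, observe_prob=0.3, seed=0):
        super().__init__(env)
        self.observe_prob = observe_prob
        self.rng = random.Random(seed)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info["true_reward"] = reward               # still counts toward the return
        if self.rng.random() > self.observe_prob:
            reward = None                          # ...but is not always observed
        return obs, reward, terminated, truncated, info
```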

Speaker 2:

Does this kind of setting apply to any real-world scenarios? Or is it more theoretical?

Speaker 7:

No. I would say it is in fact trying to move RL toward really reasoning about how we deploy our algorithms. Take even the most standard version of how we think about RL: we do some training phase, then we fix that policy, and then we deploy it in the real world. That's actually a monitored MDP.

Speaker 7:

The idea is that I see rewards in that initial phase, and then when I get to deployment, I don't see rewards anymore. So this is trying to capture the true way our algorithms are actually deployed. There are other variants of this. For example, monitored MDPs can cover situations where part of your environment has a lot of instrumentation, so you can learn how the reward works there, but the rest of the environment doesn't have as much instrumentation, so you don't always see the reward.

Speaker 7:

And you can now have agents that can learn in this sort of playground space where they understand what's going on and apply that throughout the rest of the world.

Creators and Guests

Robin Ranjit Singh Chauhan
Host