Abhishek Naik on Continuing RL & Average Reward
TalkRL.
Speaker 2:TalkRL podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at TalkRL podcast. I'm your host, Robin Chauhan. Today, I couldn't be more pleased to be joined by Abhishek Naik. Abhishek was a student at the University of Alberta and the Alberta Machine Intelligence Institute, and he just recently finished his PhD in reinforcement learning there, working under Rich Sutton.
Speaker 2:Now, he is a postdoc fellow at the National Research Council of Canada, where he does AI research on space applications. Abhishek, thank you so much for joining us.
Speaker 1:I'm very happy to be here. I've been following your podcast for quite a while, I think ever since you've been coming to our Tea Time Talks, and so I'm very, very happy and honored to be part of the discussion.
Speaker 2:How do you describe your focus area?
Speaker 1:Yeah. So as my PhD dissertation title states, it's Reinforcement Learning in the Continuing Setting Using Average Reward. I think we can, yeah, talk about what both of those things are and why they matter, what I've done, and, yeah, we can get right into it.
Speaker 2:What is average reward RL?
Speaker 1:So average reward is a term that describes the mathematical formulation of the problem, right, and potentially also describes the solution methods or algorithms for it. To understand it, maybe you have to take a step back and understand what continuing is, because that describes the problems. Right? And so, in this case, a continuing problem is one in which there is a single stream of experience between the agent and the environment, right. So there are no timeouts, no terminations, life goes on forever.
Speaker 1:So we are in Canada, and we are very fortunate to have access to Compute Canada, which is this set of compute clusters connected across Canada, across universities, and everybody can use them. Right? So I don't have to be part of a university with a large number of resources. I can just use all of the resources that Canada has, which is great. But the thing is, if students from all over Canada are submitting requests for their jobs on the servers, the servers are limited in number.
Speaker 1:Right? And so they have to make a decision at each point in time about which requests to allocate right away and which ones have to wait. And this will depend on, like, the priority of the requests. Like, maybe students have lower priority than postdocs, who perhaps have lower priority than profs, and whatnot. Right?
Speaker 1:And there's a limited bandwidth that can be accessed. And so there is this decision-making problem there, and it just goes on forever. Right? Like, I was in India for a chunk of my PhD and I thought I would be submitting jobs at a time when people are sleeping in Canada, but, no, the server would still be crammed with a bunch of requests. Sure, a little less than in the daytime, but still it was quite crammed.
Speaker 1:Right? So that's a continuing problem in which decisions have to be made basically for as long as the servers are online, or the job manager is alive and has not been taken down for maintenance or something. Right? And same thing with, like, servers all over the world. Right?
Speaker 1:Like, imagine a Google server. Right? It is being bombarded with millions or billions of requests the whole time, and it has to make these constant decisions as to which ones to get through first, depending on some quality-of-service agreements with partners and whatnot. And maybe a more tangible example that we can see could be autonomous taxis or something. Right?
Speaker 1:In the near future, we'll be relying a lot on these self-driving Ubers or Waymos, and so what happens once they drop you off? Right? There's no driver in there who might go to a place in the general area where they want to have food, get their food, but also pick up someone who is leaving from that place to go home or something. Instead, these cars will just keep making decisions as to, oh, maybe I should go to this part of town now because there is a hockey game happening.
Speaker 1:So there might be a bunch of people leaving from there soon, and so I should go there, or things like that. There's a, I don't know, Taylor Swift concert happening or whatever. One example that I like to give, as the starting slide of my PhD defense, would be Mars rovers. Right? One of humanity's biggest ambitions is to terraform Mars.
Speaker 1:And right now we have a $1.5 billion rover there, teleoperated to a large extent by an army of 200 scientists and technicians on Earth. And that does not scale. Right? To terraform Mars, we'll need to have thousands of rovers out there. And because there are communication delays of 4 to 20 minutes one way to Mars, they need to be operating autonomously, and throughout their lifetime.
Speaker 1:And this is literally new terrain that no one has ever seen before. Right? These are the first things that are out there. And so it's not like a city like San Francisco, where you can send lidars mounted on cars to map every street and every nook and corner and then use those maps to train your algorithms. No, you can't really do that.
Speaker 1:These are the first things there. They have to learn and adapt to the surroundings that they find themselves in. Right? And so let's say one day a rover goes out, it bangs one of its sensors on a rock, and the sensor gets bent. Now the rover has to adapt to this bent sensor, to the data stream coming through this bent sensor, imagine, like, a camera, and learn to do things.
Speaker 1:Right? And so they just have to do this throughout their lifetime. There is no reset. If they die, they die. Right?
Speaker 1:Just like humans and other animals. Right? And, yeah, all of this is in contrast to episodic problems, right, which we are a lot more familiar with, and chess is a prototypical example. Each game starts afresh. The consequences of moves are restricted to a single game, right: you lose your queen on the 5th move and then you get checkmated by the 20th move, but none of that carries over to the next game.
Speaker 1:Right? Everything starts afresh. But with continuing problems, there is no such clean separation. The consequences of your decisions could last throughout your lifetime, right? For example, I decided a few years ago to pursue a PhD instead of being, like, a hockey player or something, and I'm still bearing the consequences of that.
Speaker 1:Right. And I will throughout my life. Anyway, all of this to say: I hope that clarifies the main differences between continuing problems and episodic problems, with some examples of continuing problems to wrap your head around what these are and what challenges they might give us.
Speaker 2:Okay. So just let me play devil's advocate for a second. So if I understand correctly, this continuing RL is undiscounted. Right? It doesn't have that gamma factor that we're used to seeing, which discounts future rewards depending on how far away in the future they are.
Speaker 2:Is that is that correct?
Speaker 1:Yes. So I'm glad you asked because so far, I've just been describing a problem. Right? Like, these are problems that you want to solve. Now how you solve them can differ.
Speaker 1:Right? Now, how you solve it is the formulation, right, which is what I was referring to. Like I said, average reward is a formulation to solve this problem, just like the discounted formulation is a way to mathematically formulate continuing problems and solve them. Right? So these are just two different ways, and then we can talk about the pros and cons.
Speaker 2:Okay. But going back to that Mars rover case, like, could you not say that each Mars rover is an episode? Like, the first rover is an episode, and then the second rover that lands is a separate episode, and they're just long episodes. And then secondly, if you're playing cart pole, you could just let it play on forever, but it could still be using that traditional formulation. So could endless cart pole be framed as a discounted, episodic problem? Like, you train for the episodic case, and then you just extend it, and that's continuing also.
Speaker 2:Can we zoom in on that? Why is that so different? Okay.
Speaker 1:Why doesn't that work? Well well, we haven't yet said that it doesn't work. But, yeah, let's get let's dive into it. Right? So what you're asking is, can I just string together a bunch of episodes, and is that not, like, a stream of experience?
Speaker 1:And it totally is. Right? It from the point of view of the agent, it is just interacting with the world, seeing some numbers, and giving some output numbers as actuator output. Right? And that's all there is.
Speaker 1:Right? So imagine, like, when in psychology, cognitive science, like, they do experiments with pigeons or rats. Right? And they use this notion of trials that, okay, now I put the rat in the maze and it goes to the end of the maze. That is trial 1.
Speaker 1:Now I'll pick up the rat, put it in the cage for a while, maybe give it some juice, some reward, something, and then, okay, now I'm going to start over and do the same thing again, right, which is my second trial. So in RL we call these episodes. Right? But if you think of it from the point of view of the agent, in this case the rat, it is seeing some juice at some point, then it is seeing some walls around it, and then it is navigating those walls.
Speaker 1:And then it finds a piece of, I don't know, more juice or cheese or whatever it is. Then it sees it is in something where there are some wires around it. And then in a few more seconds, it's back to seeing walls around it. Right? So from the point of view of the rat, this is just an uninterrupted stream of experience.
Speaker 1:Right? We, as the experiment designers, like to think of it in this abstraction of episodes or trials, where we are repeating a particular scientific experiment multiple times for reasons of generality, reproducibility, or whatnot. Right? But from the point of view of the agent, it is just an uninterrupted stream of experience. Now what happens with episodic problems is that we give the agent some extra information when an episode ends.
Speaker 1:Right? We say, okay, we tell the agent, in some mathematical formulations, that whatever you have done so far, all the consequences of it end here. Okay? Nothing that you did until now will affect what happens later. Why?
Speaker 1:Because I am telling you that, and I am telling you to set your bootstrapping to 0, right, do not bootstrap from the end of this episode into the next episode or something. So in the game of chess, say you are implementing AlphaGo or something, you would say gamma is 0 at this boundary, the episode boundary. Right?
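To make the "gamma is 0 at the episode boundary" idea concrete, here is a minimal tabular Q-learning sketch; the environment interface (a step method returning next state, reward, and a done flag) and all hyperparameter values are illustrative assumptions, not code discussed in the episode.

```python
import numpy as np

def q_learning_episodic(env, num_states, num_actions,
                        gamma=0.99, alpha=0.1, epsilon=0.1,
                        num_episodes=500, seed=0):
    """Tabular Q-learning that cuts bootstrapping at episode boundaries."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = (rng.integers(num_actions) if rng.random() < epsilon
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)
            # "Gamma is 0 at the boundary": when the episode ends, no value
            # is carried over from the post-termination state.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```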
Speaker 2:It's almost like another Markov assumption there or something like nothing matters before now.
Speaker 1:I wouldn't quite call it an assumption, but, yeah, it is definitely some extra information that we have, that we know about the problem: that no consequences go beyond this boundary, so do not do credit assignment across this boundary. This is something we know and we are telling the agent. Right? And in problems where we do know this, it's totally fine to give this information. Right?
Speaker 1:Like, I would like to think that, in general, agents can figure this out by themselves. Right? Like, if I was learning chess and nobody really told me, after a point I should realize that, hey, my queen dying in this game does not affect anything in the other game, except through what I've learned.
Speaker 2:The replay comes across the episodes. Right? Something is coming across. The replay buffer is coming across.
Speaker 1:Well, yes. Let's not think of it from the lens of a particular solution method. Right? Like, in our current suite of methods, we use perhaps some artificial neural networks, and to make them work we might use these target networks and experience replay buffers, and these are a set of tools that we use to implement particular algorithms.
Speaker 1:Right? But if we take a step back, on one side you have some algorithm that you use to solve a problem, and the question is what information goes into this black box of an algorithm. You put in experience at each point of time and you get an actuation at each point of time, and that is the input and output to this black box. Right? And how it is internally using it is a separate thing.
Speaker 1:So it will have some value function, which is a form of memory, right; it has learned that sacrificing a queen is a bad thing, so it will try to not do it in the next game. Now this value function could have been learned through some online learning without any buffer, or it could be learned with the help of a buffer; that could change. But the point is that the agent is learning something about the consequences of its decisions through this stream of experience, right. And perhaps there is some extra information in this stream at some points of time that says: you could figure this out, but I am telling you, nothing will carry over beyond this point. Right? So that is maybe just some extra information.
Speaker 1:But, otherwise, it's just the agent figuring things out, looking at some data stream. Right? Whatever the agent might be doing internally can differ. But if you zoom out, it's a black box: the agent is taking in experience and giving outputs to actions, actuators, or something, right? So, in that sense, it's the same whether you're solving a continuing problem or an episodic problem.
Speaker 1:You still have the same framework of input and output. The only thing that differs is, perhaps, some more information you give in the episodic case: I know that in this problem there is this very clean boundary beyond which consequences do not carry across, so credit assignment should also not carry across. So I'm telling the agent that. Right? Which is fine to do in problems where this is actually the case.
Speaker 1:But in many problems of interest, it's not very clear where this boundary would be. Right? So you were referring to the Mars example: if you send one rover there and it was there for 5 years, it is now decommissioned, and now we have a new rover there, then sure, if there are similar things that could have been learned across them, you can think of each as an episode. But at some point, we have to realize that all of these are abstractions for us to think clearly about the problem. Right? It is not necessarily the way in which the agent is seeing it.
Speaker 1:Right? And so, like, in the pendulum problem, for example, right, it's a classic control problem, and in the episodic case what we say is: oh, you start off with the pendulum hanging down; find a set of actuations such that you can bring it to this unstable equilibrium of pointing up. Right? And as soon as you can do it and maintain it for a few time steps, the episode is reset. Okay.
Speaker 1:So now a continuing version of this problem could be: you start from the bottom. You again try to bring it to the top. Now you brought it to the top, but maybe there is some exploration, maybe there is some wind, the pendulum falls over, and now you have to bring it back up. Right? And you can think of this as something you just keep doing forever.
Speaker 1:Right? And if your reward is highest when it's at the top position and stationary, the best policy would be one which can repeatedly bring it back up after it has fallen, whether for reasons of its own, like exploration, or external reasons like some noise, wind, whatever that is, right. So yeah, for some problems there is a clear way in which things translate. For example, a lot of MuJoCo problems like Hopper and Walker are essentially: do this thing as much as possible, right. In this case, walk as much as possible or swim as much as possible.
Speaker 1:And then we can say, okay, you've done it for long enough, I will stop you at 3,000 steps and then we will restart. Or you could just let it be, right, because that's essentially what the problem is, and that's the continuing version of the problem. If you add these terminations or resets in between, then you are making it an episodic problem, but it could just as well be a continuing problem. Right? And so, yeah.
Speaker 1:So some problems are better formulated as continuing problems. Others are well formulated as episodic problems and, yeah, depending on the problem you want to solve, you can make these decisions as to what it is best formulated as and, yeah, then try to solve it.
Speaker 2:And you mentioned in the Mars rover example, if a sensor gets damaged. Can we interpret that as, like, the MDP changing?
Speaker 1:Yeah. Yeah. I mean, sure. Again, the MDP is also an abstraction, and, yes, we can think of it as some of the transition dynamics of the MDP having changed. But then, yeah, one thing to note here is what happens within the lifetime of a single rover. You were mentioning across rovers, so let me zoom into, like, the lifetime of a single rover.
Speaker 1:Right? You might think that, let's say, there is some base station where it gets charged, like a Roomba or something. Right? So every day, or sol, as it is called on Mars, it can go for an expedition, find interesting rocks to zap, and then come back in the evening to get charged. Right?
Speaker 1:And do this over and over again. And you can think of this as, yeah, it sort of fits into this clear-boundary case that we have. Right? Like, these are different games of chess. You do something over time and then you reset.
Speaker 1:But then this sensor thing is different. Right? So on a particular day, or sol, the rover goes out. It hits one of its camera sensors on a rock, and now the sensor gets bent. Let's say it's leaning at 45 degrees instead of upright.
Speaker 1:And that day the rover comes back, gets charged. Now the next day it's not reset. Right? Like, it's not back to its initial condition, as it would be in chess. It still has this bent sensor.
Speaker 1:Right? And now, from this point on, it has to deal with a new sensorimotor experience of a bent sensor and figure out how to use that to make the decisions of which rocks to zap and whatnot. Right? And so, clearly, some action you've taken on the previous day is still affecting you today and tomorrow and the day after. Right?
Speaker 1:So that's where, maybe, it doesn't become as clear to think of it as an episodic problem, but more as: there's one lifetime, the agent is continually making decisions, and, yeah, it has to do that till the end of its lifetime.
Speaker 2:How did you get interested in in these topics?
Speaker 1:Yeah. So this was the first year of my PhD. I was taking this course that Rich offers called RL 2, which basically goes over part 2 of the book. And in part 2 of the book, there were a couple of very interesting sections. Right.
Speaker 1:Till this point, I had had some RL background from my masters, but my understanding was quite rudimentary; like, Q-learning was synonymous with RL to me. Right. And so there is this discount factor, which is baked in, and that is what my impression was. And then there is a section on average reward in that second part of the book. Okay, so this was new and we went through it in the course.
Speaker 1:And then there was a following section which said discounting should be deprecated, right. And I was like, wait, what? This was questioning the basic things that I thought I knew about RL. Right? And so I spent a lot of time reading about why Rich was making that argument.
Speaker 1:Right? Like, so two things: why discounting should be deprecated and why average reward is an alternative. And so I spent a lot of time on it; I was fortunate that Rich was my advisor, so I could talk to him beyond the class. Right? And so we had a lot of discussions, and, yeah, basically, after these long discussions over a long, long period of time, it convinced me that this was an important, intriguing, and underexplored topic. Three things which make it perfect for a PhD topic.
Speaker 1:Right? That it's interesting to you as a student. It's important because it can have a lot of implications, and it's underexplored. So there's a lot of things to choose and do, right? And so, yeah, that's essentially how I got started along this line of work about continuing problems, about average reward and and yeah, I like to think that the papers that I have written over my PhD are sort of an encapsulation of my learning journey.
Speaker 1:Like, it started from a NeurIPS workshop paper of me explaining Rich's argument, to myself and to the larger world, about why discounting doesn't make sense in this particular problem of AI where you're doing continuing control with function approximation. So it started there, then went to writing papers about new average-reward methods that address some limitations of the previous methods, and now, yeah, all the way to writing my dissertation, where it covers all of these different aspects: eligibility traces, multi-step methods, learning, planning, on-policy, off-policy, and all of that. Right? So yeah. It's been fun.
Speaker 1:It's been fun.
Speaker 2:And reward centering. Do you wanna talk about that?
Speaker 1:It's a culmination of a couple of lines of thinking. Right? So reward centering is essentially different from my main PhD topic, which was doing reinforcement learning in the average-reward formulation. Right? This work is about improving discounted reward methods using ideas from the average-reward setting.
Speaker 1:Right? And what is the idea? It's very simple. It is just: when you are solving continuing problems using discounted methods, learn and subtract the average reward. And that makes all of your methods much, much better.
Speaker 1:Right? And that is what the paper shows. So it shows this in, like, two different ways of improvement. The first one, which probably is of interest to a lot of your audience, would be that we have all faced this problem: when you are using a discounted RL method, you start with some discount factor and performance improves, but beyond a certain point, if you use too large a discount factor, things become very unstable. It takes forever to learn or doesn't learn at all.
Speaker 1:Right? But if you do reward centering, you can keep increasing your discount factor with an almost monotonic improvement in performance, right. Which is great. Like, I wish something like this had existed when I started, because it makes life much easier when trying to use RL on particular problems of your choice. Right?
Speaker 1:So that was a big deal. And, yeah, we can talk about what enables this. And then the other implication that comes out of centering your rewards like this is that the agent becomes agnostic to any shift in the rewards that might exist in the problem. Right? So we are now talking about agents which are living their whole lives, like perhaps a Roomba that will be in my house for, like, 10 years, and it will go around, it will go to new homes that I move into, and learn and adapt to all that.
Speaker 1:So it has to continually learn and adapt across its lifetime. Now, within that lifetime, the kind of reward signals that it sees could be very different, right. So in some parts of the world, the rewards are minus one, zero, one. In other parts of the world, they're around 10. In some parts of the world where there are more penalties, let's say, they are around minus 100.
Speaker 1:Right. And so there could be a difference in the kind of reward signals that you see across an agent's lifetime, and we want the agent to be robust to any such shift in the reward signals. Right? But that is not what we see with the standard algorithms; when we use Q-learning or Sarsa-style algorithms, they are very much affected by where the rewards are. But, yeah, we want them to be robust.
Speaker 1:And with reward centering, any shift in your rewards doesn't really matter; the agent is very robust to that. So, yeah, those are the two things that come out of this very, very simple idea, which is: just learn the average reward in your stream of experience and subtract it when you're making updates. That's all. And then you get these large, significant benefits. And we demonstrated this on quite a few problems with tabular, linear, and nonlinear function approximation.
Speaker 1:And yeah. So that was that was a fun paper to write.
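For readers who want to see the idea in code, here is a minimal sketch of tabular Q-learning with reward centering on a continuing problem, written from the description above rather than taken from the paper; the environment interface (a never-terminating step returning next state and reward) and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def centered_q_learning(env, num_states, num_actions,
                        gamma=0.99, alpha=0.1, eta=0.1,
                        epsilon=0.1, num_steps=100_000, seed=0):
    """Tabular Q-learning with reward centering on a continuing task."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    r_bar = 0.0                      # running estimate of the average reward
    s = env.reset()                  # called once; the stream never terminates
    for _ in range(num_steps):
        a = (rng.integers(num_actions) if rng.random() < epsilon
             else int(np.argmax(Q[s])))
        s_next, r = env.step(a)
        # TD error with the average-reward estimate subtracted from the reward.
        delta = (r - r_bar) + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * delta
        # TD-error-based update of the average-reward estimate; with gamma = 1
        # this whole procedure becomes differential Q-learning.
        r_bar += eta * alpha * delta
        s = s_next
    return Q, r_bar
```

The only changes relative to standard Q-learning are the `r - r_bar` term and the extra `r_bar` update; everything else, including replay buffers or function approximation in the deep setting, can stay as is.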
Speaker 2:I encountered your work on reward centering at the RL Conference in Amherst just a couple of months ago, and I was honestly blown away. It's so rare that you find something so elegant and so widely applicable and with such a big impact. So I think this is just an incredible achievement you have here with reward centering. I can't say enough about it. Very exciting stuff.
Speaker 1:Yeah. Thank you for your kind words.
Speaker 2:Oh, absolutely. I mean, I don't see why everyone wouldn't just straight-up adopt this in all the methods. As you pointed out, it applies to pretty much all the methods. Right? You can just add this.
Speaker 2:It can be combined with all the other great ideas out there.
Speaker 1:Yeah. Yeah. So it's not taking anything away, and it's just adding. So it's just like this module that you can drop into any algorithm that you're using to solve continuing problems. And, yeah, it will it will work out great.
Speaker 1:Everyone should try it.
Speaker 2:Absolutely. Super rare to find something this fundamental at this stage of the game, I would say. So congrats on that. And is there a connection between reward centering and the average reward? How are they connected?
Speaker 1:Reward centering is sort of a neat connection between the discounted reward methods and the average-reward methods. So how does that happen? Let's say you have your standard discounted method; I'll take Q-learning as an example. So you have standard Q-learning on one side.
Speaker 1:Now you add this idea of reward centering to Q-learning. Right. So now you're computing and subtracting an average reward. Now, because you're using standard Q-learning, you have some discount factor which you are setting, right, and since it's a continuing problem, you can set it all the way from 0 to 1, inclusive. And it turns out if you set it exactly to 1, then the algorithm that you end up with is exactly the average-reward algorithm that I proposed earlier in my PhD.
Speaker 1:Right? So, in the algorithmic implementation sense, centering sort of creates this bridge between discounted and average-reward methods through this factor gamma, right? For any gamma less than 1, it is Q-learning with reward centering. If gamma is 1, then it is our average-reward Q-learning method, which we call differential Q-learning, from that 2021 ICML paper. So it's sort of a bridge between these discounted and average-reward methods, and, yeah, the interesting thing that we saw was that as you increase the discount factor, performance kept increasing in the problems that we tested it on. I did not say this out loud in the paper, but it sort of begs the question: well, if you can keep increasing gamma with monotonic improvement, then why not just set it to 1?
Speaker 1:Right? Then why even use or tune the discount factor as a parameter if, in a lot of problems, you are seeing that if you just set it to 1, you don't get any loss of performance? In fact, it is as good as or better than what you have seen so far with 0.9 or smaller discount factors. But then, yeah, that's where the connection between the two ideas is, right? Reward centering is bringing an idea from the average-reward formulation, which is to compute and subtract the average reward, to the discounted reward formulation.
Speaker 1:But if you take the idea to its extreme and set gamma to 1, then it essentially becomes the average-reward algorithm all over again.
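In symbols, a rough rendering of the bridge he is describing (notation and step-size conventions may differ slightly from the papers): Q-learning with reward centering uses the TD error

$$\delta_t = R_{t+1} - \bar R_t + \gamma \max_{a'} Q_t(S_{t+1}, a') - Q_t(S_t, A_t),$$

with updates $Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha_t \delta_t$ and $\bar R_{t+1} = \bar R_t + \eta\, \alpha_t \delta_t$. Setting $\gamma = 1$ gives exactly the differential Q-learning update he mentions.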
Speaker 2:Now I see that with average reward, you added a new parameter, eta. Can you tell us about eta? What does it do, and why is it there?
Speaker 1:Eta is just a step-size parameter that you use to learn the average reward. Right? So let's say alpha is your step size for learning the values, say in Q-learning.
Speaker 1:There are two things here. Imagine we are having this conversation in 2022. There is no reward centering yet, and so far it's only differential Q-learning, which is the average-reward version of Q-learning. Right? It is learning the values and it is learning this average reward. And to learn the average reward, you basically have two step sizes, right?
Speaker 1:You can consider an alpha for the values and a second step size for the average reward, and that's where eta comes in. And the same thing also applies when you are doing reward centering. In reward centering, you are learning the discounted values, not the average-reward values, but you are also learning this extra average-reward parameter. Right?
Speaker 1:So when learning that, again, you need a step-size parameter to do that, and that's what eta is.
Speaker 2:One thing I found very confusing when thinking about average reward is that it seemed to me like the average reward is not really affected by the immediate reward, in a sense, at all, because the denominator is all of time. So if I get a reward of 1 in this time step, how does that affect the average reward? Because the average reward is over infinite time. So is that a paradox, or is that just me being overly simplistic in thinking about it?
Speaker 1:Yeah. So there is no paradox, because you are the agent: at this point of time you have seen some rewards, you have some notion of your average reward, and then you see some new reward. Right?
Speaker 1:So now you use that to update your estimate of the average reward. Right? And depending on what your step size is, it could be a big jump or a small jump or whatever, but it will get affected by the immediate rewards that you see. Right? Because you are updating the estimate at each point of time.
Speaker 2:Because you're using TD learning. Is that why?
Speaker 1:No. I mean, even
Speaker 2:You're using TD learning to learn it. Right? The average?
Speaker 1:So even if you are not doing that, right, even if you are just keeping a running average of your rewards, the running average would be affected by each reward that you see at the current time. Right?
Speaker 2:Right. But in a continual setting, it could be an absolutely microscopic effect on the average reward, which is unlike the discounted case. Is that right?
Speaker 1:So I wouldn't really say that. Okay, so let me take a step back. Within the average-reward formulation, there are two things you are learning.
Speaker 1:You are learning the average reward over a long time, and you are also learning the differential value function. Right? Which is the value function equivalent in the Average Reward case. So the average reward is supposed to capture what happens in the long term. Right?
Speaker 1:And what happens in the short term is captured by the differential value function. Right? So you are not, you don't, you have all of that information to make locally optimal decisions through the value function and the ability to make decisions, which are best for the long term, come through the average reward. Right? You the objective is literally to maximize the average reward, and while doing that, you make locally optimal decisions by using your value function.
Speaker 2:Yeah. Can you say more about, about that differential value function? What is it telling you? If it's if it's above 0, if it's below 0, what is it really saying?
Speaker 1:In plain English, the differential value function answers the following question: how much more reward do you get than average when starting from this state and following this policy pi? That's the state-value version, right. So this captures what happens in the short term, right, how much more you get than average, because after a while you'll be doing about as well as average. Right. That is what average means.
Speaker 1:Right. Like, in the long term, how much reward you are going to get is what the average-reward term captures. Right? So the differential value function is saying how much more reward you get than average when you start from this state and follow this policy. And so from some states, you'll get more reward than average if you start there.
Speaker 1:From other states, you'll get less reward than average. So that's positive versus negative, and in the long term, you get the average. Right? So that's why
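For reference, a rough rendering of the two quantities he is describing, in notation close to Sutton and Barto's textbook: the average reward of a policy $\pi$ and its differential state-value function are

$$r(\pi) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}\!\left[ R_t \mid A_{0:t-1} \sim \pi \right], \qquad \tilde v_\pi(s) = \sum_{t=0}^{\infty} \mathbb{E}\!\left[ R_{t+1} - r(\pi) \mid S_0 = s,\, A_{0:\infty} \sim \pi \right],$$

so $\tilde v_\pi(s) > 0$ means starting from $s$ yields more reward than average in the short run, and $\tilde v_\pi(s) < 0$ means less.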
Speaker 2:Okay. And now, in your dissertation, these two phrases came up: average cost TD and differential TD. Can you briefly talk about what they are and how they are different? At first glance, I couldn't quite understand the difference.
Speaker 1:Right. So average cost TD is an average-reward learning algorithm proposed by Tsitsiklis and Van Roy back in 1999. They called it that, which is why I call it by the same name in my dissertation. So that's average cost TD, and then the new average-reward algorithms in my dissertation, we call them the differential family of algorithms. Right?
Speaker 1:So for the prediction, state-value case, that is differential TD learning. And the difference between Tsitsiklis and Van Roy's average cost TD learning and our differential TD learning is the way the average reward is learned. The way the average reward is learned in average cost TD is by simply keeping a running average of the rewards that you see. Right? Which is the most obvious thing; like, that's the first thing you would do.
Speaker 1:Right. Like, if you want to compute the average of the rewards in your data stream, you just maintain a running average. Right. Beautiful. It works.
Speaker 1:But what if you want to do things off-policy? Right? If you are just doing everything on-policy, which means you are behaving exactly according to the policy that you want to evaluate, the running average of the rewards will be exactly the average reward of the policy you want to evaluate. But if you are learning off-policy, which means you are, let us say, following a more exploratory behavior policy, then if you just average the rewards that you see, it might not be the average reward of the target policy that you want to learn about. Right? Maybe that's the optimal policy, or whatever it is.
Speaker 1:Right? So what do you do if you are behaving off-policy? And our answer to that question was to learn the average reward also using the TD error, just like we learn the values. Right? And so instead of just keeping a running average of your rewards, you keep, like, a running average of your TD error, so to speak, and we show that that is an unbiased estimator of the average reward of the target policy, which is exactly what you want if you're doing off-policy learning.
Speaker 1:Right? So that was the difference between Tsitsiklis and Van Roy's average cost TD learning and our differential TD learning.
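A rough sketch of the distinction in update-rule form (the exact step-size conventions in the papers may differ): average cost TD keeps a running average of the rewards, while differential TD updates the average-reward estimate with the TD error itself,

$$\text{average cost TD:}\quad \bar R_{t+1} = \bar R_t + \eta_t \left( R_{t+1} - \bar R_t \right),$$
$$\text{differential TD:}\quad \delta_t = R_{t+1} - \bar R_t + \hat v(S_{t+1}) - \hat v(S_t), \qquad \bar R_{t+1} = \bar R_t + \eta\, \alpha_t \delta_t,$$

with the value estimate itself updated as $\hat v(S_t) \mathrel{+}= \alpha_t \delta_t$ in both cases.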
Speaker 2:And then you write that the billion-dollar question is: for continuing problems, should we use average-reward algorithms or discounted reward algorithms? A billion dollars at least. Right? I mean, if you think of the space economy. So how do you think about that question?
Speaker 1:Okay. So discounted RL is not well defined theoretically when you are doing large-scale continuing problems, right, like continuing control with function approximation, but in practice we can still try to use the discount factor as a solution parameter. Right? In your algorithms, you can have a discount factor gamma, you can set it to whatever and try to solve the problem. And then what we saw is that reward centering makes standard discounted algorithms work much better by computing and subtracting the average reward, and you have close to monotonic improvement all the way till 1. Right?
Speaker 1:So why not just use 1 directly? Right? Set your discount factor to 1 and use that directly, which is sort of pointing to: oh, just use average-reward algorithms. Right? But then, yeah, I think the ideas from discounting are still useful.
Speaker 1:Right? Like, okay, if I had to say one thing, I would say: do not use discounting as part of the problem. You can use it as a solution parameter, and perhaps that can make learning a little bit faster, right? So imagine a world, imagine an algorithm, where you can adapt your discount factor over time. Let's say you are just starting to learn, you don't really know anything about the world, there's a lot of uncertainty; you can have a smaller discount factor, learn something quickly, and over time, as you get more confident about your predictions of the world around you, you can increase your discount factor, and it can go all the way to 1, thanks to reward centering.
Speaker 1:And then, again, if you are in parts of the space that you are uncertain about, maybe you can reduce it a bit to learn quickly about that part. So why do I say learn quickly? I mean, imagine if your discount factor is 0. Right? You will learn very quickly what to do to make one-step-optimal decisions.
Speaker 2:The greedy thing. And you're saying this because gamma is often given as part of the problem description in a classical sense. Right?
Speaker 1:I mean, whether it's given or not, I think there's a lot of inclination to think that it is. Right? But all I am saying is, yeah, think of it as a part of your solution toolkit, and whatever you can set it to and
Speaker 2:Yeah. But you could change it.
Speaker 1:And then I am saying, yeah, then you can change it, and the idea of wanting to change it is not new. Right? Like, people have wanted to do this. It's just that in our experience, if you increase it to close to 1, things just don't work. Mhmm.
Speaker 1:But now that is very much possible due to reward centering. And here's what reward centering also enables: let's say you have some discount factor gamma 1 at this point. Right? And you learn something according to it, and you are thinking, oh, what if it was gamma 2? Right?
Speaker 1:If you are doing standard Q-learning, you would sort of have to learn the predictions for this new discount factor gamma 2 all over again. But since reward centering lets you separate the discounted value function into this differential part and the average-reward part, you can quickly bootstrap your predictions for this gamma 2 and then make them better by interacting with the world, which would be a big speedup over trying to do it in the more traditional way. Right? And, anyway, all of this to say that we can now imagine algorithms where what you want is to get as much reward as possible, which is what the average-reward formulation is, with gamma equal to 1. But to learn quickly in some parts of the world, you can set your discount factor lower and then adapt your discount factor over time very quickly through reward centering.
Speaker 1:To understand this a little better, maybe we should discuss this one thing about reward centering: what is the basis of reward centering? Why does it work? It works because it decomposes the discounted value function, as I mentioned, into these two parts. Right. The average reward and the differential value function.
Speaker 1:This decomposition was shown back in the sixties by Blackwell. Right. And the decomposition is: the discounted value function is equal to the average reward divided by 1 minus gamma, plus the differential value function, plus some error terms. Okay. So I want you to take note of that first term, the constant term, the average reward divided by 1 minus gamma. Clearly, as gamma increases, this term is going to blow up.
Speaker 1:Right. It's the average reward divided by 1 minus gamma; as gamma gets closer to 1, this term becomes very large, which is what you see. Right? If your discount factor is too close to 1, your discounted value function is very large in magnitude. Reward centering is saying: learn just the average reward, without the 1-over-1-minus-gamma scaling, and the differential value function; learn those two things separately, and that captures enough information to make decisions without having this large scaling factor involved.
Speaker 1:Right. So you don't have to learn this large constant, which actually does not help in making decisions at all. Right. So let's say you are doing an argmax of your action-value function to make decisions. If there is a large constant within it, it doesn't really affect which action is the best one. Right?
Speaker 1:So let's say your Q(s, a) is some large constant plus the actual differences between the state-action pairs. The differences are what the argmax is going to consider. Right? Not the constant. And the differences are essentially what the differential value function is.
Speaker 1:The constant is the average reward divided by 1 minus gamma. And if your function approximation makes errors in learning this large constant, any small differences in the part where the state-action pairs actually differ can be masked by the approximation errors of this large offset. Right? It's literally an offset. You don't really care about it when making decisions.
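The decomposition he is referring to can be written roughly as

$$v_\pi^\gamma(s) \;=\; \frac{r(\pi)}{1-\gamma} \;+\; \tilde v_\pi(s) \;+\; e_\pi^\gamma(s),$$

where $r(\pi)$ is the average reward, $\tilde v_\pi$ is the differential value function, and the error term $e_\pi^\gamma$ vanishes as $\gamma \to 1$. The same form holds for action values, which is why the state-independent offset $r(\pi)/(1-\gamma)$ drops out of any argmax over actions.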
Speaker 2:Yeah. It's making it stable, robust.
Speaker 1:But it's important information. Right? The r of pi, the average reward divided by 1 minus gamma, has the long-term information of the average reward. So you still want to learn that. And that's what reward centering enables.
Speaker 1:It separates out these two components of the discounted value function, so you learn the average reward, which is how to behave in the long term, and the remaining value function, which we call the centered value function, which is how to behave in the short term, and together they capture all the information that is there. And now you can use all of your function approximation resources to just estimate the differences between the states. Right? What is more positive, what is more negative. All of these are now centered around 0.
Speaker 1:You can just use all your resources to learn this, as opposed to this plus a large constant, which could really mess up your approximations. Right. And so now, if you can learn these two things separately, you have these two quantities, and at any point of time that you ask me, I can give you my estimate of the discounted value function. Right?
Speaker 1:I have my average-reward estimate. I can divide it by 1 minus gamma, add the value function that I have learned, and that's an estimate of the discounted value function for that state or state-action pair. Right? But if you ask me for a discounted value function for a different discount factor, all I have to do is divide that average-reward quantity by 1 minus gamma 2 instead of 1 minus gamma 1. Right?
Speaker 1:And immediately I have an estimate of the discounted value function corresponding to this different discount factor. Right? Which is not what you would have in general, when all of these things are combined into one big thing. Right? So this is what reward centering lets you do, within this idea that maybe you can adapt your discount factor over time.
Speaker 1:With reward centering, you can now do it very quickly and easily. And perhaps now you can actually envision these types of algorithms being very sample efficient. So, yeah, your objective is still to do as well as possible in the long run, which is the total-reward or average-reward objective, and to enable doing this quickly, you can adapt your discount factor over time and maximize the average reward for the problems that you're interested in.
Speaker 2:So it's kinda like you're learning all the gammas at once, sort of. Because you can estimate them all, which you can't do when you've set a certain gamma in the discounted case. Is that right?
Speaker 1:You can estimate them all. For any gamma that you ask me, I can give you an estimate of the discounted value function, and we can fine-tune it with data. Right? It's not the best approximation you would have had if you had started off learning with that discount factor. Right?
Speaker 1:But it is way better than if you had learned with something else: in the standard case, you learn with some gamma 1, and if I ask you what the estimate for gamma 2 is, well, you really have no idea. But if you're doing reward centering, I can immediately give you an answer, and it is a pretty good answer, which you can make better with more data. So I wouldn't say you are learning all the discount factors together. You are learning for a particular discount factor, but I can give you an estimate for any other discount factor at any time, and I can improve that estimate quickly through data, a lot quicker than I would have been able to if I was not doing reward centering.
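Concretely, the quick re-estimate he describes for a new discount factor $\gamma_2$ is just (a sketch, using the learned average-reward estimate $\bar R$ and the learned centered value function $\tilde v$):

$$\hat v_{\gamma_2}(s) \;\approx\; \frac{\bar R}{1-\gamma_2} \;+\; \tilde v(s),$$

which can then be refined with further data collected while using $\gamma_2$.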
Speaker 2:I want to talk about your adviser, because you had a very special adviser for your PhD. You were supervised by the legendary professor Rich Sutton. And I don't think there's a bigger celebrity in the reinforcement learning world. Of course, he famously wrote Sutton and Barto, the canonical textbook introduction to reinforcement learning, and is the originator of countless ideas. Can you tell us about working with Rich?
Speaker 2:What is it like to work with professor Rich Sutton?
Speaker 1:Oh, it's wonderful. Yeah. As you said, it has been a great privilege to get the opportunity to work with him. Right? Because, yeah.
Speaker 1:He's the clearest thinker that I know. Like, he can very quickly find the flaws in your thinking, in whatever line or direction of ideation you're going through. He can very quickly point out jumps that you're making, any gaps that you're not seeing, and that is incredibly useful to, you know, iterate over ideas. And the best part is that despite being who he is, he is very kind and generous with his time. For most of my PhD, I met him every week. Right?
Speaker 1:And that is not what I expected when I started. Right? I was thinking he would barely have any time. Right? But no.
Speaker 1:He was very kind and generous with his time, even now, and he will take the time to listen to you and very graciously and patiently point out where things can be improved. Right? In other words, he can very easily expose all the flaws in your thinking, but he does it in a very gracious way. Right?
Speaker 1:So you don't really feel bad. And another good thing is that he can really get into the weeds with you, like, for some very particular algorithmic issue that you are running into; he can get down to that level with you and think things out, debug them. But at the same time, he can also let you be completely independent. Right? Like, he's not micromanaging anything.
Speaker 1:Like, if you don't want feedback from him for a couple of months, he's totally fine with that. Right? He doesn't want constant updates or something. And so basically, there's this whole spectrum of supervision that you can get from him, which has been very useful to me because, yeah, I like to take my time thinking about things. And I've also had this community that our lab has become, with so many PIs that are thinking about RL and so many of their students; you can also ideate very quickly by just being in the lab talking to these people.
Speaker 1:So it was a great experience working with him, and, yeah, I hope to keep doing that as much as I can, even now.
Speaker 2:Has working with him changed how you think or how you approach problems? Is there a way you could describe how it's changed your approach or or how some of his thinking has has rubbed off?
Speaker 1:So, yeah, someone asked me: what is one thing you've learned in your PhD? Right? And I thought about it, and my answer was that I learned how to think, from Rich. Right? And it's not just about research.
Speaker 1:I think it generalizes to life, in terms of how you think from first principles. Right? You can't just start somewhere. It's very useful to know what axioms you are building on, and which of them make sense more generally versus only for a very specific problem you're trying to solve. So you assume something and then you build on top of that.
Speaker 1:And, yeah, if you think about things from first principles, you basically know all the ways in which it makes sense and doesn't make sense, because you thought it through from the start. Right? You didn't just pick it up somewhere in the middle. And I think that's also obviously been very helpful in research. Right?
Speaker 1:Like, as you mentioned, a lot of my work has been about fundamental things, because I think there's a lot of thought we can still give to the fundamentals. It's not that everything is established and you can just think about the leaves of things and not have to bother about the roots. There are a lot of interesting questions even at the roots, and thinking about them deeply can give you a lot of rich insights about the leaves. Right? And, yeah, it's also rubbed off in my non-research life, this way of thinking where you start from first principles. Perhaps on some topic, someone is saying very new things and you're like, okay.
Speaker 1:I don't understand this at all. So can you help me take a step back? Let's think about it. Like, help me understand what this is. Right?
Speaker 1:And so it's very useful in thinking about my own research. When I go to talks, I can use that to think about what problems people are trying to solve, why they are important, what insight they have, and more generally about things in life. Right? Like, there are things that might seem like arbitrary rules. We are searching for apartments, and some leases have these arbitrary rules, and you would think, why does this exist, that makes no sense. But then if you take a step back, you think of it from the perspective of the building manager. Right?
Speaker 1:Or the builder who built this building. There are some things that you, as a buyer, don't generally think about. Right? And so when you start thinking from their perspective, a lot of things start to make sense. Okay, yes, it does sound unreasonable, but I can see why they would want this to exist.
Speaker 1:Right? So, yeah, I think that is one major way in which I see a difference in myself before and after my PhD, in how I think about things.
Speaker 2:I got a little sense of that today, Abhishek, when you were telling me I was mixing up the problem definitions with the solutions and things like that, so I really appreciate that, and I'm going to think about that more. And every time I listen to Rich speak, I'm struck by that clarity, and I'm sure a lot of people feel that way. Okay. So let's move on to things outside of your PhD work, which we could honestly talk about for hours, and we're just barely scratching the surface. But outside of that, can you tell us about other things you've seen happening in RL lately that you find interesting?
Speaker 1:There's so many fascinating things happening, especially in the generative AI space. Right? And so I I am mostly an observer there. I don't really work in that space. So but, yeah, there are many cool things happening.
Speaker 1:Right? And as someone who is slightly related to all of these fields, I'm amazed by the kinds of things ChatGPT or Gemini can do in terms of image generation and, like, helping explain certain topics to you. And, yeah, I've been using them for these things; they have their limits, but it's been quite wonderful. And I look forward to where this can go. Right?
Speaker 1:Like, some things that I wish existed, and I hope maybe someone listening to this would take it up: wouldn't it be great if there was some generative AI tool where you could give it an alternate ending to a movie and it could create that for you, in terms of visuals? Or, like, I love to read, and there are a lot of books where the premise is incredible, the way the characters were developed is incredible, and the story almost till the end was great. But then, at the very end, for whatever reason, there is a deus ex machina somewhere. Right? Like, a very convenient end to the story out of nowhere.
Speaker 1:Right? And so then you feel frustrated. Like, I wish the ending was something else. Right? Not this.
Speaker 1:And then you have some ideas for it. And similarly for movies. Right? Sometimes you watch a movie and you're like, oh, man, it was great till the 90% mark, but then some weird things happened.
Speaker 1:Like why? I wish it was this way. So I wish there was a tool where I could propose alternate endings and it would just make it for me. Right. And then I can watch it and then that would be my satisfactory end to the book or the movie or the story or whatever that is.
Speaker 1:Right? So, yeah, generative AI is very cool, the kinds of things it can do, and I hope it there are such tools which I can use in the future.
Speaker 2:That would be fun. We could redo the Star Wars prequels. Make them good.
Speaker 1:Oh, well, yeah. I mean, yeah, beyond generative, I do. I wish, we could crowdsource these things a lot more, right, that there are a few books that I've seen which have taken this principle that an author will post a chapter on their webpage and then people will read it and then they will, they will show, I wish it was this way, right? And maybe it can, you can make it a little bit better if the story slightly took this direction instead of this, right? And these like micro suggestions that people have, that then the authors integrate where they see fit and then the sum of the sum is way better than the individual parts, right?
Speaker 1:So yeah, even crowdsourcing can achieve that to an extent, but I am not a crowd. I am one person, and I would like to do this for myself too while I'm sitting on my couch finishing a movie or a book. But I also love talking to people about these things. So where there is a crowd, I would love to crowdsource an ending, but when I'm doing this by myself, I wish such a tool existed. Right?
Speaker 1:It would be fun.
Speaker 2:Okay. But let's move on to space. I know that's a big interest of yours and a focus of yours. I understand you're working in the space industry now. Can you tell us more about that?
Speaker 1:Yeah. I specifically chose the position that I'm currently in because I could do AI research in the space industry, for space science and technology. Right? And because space has been my hobby for a long time. My favorite genre in reading is sci fi.
Speaker 1:It has been that way for many, many years, and I would watch all of these documentaries about the Apollo program or Mercury, Gemini, and all of these things. It was amazing to see all of these missions with hundreds of single points of failure, with incomprehensible complexity, and humans coming together to pull this off, right? It's just mind boggling, but I was also very envious that all of this happened during the space age, and I'm reading and listening to all of these cool things, but there's nothing like this for us. But then, over the last few years, with the Artemis program of going back to the moon, it has now kick-started this whole new fledgling space age, right, where there are tons of startups doing very cool things about mining on the moon, or building ways to collect space debris, because we are going to have more and more of it with more and more constellations like Starlink and whatnot. And, yeah, there are just so many interesting control problems to be solved here. And I thought it would be great if I could put my interest in AI and my interest in space together.
Speaker 1:And, yeah, I'm incredibly happy to say that I found such a position where I can do these things together. And, yeah, it has been great so far, and I look forward to all the things that I can do over the next few years.
Speaker 2:So where do you see yourself in the future? And what's kind of like a holy grail for you in this work?
Speaker 1:So there are many, many interesting areas of application. Right? But to me, space is fascinating, and so while there are many diverse applications, the one I am focusing my interest on right now is space. Right?
Speaker 1:And I would like to see how far I can go in terms of bringing AI research to the space industry. Right? Because as you can imagine, space is the most risk-averse domain out there. Right? You cannot afford to make any mistakes, because people can die or your equipment worth millions or billions of dollars can explode.
Speaker 1:Right? And so they are very risk averse, and learning is not something that they adopt as quickly as maybe a robotics warehouse at Amazon or something. Right? But then, at the same time, there are so many applications where you would benefit from learning. Right?
Speaker 1:Because it's the usual thing. Right? There are a lot of interesting things that we can reason about and do through physics and heuristics and whatever we know about the problem, but we are still limited by how much data a single human, or a set of humans over time, can process. Right? But computers can consider a lot more data, a lot more variables, a lot faster, and a lot more efficiently than any humans.
Speaker 1:Right? And so there is a lot of scope for learning in these control applications within the space industry. And so, yeah, I'm very excited about a future where, as in the motivating example of my PhD defense, you have these rovers autonomously and continually operating in their environments, continually learning, adapting to the challenges that they face locally in their environments, for the problems that they are trying to solve, with little to no human intervention. Right? That would be wonderful, and I want to see it at that scale as well as at the smaller-scale things.
Speaker 1:Right? For example, if you are a satellite, your main source of energy is solar energy. Right? How do you optimally distribute this constantly changing amount of energy that you have into the different parts of the satellite? Right?
Speaker 1:Like, you need communications with the Earth, you need to give power to your camera modules, which are monitoring climate change or weather patterns or whatnot, and you need enough power for your electronics and even for the computer that is making these decisions. Right? So how do you optimally distribute solar power within a satellite, over time, throughout the operational life cycle of the satellite? Right?
Speaker 1:And so all of these are very interesting control problems, from this scale all the way to operating rovers autonomously on a different planet. So, yeah, I'm excited to see all the new developments that happen and what role I play in making these things a reality.
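As an aside for readers: below is a minimal, purely illustrative sketch of how a satellite power-allocation task like the one described above could be written down as a continuing (non-episodic) decision problem, the kind of setting discussed earlier in this episode. The subsystem names, solar model, reward weights, and fixed policy are all assumptions invented for the example, not anything from a real mission.

```python
import numpy as np

# Hypothetical illustration only: a toy continuing "satellite power allocation"
# problem. The subsystem names, solar model, and reward weights are invented
# for this sketch; a real mission model would look very different.
rng = np.random.default_rng(0)

SUBSYSTEMS = ["comms", "camera", "compute"]    # assumed onboard loads
WEIGHTS = np.array([1.0, 0.7, 0.5])            # assumed value of powering each load


def solar_input(t):
    """Available power varies with a 90-step 'orbit' plus a little noise."""
    return max(0.0, 5.0 + 4.0 * np.sin(2 * np.pi * t / 90) + rng.normal(0.0, 0.3))


def step(t, allocation):
    """One decision step of the continuing problem: there is no terminal state.

    `allocation` gives the fraction of currently available power sent to each
    subsystem (fractions should sum to at most 1). The reward is the weighted
    power actually delivered, so a good policy is one with a high long-run
    *average* reward rather than a high episodic return.
    """
    available = solar_input(t)
    delivered = available * np.asarray(allocation)
    reward = float(WEIGHTS @ delivered)
    observation = np.concatenate(([available], delivered))  # crude state features
    return observation, reward


# A continuing interaction loop: no episodes, no resets, it just goes on.
obs, t = np.zeros(1 + len(SUBSYSTEMS)), 0
for _ in range(5):
    action = np.array([0.4, 0.4, 0.2])          # placeholder fixed policy
    obs, r = step(t, action)
    t += 1
    print(f"t={t}, reward={r:.2f}")
```

The point of the sketch is only the problem structure: the interaction never terminates, so "doing well" has to be judged by the reward rate over time.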
Speaker 2:Awesome. And as you can see from the news, that stuff is moving pretty fast. So you might actually get to
Speaker 1:Oh, yeah.
Speaker 2:see your work in production, so to speak.
Speaker 1:Yeah. I mean, that's a very topical thing, so let me quickly pick up on that. I remember I watched the first Starship launch in April of 2023.
Speaker 1:Right? And it sort of exploded within 3 minutes of its first flight. And from that to today, within 18 months, we just take it for granted. Of course it's going to clear the launch tower. Of course it's going to deploy the upper stage. And of course it's going to do a controlled landing of the booster, and now the tower even caught it with its Mechazilla chopsticks. Right?
Speaker 1:It's incredible. And what Starship brings to the table is that the rocket is just so big compared to whatever we have had until today that you can take a lot more equipment up for, like, 2 to 3 orders of magnitude less money. Right? So imagine a near future where labs like the one that I came from can have their own rover on Mars, for example. Right? It's completely unfathomable today, but within a few years of things like Starship being operational, that cost might be maybe $300,000, which is a large amount of money, but not too much for a decently sized lab.
Speaker 1:Right? So you can afford to have experimental rovers on Mars where you can have these sorts of demonstrations of learning-based technologies, and you can use that experience to convince higher-ups at the Canadian Space Agency or NASA or wherever to have bigger missions that use these learning technologies. Right? And so I am very excited about where we are heading with the current innovations in space science and technology, and even more excited about what comes next in terms of learning.
Speaker 2:Okay. Speaking of that catch, the Mechazilla catching the Super Heavy booster, do you think RL was used in that? Or how do they do that?
Speaker 1:Yeah. I think nobody knows except them, because it would be a big trade secret. Right? It's completely bonkers that you can even do something like this, so I'm pretty sure that they are protecting it very tightly.
Speaker 1:So, yeah. To answer your question, I don't really know what technology they are using, but I would think that
Speaker 2:I mean, it's a hard control system problem. Right? Like, a very hard one. And so, I mean, what would the other options be, like, classical control?
Speaker 1:Most things in the space industry are classical control. Right? And it is a very well-developed field; there have been decades of work done along those lines. Plus, the people working in these sorts of applications are some of the smartest people around, and not just scientists; space is a very engineering-first discipline, and for a lot of problems you can engineer your way through. So, for example, SpaceX is not solving the problem of landing this rocket anywhere, at any set of chopsticks or any landing pad. Right?
Speaker 1:They have specifically designed it. They are controlling both sides of the problem: the launch pad and the rocket. Right? So they are putting most of the complexity of, for example, the catching into the launch pad. Right?
Speaker 1:So from some of the previous SpaceX rockets you've seen, you have those landing legs. Right? So the complexity of landing is more on the side of the rocket, but now they are pushing it to the side of the launch pad. Right? So that means the rocket can be more lightweight.
Speaker 1:It can carry more load instead of its own heavy metallic legs. So now there is an engineering solution: to make a reusable rocket, they went from a general thing that can land anywhere if it has legs to a more specific thing that can land at SpaceX launch pads, because the pad has these arms and the rocket has certain notches, which are the load-bearing things that actually get caught on the arms. Right? So instead of making huge legs to take all the load, you just have these small notches, and the chopsticks are doing most of the lifting. Right?
Speaker 1:So they have engineered away quite a bit of their reusable rocket problem into the launch pad.
Speaker 2:So you're saying because they constrain the problem, it's more amenable to classical control, and maybe RL would be less applicable or required. Is that what you're
Speaker 1:saying? There are many clever people finding ingenious ways to solve a given problem that they have, and the solution can come in the form of some engineering innovations and some scientific innovations, and they work very tightly hand in hand. There are, of course, components of it, like the control part, that are a black box at the end of the day.
Speaker 1:Right? What algorithm it is using can change over time. I don't know if they are currently using learning-based approaches, but perhaps they will test and see that there is some benefit to learning-based approaches where things are more unpredictable, or where there are more things to adapt to because things are not very well known in terms of their complexity or their physics. So, for all we know, they are using some learning-based approaches now. But I think in the future we will see a lot more of it as the trust in these systems grows, from small-scale demonstrations all the way to landing a 275-ton, 20-floor skyscraper from space onto a set of chopsticks. Right?
Speaker 1:So it's just incredible.
Speaker 2:I mean, it looks just like Lunar Lander, the classic RL environment, scaled up. Right? It looks just so much like that.
Speaker 1:Yeah. I mean, you do all those things in simulation, and then you want it to work just as well in real life. Right? And so, yeah, Lunar Lander is great.
Speaker 1:And, yeah, I was just having this conversation the other day about wouldn't it be great to have this chopstick-landing problem as a gym environment? I would love to try my hand at it. And, of course, some people have worked on it and there are some prototypes of it. I look forward to these things in the future.
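As an aside for readers: there is no official chopstick-catch environment that we know of, but for anyone wanting to try their hand at the closest classic analogue mentioned above, here is a minimal sketch of the standard Gymnasium interaction loop on LunarLander. The environment ID and the box2d extra are assumptions that may differ across Gymnasium versions.

```python
# A minimal random-agent loop on Gymnasium's LunarLander, as a stand-in for the
# hypothetical booster-catch environment discussed above.
# Assumes: pip install "gymnasium[box2d]" (the ID may be LunarLander-v2 or -v3
# depending on your installed Gymnasium version).
import gymnasium as gym

env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()                 # random policy, just to show the API
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print("return of a random policy over 1000 steps:", round(total_reward, 1))
```

Swapping the random action for a learned policy is all that changes when you move from this skeleton to an actual agent.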
Speaker 2:You've spoken to me about your interest in the brain and how the mind works, and that you lead a reading group on this. Can you share some ideas from that space and how maybe they link to your thoughts on AI?
Speaker 1:So, yeah, when I joined the university, there was this reading group called the Making Minds reading group. Right? And it was basically books about anything related to AI. Right? It's a very general thing, and so we ended up with lots of books from related fields, which are studying a similar problem of how the mind works, right?
Speaker 1:So it's cognitive psychology, neuroscience, ethology, behavioral economics, and whatnot. Because at the end of the day, these are different disciplines, but they are studying a very similar problem, right, how the mind works or how the mind could work, and essentially we are doing the same thing. So in these groups, we would read the book over the month and come discuss what it had, how it is related to the research that we are doing or should be doing, and whatnot. I've always been fascinated by these interdisciplinary perspectives on how the mind works, and of course the history of RL is tightly linked to these disciplines. Some of the initial algorithms were inspired by work in cognitive psychology, but then RL took its own form. It just became more computationally focused, within computational constraints as opposed to the constraints of neurons and whatnot.
Speaker 1:But then again, people have started making connections back, right, seeing that, oh, some of these computational algorithms proposed in the RL literature could potentially be what the brain is doing as a learning mechanism. Right? So there are people working at the intersection trying to figure these things out. It's a very rich cross-fertilization of ideas, which I love to see. And, yeah, as a more recent anecdote, I learned that there was a recent Nature Neuroscience paper along these lines. We have a lot of behavioral data as well as neuroscience data from animals and from humans, and people try to apply RL-based models to that data to see if it fits.
Speaker 1:Right? And it fits some areas quite well. Some other areas are sometimes missed; you try to fit those, then you miss out on the initial ones. This has been going on for the last few years, decades even, and now this paper came out, and it proposed that maybe we have a new contender of an algorithm, an RL algorithm, which is physiologically plausible in the brain.
Speaker 1:And this algorithm turned out to be the differential TD algorithm that we have discussed in this podcast as part of my PhD work. Right? So I was
Speaker 2:No way. Like
Speaker 1:exactly. I couldn't believe my eyes, and at first I was like, yeah, I probably don't understand this. Right?
Speaker 1:Because it's a neuroscience paper. There was a lot of language and terminology that I don't understand, and so I had the pleasure of meeting Sam Gershman, who's one of the authors of the paper, and we had this conversation where he tried to dumb it down to words that I could understand. And then, in return, I could share some of the more computational side of the ideas, and we've been talking about that recently. And, yeah, I'm very excited to see the extent to which we can explore this connection between my computational idea, completely unrelated to neuroscience, of our differential learning algorithms, and what is plausible in the brain and what sort of data it explains and whatnot. So, yeah, I'm very excited to explore that connection. It was a nice full-circle moment for me, where I started learning about TD and its inspiration from these related fields, and then I did all of this work in RL while reading about how octopuses make decisions or how ants make decisions or how rats make decisions. Right?
Speaker 1:And now maybe I can also contribute to the intersection or connections between these fields. I think that would be really great. Yeah, these are very exciting developments, and I look forward to understanding more of what exactly that line of research is and then seeing if there is something I can contribute to it. I look forward to all that.
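As an aside for readers: here is a schematic sketch of the kind of differential TD-learning update being referred to, as described in the average-reward RL literature: the TD error subtracts a learned reward-rate estimate instead of using discounting, and that same error is used to update the reward-rate estimate. The step sizes and the toy two-state chain below are illustrative assumptions, not taken from the paper under discussion.

```python
import numpy as np

# Schematic sketch of tabular differential TD learning (average-reward prediction):
# the TD error uses a learned reward-rate estimate R-bar in place of a discount
# factor, and R-bar is itself updated from the same TD error.
n_states = 2
V = np.zeros(n_states)      # differential (relative) value estimates
avg_reward = 0.0            # reward-rate estimate, R-bar
alpha, eta = 0.1, 0.2       # value step size; eta scales the R-bar step size


def env_step(s):
    """Toy continuing chain: alternate between two states; state 1 pays reward 1."""
    s_next = 1 - s
    r = 1.0 if s == 1 else 0.0
    return r, s_next


s = 0
for _ in range(10_000):
    r, s_next = env_step(s)
    delta = r - avg_reward + V[s_next] - V[s]   # differential TD error
    V[s] += alpha * delta                       # update the value of the current state
    avg_reward += eta * alpha * delta           # update the reward-rate estimate
    s = s_next

print("estimated reward rate:", round(avg_reward, 3))   # should approach 0.5 for this chain
```

In this toy chain the agent receives a reward of 1 every other step, so the estimated reward rate should settle near 0.5, with the value estimates capturing which state is transiently better than average.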
Speaker 2:Now, looking back to the beginning, I wanted to ask you about how you originally got into ML and RL in the first place.
Speaker 1:Yeah. It has been a fun journey. So back in my undergrad, in my second-year summer, I had the opportunity to intern at Amazon, and there I worked on the Kindle team, which was great for me because I love reading. Right? Anyway.
Speaker 1:But there was quite some machine learning involved there. And, of course, I didn't know anything at that time, but I learned some things from my team on the job, and we did some cool things there, which I can talk about later. But, yeah, coming back to university, I wanted to learn more. Right? Because there were so many unknowns and it seemed really cool.
Speaker 1:So I took this course at my university, IIT Madras, by Balaraman Ravindran, and, yeah, it was an intriguing course, covering supervised learning, unsupervised learning, and so on. In his last lecture, basically, he was advertising his RL course, where he said, we learned about all of these interesting things in supervised and unsupervised learning, but think about it: when you are learning how to ride a bike, no one really tells you that when you're inclined 3 degrees to the vertical, apply 40 newtons of force on the right side. That's not the kind of data you have. Right? You just try something out. You fall down.
Speaker 1:It hurts. It doesn't work. And you're like, okay. I should try something else. You try something else.
Speaker 1:You feel the wind in your face. You hear your parents clapping and then you are like, oh, I should do this more. It works. And you positively reinforce those actions. Right?
Speaker 1:And that is RL. Right? And come learn about it in the next course. And my mind was like, wow. Right?
Speaker 1:Yes. Isn't that how we do most things? I want to learn more. Mhmm. So that's what I did.
Speaker 1:And then I did my master's on deep reinforcement learning, and that was very, very hard, because I barely knew RL and I barely knew deep learning, and I was putting them together. But I still really liked both of them, and I thought I liked RL more than DL, and so I wanted to take a step back and focus on the fundamentals to see, like, how does this actually work? Why should it work? And then scale it to the applications where you would use the ideas from deep learning and whatnot. And, of course, where better to learn the fundamentals of RL than at the University of Alberta with Rich Sutton, right? So that's whom I applied to, and I was very fortunate that I got accepted, and, yeah, that was the start of my journey into machine learning, reinforcement learning, and where I ended up today.
Speaker 1:All thanks to Professor Ravindran.
Speaker 2:Is there anything else you wanna share with the audience while you're here?
Speaker 1:Yeah. I would like to share a couple of things. We covered a lot of technical content in this podcast, but definitely, if you want to learn more about these new and quite technical topics, I would highly recommend looking at the actual papers, at the equations, because having the maths side by side makes these concepts a lot easier to digest. But, yeah, if there is anything new or interesting that you've heard today, I would highly encourage you to look at either my dissertation or the papers that we've been talking about.
Speaker 1:And, yeah, of course, feel free to reach out to me if you have any specific questions that you think I can answer. And, in general, just stay curious and have fun.
Speaker 2:So you started your dissertation with a quote by Zig Ziglar, which I really liked. And I could kind of read this quote in a personal way or even in an RL agent context. So the quote is: what you get by achieving your goals is not as important as what you become by achieving your goals. Can you tell us why you chose this quote and what it means to you?
Speaker 1:Yeah. I had a list of quotes that I had shortlisted, and this was the one I ended up picking, precisely for the reason that you mentioned. Right? It speaks to me personally, and there is something from the point of view of an agent as well, because at the end of the day, we are also learning agents, navigating our world, figuring out the best thing to do. And it's not always about the goal. Like, the goal is important; it helps direct your attention and your efforts in a meaningful way.
Speaker 1:And so achieving goals is very important, but the journey is also something you should keep in mind. Right? That is also an important part. Goals will continue changing and evolving over time, but how you do it and the way you do it, the things you learn, the people you interact with, those are very underrated but equally or more useful things to have in life. And, yeah, that's why I ended up choosing that quote.
Speaker 2:I finally got to meet you after spending time with you at the Amii Tea Time Talks, which I enjoyed so much and highly recommend to people. During the summer, Amii is generous enough to have these open sessions where outsiders can join their conversations and their presentations. And after crossing paths with you online and at conferences, and finally meeting you in person at RLC, to then do this interview with you today just means so much to me. So thank you, Abhishek.
Speaker 1:Yeah. Thank you for your kind words, Robin. I look forward to the many more podcasts that you host. I've learned a lot of cool anecdotes about researchers and their research topics through your podcast, so keep up the great work.
Speaker 2:Love it. Okay. Well, Abhishek Naik, I've been looking forward to this for so long. It's been an absolute honor. And it's quite a neat coincidence that I'm actually doing this interview sitting in your home state of Goa, India. I'm traveling with my family right now, and I didn't realize this was your home state, and now I'm getting to speak to you after you've done this work that I think, in retrospect, is going to be historic.
Speaker 2:It's been fantastic to do this interview with you, and I'm so glad to have you on the show and that our listeners get a chance to hear directly from you. Thank you so much, Doctor Abhishek Naik.
Speaker 1:Thank you so much, Robin. The pleasure is all mine.
Speaker 2:Okay. Well, that's a wrap.