Karol Hausman and Fei Xia
This type of emergent capability is super interesting for us to see and super exciting for us.
Speaker 2:By using a better language model, we can improve robotics performance kind of for free.
Speaker 1:TalkRL.
Speaker 3:TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan. A brief message from Anyscale, our sponsor for this episode. Reinforcement learning is gaining traction as a complementary approach to supervised learning, with applications ranging from recommender systems to games to production planning.
Speaker 3:So don't miss Ray Summit, the annual user conference for the Ray open-source project, where you can hear how teams at Dow, Verizon, Riot Games, and more are solving their RL challenges with RLlib, the Ray ecosystem's open-source library for RL. Ray Summit is happening August 23rd-24th in San Francisco. You can register at raysummit.org and use the code RAYSUMMIT22RL for a further 25% off the already reduced prices: $100 for keynotes only, or $150 to add a tutorial from Sven. These prices are for the first 25 people to register.
Speaker 3:I can say from personal experience, I've used Ray's RLlib, and I have recommended it to consulting clients. It's easy to get started with, but it's also highly scalable and supports a variety of advanced algorithms and settings. Now on to our episode. Karol Hausman is a senior research scientist at Google Brain and an adjunct professor at Stanford, working on robotics and machine learning. Karol is interested in enabling robots to acquire general-purpose skills with minimal supervision in real-world environments.
Speaker 3:Fei Xia is a research scientist with Google Research. Fei is mostly interested in robot learning in complex and unstructured environments. Previously, he's been approaching this problem by learning in realistic and scalable simulation environments, including Gibson Env and iGibson. Most recently, he's been exploring using foundation models for those challenges. Thank you both for joining us today.
Speaker 2:Thank you for having us.
Speaker 1:Thanks for inviting us.
Speaker 3:So I reached out to you about an interview because I wanted to hear more about how you combine these different lines of work with language models and robotics. And I think it's really cool that you focus on this very challenging and practical domain and got some really interesting results. So let's get started with SayCan. That paper is entitled "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances."
Speaker 3:That's Ahn et al., 2022. To start with, could you give us a high-level idea of what SayCan is about?
Speaker 2:Yeah. So SayCan is about allowing robots to execute long-horizon, abstract commands that can be expressed in natural language. The goal was to allow users to just talk to a robot and describe what they want. And even if the task is very long and would be very difficult for the robot to execute, we thought that by leveraging large language models, we should be able to break the task down into a smaller set of steps that the robot is actually capable of doing, and then help the user by executing the instruction. The high-level idea behind that is that we want to combine large language models with robot learning in a way that they benefit each other.
Speaker 2:So large language models in this equation provide the semantic knowledge that is in them. They know quite a lot about the world from looking at text that is all over the Internet. At the same time, they can also understand what you mean exactly, and we can also use them to break down tasks into smaller steps. And then, on the other hand, robotics can be used to ground these language models in the real world. The way that large language models are trained is such that they don't really get to experience what these words actually mean.
Speaker 2:They just get to read them and learn about the statistics of which words come after other words. And we were hoping that robotics can provide this actual experience of what it means to do something. What does it correspond to in the real world? So here, the high-level idea was that the robots would provide this kind of grounding through affordances, so that the two combined together, LLMs and robots, can execute long-horizon tasks.
Speaker 3:And you use that phrase, affordances. What does that mean here, in this context?
Speaker 2:Yeah. In this context, we refer to affordances as something that is aware of what the robot is capable of doing in a given situation with a given embodiment. So for instance, if you ask a robot to bring you a Coke, it should be able to tell whether it's actually possible to bring you a Coke: whether it has the right gripper or the right arm, whether it even has an arm so that it can bring it, whether there is a Coke that it can see, or whether it's in some room where there are no Cokes to be found. So affordances would be something that tells the robot what's currently possible given the state of the environment and the robot's embodiment.
Speaker 1:I want to briefly add that the concept of affordance comes from the American psychologist James J. Gibson in "The Ecological Approach to Visual Perception." It means what the environment offers the individual. So in this case, it means what the robot can do in a certain state. That's what we mean by affordances.
Speaker 3:There's this notion of using language models for scoring these affordance phrases, if I can call them that. Can you talk about how that works? This is using a language model in a different way than I'm used to seeing, not for generation but for scoring.
Speaker 1:So generally, there are 2 ways that you can decode from a language model. One way is called generative mode, because a language model essentially predicts the probability of the next token conditioned on previous tokens. So if you just sample from that probability, then you're doing generative mode. You can do greedy sampling, or you can have some temperature and do more diverse sampling. There's another way: basically, you force the output to be the phrases that you want, and then you calculate the likelihood.
Speaker 1:That is the scoring mode. The reason that we use scoring mode is mainly to constrain the language model to speak our robot language. Our robot only has a certain set of skills, so we want to constrain the output to be from that set of skills, and we want to compare the likelihood of different skills. Through our experiments, we have compared generative mode and scoring mode. In general, we find scoring mode to be more stable.
Speaker 1:We also tried generative mode. In generative mode, you generate some arbitrary phrase, and then you still need to find its nearest neighbor among the robot skills. There are some additional errors introduced in this mapping stage. So, through the experiments, we find the scoring mode seems to be more stable.
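A minimal sketch of the scoring-mode decoding Fei describes, assuming a hypothetical helper lm_log_prob(prompt, continuation) that returns the log-likelihood a language model assigns to a continuation given a prompt (a real LM API will differ):

```python
# Scoring mode: force the output to be one of our robot's skill phrases and
# compare their likelihoods, instead of sampling free-form text.
def score_skills(prompt, skill_phrases, lm_log_prob):
    scores = {phrase: lm_log_prob(prompt, phrase) for phrase in skill_phrases}
    best = max(scores, key=scores.get)  # most likely skill under the LM
    return best, scores
```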
Speaker 3:And what was the state of the art in this area before SayCan?
Speaker 2:Yeah. I think there's a few works that talk about using LLMs as zero-shot planners. There is the original GPT-3 paper that talks about the capabilities of language models as meta-learners. There's also the paper from Wenlong Huang et al. that came out at around a similar time, which talks about using language models as zero-shot planners. These don't talk about real robot results yet.
Speaker 2:And they have been showing, I think, some glimpses of what LLMs are capable of as meta-learners or as planners that could be applied to something like robotics. And I think the other body of work that started being quite popular lately is just using language as a conditioning mechanism for policies for robotics. An example of work like this would be BC-Z by Eric Jang and others, where you still use a large language model to find the embedding of the instruction that you're sending to the robot, and that allows you to leverage natural instructions as well. But it doesn't really extract the high-level knowledge from the LLM the same way that SayCan does, where it can contextualize it based on everything it learned.
Speaker 3:Tell us more about this robot. What is it capable of?
Speaker 1:So the robot that we use is a mobile manipulator from Everyday Robots. A mobile manipulator means it can navigate around, and it also has an arm that can manipulate things. In this work, we mainly use the vision system: we use the camera images, which are 640 by 512 RGB images, as input.
Speaker 1:The robot has a 7-degree-of-freedom arm with a 2-finger gripper attached to the end, and we mainly use that for manipulation. Finally, it has a navigation stack that is based on wheels. It can drive around in the scene without collision. So that's basically the robot that we use. I want to highlight that mobile manipulation is a big challenge, because you need to decide where to stop to enable manipulation.
Speaker 1:If you stop too far away, then you're not able to manipulate. Generally, it's a very difficult problem for us to solve, and the robot platform enables us to do that.
Speaker 3:You have taught this robot a set of skills, each with their own value function, and this is kind of like a pre-training step. How did you train these skills, and what kind of skills are we talking about?
Speaker 2:Right. This is a good question. So at the time when we published, this was, I think, around 300 different task instructions that included fairly simple skills such as picking, placing, moving things around, placing things upright, knocking things over, things like that. And we'll be updating the paper very soon, where we'll be adding additional skills, such as opening and closing drawers and putting things in and out of them. So this is the level of skill complexity that we introduced.
Speaker 2:In terms of how we train these skills, this is, I think, where the majority of the work goes, and this is the really, really hard part of robotics. How do you actually get the robot to move and do the thing that you want it to do? We use a combination of behavior cloning as well as reinforcement learning. In this case in particular, we are constantly comparing the different methods and how they scale with data: how they scale with the amount of demonstrations we have, the amount of data collected autonomously, and whether we can leverage simulation data as well. So which method is winning is constantly changing.
Speaker 2:At the time of the release of the paper, all the policies that were used on the real robot were trained using behavior cloning, and then we used value functions that were trained in simulation, leveraging all the simulation data. Simulation images were then transformed to look more realistic using CycleGAN, so that the images reflect a little bit better what the real world looks like, and we were getting value functions from those.
Speaker 3:So where did the idea for SayCan come from?
Speaker 2:For us, we started with trying to incorporate planning to do more long-horizon tasks, and we were thinking of all kinds of planners and different ways of thinking about the representation of the low-level skills and so on. And as we were looking into that, Brian and Fei noticed that language is a really, really good interface for planning. Not only is it a really good interface that allows you to compose these different plans in many different ways, very compositional, but it also allows you to leverage large language models, which are kind of like a planner in disguise that you can just take from another lab that has trained it, and see how well it works for your robotic domain.
Speaker 1:Yeah. I think during that time, there was also a plethora of work that discussed using language models as zero-shot X, where X could be a zero-shot reasoner, a zero-shot planner, a zero-shot learner. And we were just thinking, what if the X is robotics? What knowledge can we extract from a large language model and apply to robotics? Apparently, when we talk to a language model, it produces something that is reasonable.
Speaker 1:For example, if you say, I spilled my drink, it will say, find cleaner, find vacuum. It's not absolutely right. It's not actionable, but we find that the knowledge is there. So the question then becomes, how do we make that knowledge more actionable? And that kind of inspired the work of SayCan.
Speaker 3:Okay. And then just one more definition: can you clarify what you mean by grounding in this context, exactly?
Speaker 2:Specifically in SayCan, grounding refers to this idea of affordance models that allow you to predict the success rate of doing a certain task given the current state. What's the probability of succeeding at the task, given that you're starting at a particular state and given the description of that task? More generally, the idea of grounding basically means that the LLMs don't really know what the words that they're operating with actually mean. They're kind of just like parrots that memorize different statistics. And robotics allows them to associate these words with real-world experiences.
Speaker 2:So it grounds them in real experiences, so that the robot, or the whole system, actually knows what it means to pick something up or to drop something, what it feels like and also what it looks like. So it's much more grounded knowledge.
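In code, the affordance model Karol describes might look like the sketch below, where value_fn stands in for a trained value function that estimates success probability; the names are illustrative, not the paper's API:

```python
# Affordance: P(success | current state, skill). High only when the skill is
# actually feasible right now (e.g., a Coke is visible and reachable).
def affordance(observation, skill_phrase, value_fn):
    return min(max(value_fn(observation, skill_phrase), 0.0), 1.0)
```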
Speaker 3:And I see that SayCan turns a natural language request into a list of steps that corresponds to the set of skills it already knows. So if the human said, how would you get a sponge from the counter and put it in the sink? Then the robot comes back with a list of steps: 1, find a sponge; 2, pick up the sponge; 3, go to the sink; etcetera.
Speaker 3:How do you get a general language model to come back with such a specific list?
Speaker 1:Yep. Maybe I can speak to this question. So the way that we get the language model to produce the steps is the following. First, we use few-shot prompting. We show the language model a couple of examples where a human asks a question, and the robot lists 1, 2, 3, 4, 5.
Speaker 1:What are the steps? The few-shot prompting generally gets the structure of the answer correct. So then, every time a human asks a question, the robot will answer, like, 1, 2, 3, 4, 5. That's the few-shot prompting part. Then we also have the scoring-based decoding mechanism.
Speaker 1:So a human asks a question, and then we have the robot say "1." and leave a blank there, and we put in all the possible actions the robot can do and then score the different options. So for example, when the question has a sponge in it, every option that also contains sponge will have a higher score. That's just how a general language model works; it scores relevance. So that's the language part of the decoding scheme. We also have another branch that predicts affordances: what is the success rate of finding a sponge here?
Speaker 1:What is the success rate of picking up the sponge here? We multiply the 2 scores, the language score and the affordance score, and we get a combined score. Then all the options get a combined score, and we just choose the highest one. In this case, the first step would be to find a sponge, and then we repeat the process: append the step, a "2," and a blank, and ask the language model to score the second step. We repeat this process until it outputs a done token, which means the entire task sequence is finished.
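Putting the pieces together, the decoding loop Fei walks through might look like this sketch; lm_log_prob, affordance, and execute are hypothetical stand-ins for the language model, the value functions (here assumed to close over the robot's current observation), and the low-level policies, and the handling of the done token is simplified:

```python
import math

def saycan_plan(instruction, skills, lm_log_prob, affordance, execute,
                few_shot_prefix="", max_steps=10):
    prompt = few_shot_prefix + f"Human: {instruction}\nRobot: 1."
    plan = []
    for step in range(max_steps):
        # Combined score: LM relevance times value-function feasibility.
        scores = {s: math.exp(lm_log_prob(prompt, s)) * affordance(s)
                  for s in skills + ["done"]}
        best = max(scores, key=scores.get)
        if best == "done":
            break                          # the whole task sequence is finished
        plan.append(best)
        execute(best)                      # run the low-level skill on the robot
        prompt += f" {best}, {step + 2}."  # executed step joins the context
    return plan
```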
Speaker 2:Maybe to add a little bit to this, one aspect that at least I didn't realize initially, and that we started noticing once we deployed the system, is how interpretable it is. You can very easily visualize what the language model thinks and what the affordance model thinks. And then you can dig in to see whether the affordance model didn't have that many successes in that particular scenario, and because of that it downgrades that particular skill, or whether the LLM makes a prediction that doesn't really make sense. So you can quickly take a look and see exactly how the algorithm is progressing and why it picked that particular step as opposed to another one.
Speaker 3:So when the human asks how to do something, I understand that for the first step, the robot is going to answer based on the existing context. And then when it goes to the second step, is this the language model knowing, from general knowledge, what makes sense to do next after step one is done, and then combining that with the scoring? So it's really leveraging the general knowledge embedded in the language model to order these things correctly. Is that right?
Speaker 2:That's right. Yeah. Because as you execute, the steps get appended to the prompt. So then the language model knows that I already found the sponge; now I should probably pick it up.
Speaker 2:And then it uses the affordance models and all of that to score it.
Speaker 3:Right. That's very interesting. So this ability to chain these things together just emerges from the language model.
Speaker 2:That's correct. Yeah.
Speaker 3:I mean, I guess there are really good things about that, and then some negatives too. Right? Like, it would be challenging to teach it or correct something in its planning. Is that one possible limitation here?
Speaker 2:Yeah. That's a very good point. So here we are diving into the world of planning, which is a very vast world with many different options, where you can do myopic planning, non-myopic planning, feedback, and all kinds of different things. And I think SayCan is just a very first step, showing how you can make these open-loop plans that are very myopic.
Speaker 2:So, for instance, if you failed at step number 2, the language model doesn't really have any feedback mechanism in SayCan to realize that, well, I failed; I should probably try it again, or try to fix it somehow. Some of these things we are starting to work on: how to make it a little less myopic and optimize all the steps, so that you can look a little bit more into the future and think about your plan more holistically. In the follow-up work on Inner Monologue, which I think Fei can talk a little more about, we also try to incorporate some feedback into the planner and into the language model, so that we can correct these missteps if they happen.
Speaker 3:Cool. I'm looking forward to talking about that as well in just a few minutes. In terms of SayCan, were there surprises for you in doing this? Were you pretty sure this would work from the beginning, or did you have a lot of uncertainty about certain parts of it?
Speaker 1:Yeah. I think that's a super great question. At the beginning, we were not sure if it was going to work. We just tried a few examples, and it worked surprisingly well. For example, it started to work: if I say, throw something away, then it understands you need to go to the trash can.
Speaker 1:So I think that's kind of interesting. Another interesting moment for me is when we say, I spilled my drink, can you help? And the robot goes to find a sponge. That's super surprising to me, because in order to do that sort of inference, you need to understand a lot of world knowledge. You need to understand that if I spill something, that means there's liquid.
Speaker 1:If there's liquid, a sponge can absorb the liquid, so the robot should find a sponge. So I always think of the sponge example as a landmark moment; it was super, super surprising, and this kind of emergent capability is super cool. That's one thing that was surprising to me. Another thing that is kind of surprising to me is how well things scale. In the paper that we are about to update, we talk about PaLM-SayCan.
Speaker 1:Previously, SayCan was using a language model called FLAN, which is about a 137-billion-parameter model. When we switched to a larger language model, the Pathways Language Model (PaLM), which is a 540-billion-parameter model, it solved a lot of the planning mistakes that we were seeing at the smaller scale. So it's really surprising that just by scaling the language model, we are solving a lot of planning errors on these common-sense reasoning problems. One particular thing that is super interesting is that language models historically don't handle negation very well. If you say, I don't like Coke.
Speaker 1:Bring me a drink. It will still bring you a Coke, because it has Coke in the context; it has Coke in the previous sentence. So the relevance just makes the score of the Coke higher. With the new PaLM-SayCan, we find we can use a technique called chain-of-thought prompting.
Speaker 1:So basically, before generating the plan, the robot also generates a chain of thought. With a chain of thought, it handles negation super well. It can say, the user doesn't like Coke, so I will bring something else. Sprite is not Coke, so I will bring Sprite. And then it will generate a plan to bring a Sprite instead.
Speaker 1:So this type of emergent capability is super interesting for us to see, super exciting for us, and surprises us a lot.
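An illustrative prompt in the spirit of what Fei describes (the wording is made up, not the paper's exact prompt); the chain-of-thought line is where the negation gets resolved before the plan is decoded:

```python
prompt = (
    "Human: I don't like Coke. Bring me a drink.\n"
    # The generated chain of thought, emitted before the plan itself:
    "Explanation: The user does not like Coke, so I will bring something else. "
    "Sprite is not Coke, so I will bring a Sprite.\n"
    "Robot: 1. find a sprite, 2. pick up the sprite, 3. bring it to the user, 4. done."
)
```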
Speaker 2:Yeah. I think for me, the big surprise that I wasn't expecting was this ability for the robotic system to scale with better language models. I think this is super exciting, because as we know, there are many, many researchers working on getting LLMs to be better, on scaling them up, on just improving them constantly. And just seeing that by using a better language model, we can improve robotics performance kind of, you know, for free, by just swapping it out without changing anything on the robot. I think that is really exciting.
Speaker 2:It kind of allows us to ride that wave of better and better LLMs and improve robot performance that way.
Speaker 3:So we've been seeing bigger and better performing LLMs. I'm not sure what hardware they run on, but I'm assuming they don't run on the robot. Is that right?
Speaker 1:That's right. They run on TPUs, and we call them through some sort of bridge.
Speaker 3:What does the latency look like there? Is that a limiting factor at all, or is it pretty fast?
Speaker 1:Yeah. That's a great question. So some parts of the robot system are latency sensitive. For example, grasping is super sensitive to latency. If you miss one step, then you're getting outdated data, and it can mess up the manipulation.
Speaker 1:Fortunately, the planning is not a time-sensitive step; the robot can just stop and think. It can tell people, I am doing the inference. In terms of the latency for the latest PaLM-SayCan, we are seeing about 3 seconds of latency for the reasoning.
Speaker 1:So it's actually not too bad. Usually, each step takes about 30 seconds, so it's not bottlenecked by the inference speed of the planning.
Speaker 3:And then the value functions, are they running locally as well, or are they just fast anyway?
Speaker 1:Yeah, the value functions are super fast. They can run at a couple of Hertz, so that's not a bottleneck.
Speaker 3:Yesterday, I told my wife and my mother-in-law about the interview today and about the robots. And they were excited. They asked me, well, what can this robot cook? And I had to explain that, you know, robotics is really hard, and the state of the art is not there yet. That's no failing of SayCan; that's just how the field is right now. But what should I tell them in terms of when we might expect a robot like this to cook us a meal? Which sounds pretty far off, but maybe not, the way things are going.
Speaker 2:Yeah. I think it's really hard to make predictions about things like that. I think we're making quite a lot of progress, but it's difficult to foresee the challenges that we'll see in getting there. One thing that I tend to mention when I get asked questions like that by my family is Moravec's paradox: in AI, it's the easy things that are hard, and the hard things that are relatively easier. So the things that seem very easy to us, such as manipulating objects, or cooking a meal, or just walking around and playing with toys, simple manipulation skills, only seem easy because we've been doing them for thousands and thousands of years.
Speaker 2:And evolution just made it so that they seem extremely easy to us, versus the other things that require more reasoning, things like mathematical equations or playing complex games, things that we would usually consider the intelligent things. There, we haven't been doing them for that long on the evolutionary scale, so the algorithms don't have that much to catch up on. Whereas the embodied version of AI, where we actually have to manipulate the world around us and understand what's happening, this is the really, really hard part. And it's difficult to get that across sometimes because, you know, it just seems so easy.
Speaker 2:Like, I can cook a meal very easily, and even a small kid has manipulation capabilities that far exceed what the robots can do today.
Speaker 3:Okay. I'm gonna try to explain that to them. Thanks for that. Okay. And then, in terms of the idea of end-to-end learning versus this compositional style where you're putting together pre-built components, I'm curious how you guys see that. It seemed that at some time, you know, some people were really extolling the virtues of end-to-end deep learning.
Speaker 3:But then more recently, these foundation models have become really popular, where there's a lot of pre-training involved and we don't expect to learn end to end, or at most do fine-tuning. Do you think the future is gonna involve a lot more of this pre-training and composition, the way we're seeing here?
Speaker 2:Yeah. That's a really good question. Looking back at how robot learning has evolved, it seems that initially we started with things that are a little bit more modular. They're a little bit easier to implement, a little bit easier to understand, and they make a lot of sense initially. And then, as we get better at them and they start to work, we put them into this end-to-end system that optimizes only for the thing that we care about, and then it finds the right representations to communicate between these different components.
Speaker 2:That happened in the past, for instance, with perception and control, where you would have a perceptual system that would, for instance, estimate the pose of an object or recognize the object and so on, and then you would take that representation and feed it to a controller. I think that right now, with the language models, with these planners, we are going through a similar cycle, where we are at the very initial step where it's much easier to think of them as modular components: you have a separate planner that just gives you the next set of steps to do, and then you have a separate closed-loop controller that can take a short command and execute it. But I think over time, as we develop them more and more, they'll become more unified and more end to end.
Speaker 2:In this work, in particular in SayCan, prompting the LLMs was just the path of least resistance. It was very easy to do, and we could see the results straight away. But I think we can start thinking about how to combine it into one big system, where we can fine-tune the LLM planner as well as the low-level skills jointly, based on all the data that we are collecting.
Speaker 3:Okay. Let's talk about some of the work that this is built upon. We won't go into great depth with these, but just a few brief mentions. For example, I think the RL algorithm you're using here is MT-Opt, based on QT-Opt. Is that right? And can you briefly tell us about QT-Opt?
Speaker 3:I think, if I understand correctly, that's what you're using to learn the grasping from images, with offline pre-training?
Speaker 2:That's right.
Speaker 3:So why QT-Opt? There are other RL algorithms that could do continuous control from images. Could you just spend a moment telling us why QT-Opt and MT-Opt for this case?
Speaker 2:Right. Yeah. Of course. Yeah. So in our experience, we've been experimenting with a lot of these different algorithms and a lot of different variants.
Speaker 2:I think one aspect that makes it a little bit different for us is that we try to do it at scale, so with a lot of different tasks, a lot of robots, a lot of data, and so on. And often, the algorithms as we see them on smaller-scale benchmarks compare differently at larger scale. With QT-Opt in particular, what we really like about it is that it's really stable. If set up correctly, it just optimizes things quite well, and it usually works.
Speaker 2:And it's much less fragile than actor-critic algorithms. I think we have some hunches on why that is. One thought there is that in QT-Opt, the optimization of the actions is completely independent from the optimization of the Q-function. And I think that makes for a little bit more robust setup, where there's no actor. We just have another optimizer that stays constant throughout training, and that removes one aspect that can be quite difficult in actor-critic algorithms.
Speaker 2:So we just found it a little bit more stable in these large-scale settings. But I think this is not the final answer. As we explore these algorithms more, and there are so many things to try, so many different combinations between these algorithms, different actor architectures, critic architectures, optimization schemes, and so on, I think we'll get more answers. Just at the current time, to us, QT-Opt was working the best.
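The actor-free setup Karol describes is typically implemented with a sampling-based optimizer over actions; QT-Opt uses the cross-entropy method. A minimal sketch of that idea, with illustrative hyperparameters and a q_fn callable standing in for the learned Q-function:

```python
import numpy as np

def cem_select_action(q_fn, state, action_dim, iterations=3, samples=64, elites=6):
    """Approximate argmax over actions of Q(state, action), with no actor net."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iterations):
        candidates = np.clip(
            np.random.randn(samples, action_dim) * std + mean, -1.0, 1.0)
        q_values = np.array([q_fn(state, a) for a in candidates])
        best = candidates[np.argsort(q_values)[-elites:]]        # keep elite samples
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6   # refit and repeat
    return mean  # the optimizer's best guess at the Q-maximizing action
```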
Speaker 3:And Fei, you mentioned ReLMoGen, which I understand was part of your dissertation, and that was partly inspirational for this work. Can you briefly describe what that adds?
Speaker 1:Yeah, sure. So ReLMoGen is a previous work by me, which explores using motion generation and reinforcement learning together. For that work, we were also tackling the problem of mobile manipulation.
Speaker 1:It's super challenging, because you need to control where to move and where to do the manipulation. It's basically a hierarchical reinforcement learning work, where the low level is motion generation, which is not learned but rather uses classical planning-based methods. What we found is that, for these long-horizon problems, it's beneficial to decompose the problem in a hierarchical fashion. And it's even better if the decomposed steps are semantically meaningful, such as navigation steps interleaved with manipulation steps. So I would say that's a good inspiration for the SayCan line of work, where we also decompose a long-horizon problem into a few short-horizon steps, which are more manageable to learn at the low level.
Speaker 3:Okay. And then you also mentioned another work, Actionable Models, which uses hindsight relabeling and goal chaining, I guess, to help with scaling the learning with a fixed amount of data. Can you say briefly what Actionable Models contributes here?
Speaker 2:Yeah. So Actionable Models was the work that we did right after MT-Opt, where I think the main contribution in terms of SayCan is quite nuanced. This is an offline RL method that takes all the data that we collected for MT-Opt, where we had a prespecified set of tasks, like 12 or 14 tasks or something like that. These were encoded as just one-hot vectors, so there was just task number 1, 2, 3, and so on.
Speaker 2:We collected a lot of data with it, and then we did this multitask reinforcement learning with QT-Opt, called MT-Opt. And we were constantly talking about what other tasks to add. As you try to scale these systems, this question actually becomes quite tricky, especially as you try to do this at scale and you want the robots to run autonomously and so on. And this is something that didn't occur, at least to me, when we were starting that project: you have to come up with as many tasks as you can, and at the same time, these tasks have to potentially reset each other, so that they can run continuously and autonomously without any human intervention.
Speaker 2:They also have to be meaningful tasks, and very diverse, and so on. So it seemed that at a certain scale, just coming up with the tasks themselves becomes a bottleneck. So in Actionable Models, we thought that rather than thinking of all kinds of different tasks, let's just do goal-conditioned Q-learning. Let's consider every state as a potential task, as a potential goal, and try to get to that goal. This was done completely offline, so we didn't have to collect any additional data.
Speaker 2:And we trained on all the data collected with MT-Opt, and it worked really well. I think this was a big moment for us: the one-hot representations that we were using before to represent tasks were difficult to work with, and the goal images just seemed much closer to what we actually wanted. It also allowed us to scale the number of tasks significantly, because now any goal image is actually a task representation. And I think that was a step towards getting to language-conditioned policies, where language is this kind of space in between, which is very compositional.
Speaker 2:It's very natural to express to the robot what you want it to do, much more natural, I think, than a goal image. But at the same time, language captures the relationships between tasks much better than one-hot vectors, for instance. So if we had tasks like pick up a carrot and pick up a cucumber, and one is represented as task number 1 and the other as task number 2, then in terms of the representations of these tasks, they're completely orthogonal; there's nothing that they share. Versus the way that language was formed: we call both of these things picking because, whether it's picking carrots or picking cucumbers, they look very similar. The language groups them together.
Speaker 2:That's how it came about. So I think language is not only a really good interface, but it also allows for better representations for learning all of these skills together.
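A minimal sketch of the hindsight relabeling idea behind Actionable Models, assuming trajectories of (observation, action) pairs; the reward scheme here is simplified relative to the paper:

```python
def hindsight_relabel(trajectory):
    """Turn one unlabeled trajectory into goal-conditioned training tuples."""
    goal = trajectory[-1][0]  # the state actually reached becomes the goal image
    examples = []
    for t, (obs, action) in enumerate(trajectory):
        reward = 1.0 if t == len(trajectory) - 1 else 0.0  # success only at the goal
        examples.append((obs, action, goal, reward))
    return examples
```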
Speaker 3:Okay. Let's move on to follow-up work. This is called "Inner Monologue: Embodied Reasoning through Planning with Language Models," and that was with yourselves as co-authors. This paper was published after we scheduled the interview, and it definitely feels like an unexpected bonus, so I'm excited to be able to chat about it with you.
Speaker 3:You extend SayCan in some different ways and get some really interesting results here. So can you talk about the general idea of Inner Monologue?
Speaker 1:Inner Monologue mainly tries to address the shortcomings of SayCan. SayCan does open-loop planning, and it doesn't respond to certain failure cases; it would just continue the plan. For Inner Monologue, we try to let the language model source feedback from the environment and from humans, and then do closed-loop planning. One big inspiration for Inner Monologue is PaLM-SayCan, because we find that using larger language models gives us some extra capability to play with.
Speaker 1:So we try to have a more unstructured way of prompting the language model, and we find it can still do the planning with very high quality. That's kind of the main idea of Inner Monologue.
Speaker 3:So the text in Inner Monologue looks a little bit like a screenplay script, with different actors. There's some narration, different statements by the human, the robot, the scene, some questions, and I gather that's all fed into the language model. So what are the different types of utterances here that the text can contain? It's different than SayCan, right?
Speaker 1:Right. That's different than SayCan. I would like to say that Inner Monologue is also inspired by Socratic Models, where they find you can just use multiple models that communicate using language. This is exactly what we are doing here. There are different actors that can talk to each other.
Speaker 1:And then there is also a planning step, which summarizes whatever has been said and generates a plan. So here are some of the actors that we have. The first is success detection, which detects if a previous step was successful or not, and then it will say the action was successful or the action failed. Second, we have passive scene description. The passive scene description basically describes the scene.
Speaker 1:It can be an object detector telling you there are certain objects in the scene, there are certain objects in certain locations, or there is some state of an object. These all fall into passive scene description. There is also active scene description, where the robot can actively ask questions about the scene.
Speaker 1:It can ask a human, what is the color of a certain thing? Or it can ask a human, here are 2 drinks, which one do you like? So it will ask a question when it feels it needs to. These are the sources of feedback that we are gathering.
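A sketch of how these feedback sources might be appended into the screenplay-style prompt; the actor labels and formatting are illustrative, not the paper's exact scheme:

```python
def append_feedback(monologue, success=None, scene_objects=None,
                    question=None, answer=None):
    if success is not None:   # success detector
        monologue += f"Success: {'action was successful' if success else 'action failed'}\n"
    if scene_objects:         # passive scene description, e.g. from an object detector
        monologue += f"Scene: I see {', '.join(scene_objects)}.\n"
    if question:              # active scene description: robot asks, human answers
        monologue += f"Robot: {question}\nHuman: {answer}\n"
    return monologue          # the growing history is fed back to the LLM planner
```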
Speaker 3:So we talked to Rohin Shah recently on the show, and he talked about this idea of active querying and actually learning to ask. But in this setting, how does your system learn or figure out when it's appropriate to ask?
Speaker 1:We figure out when to ask mainly through few-shot prompting. We give it a couple of examples where the query is ambiguous, and then it will ask to clarify the query. Going a little bit into the implementation details: the robot finishes a task, and we score different options. We score "continue" and "ask," right?
Speaker 1:So if the "ask" score is higher, then we further prompt the language model to ask a question. Here, it's a slight deviation from the SayCan scoring-based decoding, but for these cases, we find generative decoding is more helpful, and it can always generate meaningful questions to ask to reduce ambiguity, for example.
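In code, that decision might look like the sketch below, with lm_log_prob and lm_generate as hypothetical stand-ins for the scoring and generative decoding modes:

```python
def maybe_ask(monologue, lm_log_prob, lm_generate):
    """Score the two control options; switch to generative decoding if 'ask' wins."""
    if lm_log_prob(monologue, "ask") > lm_log_prob(monologue, "continue"):
        return lm_generate(monologue + "Robot: ")  # a free-form clarifying question
    return None  # the request is unambiguous; carry on with the plan
```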
Speaker 3:And you reported some interesting generalization and emergent capabilities in this Inner Monologue work. Can you talk about that? Were some of those surprising to you?
Speaker 1:Yeah. There are some generalization and emergent capabilities that are super surprising to us. First, let me briefly define what emergent capabilities are. I guess there are two meanings. One is that the capability only emerges at a larger scale.
Speaker 1:So in this case, we use PaLM, the Pathways Language Model, and such capability is only exhibited at that scale. If we use a smaller language model, it probably will not have those capabilities. The second, implicit meaning of emergent capability is that it's not shown in the few-shot prompt, so those capabilities are completely new to us. One capability that we find is that it can generalize to a human changing the request. For example, a human says, go do task A, and then while the robot is doing it, we insert, go do task B, and the robot will change its plan to do task B.
Speaker 1:And here, we can also say, never mind, go finish your previous task. And in this case, the robot goes back to finish task A. This is quite surprising to us, because we find that it understands the history; it understands what a previous task means, all due to the large language model, and the interface is very natural.
Speaker 1:There are a couple of other emergent capabilities. In one case, we find it can also generalize to using emoji as a query. For example, you can say, a yellow square points to a red circle, and it will put the yellow block into the red bowl. So this is another emergent capability that we see that is super interesting. Another very interesting emergent capability is that it can also propose a new plan based on a prompt such as, try a new method. When the robot fails at doing a task, there are usually two reasons. First, it could be that its manipulation policy has noise, so it fails at doing the task.
Speaker 1:In this case, the best plan is to try again. It could also be that the plan is not feasible. For example, you are trying to grasp a block and the block is too heavy; in this case, the plan would be to change which block to grasp. So we find we can just provide a tiny hint to the language model. We can say, please try a different plan, or, do you have any other ideas?
Speaker 1:And then the language model will just generate a different plan. So this is also super exciting for us. We had never seen this in our previous work.
Speaker 3:And I saw that you showed Inner Monologue did well on unseen tasks, actually, to me, surprisingly well, whereas the baseline method got all zeros. Did you expect it to do this well on the unseen tasks?
Speaker 1:I think, in terms of unseen tasks, I guess you are referring to the comparison with CLIPort.
Speaker 3:Yes.
Speaker 1:Yes. So CLIPort is trained to do those tasks with demonstrations. In that case, it naturally doesn't generalize to new tasks super well. It will perform pretty well on the same tasks, but it doesn't generalize to novel tasks. That's because CLIPort doesn't leverage the rich knowledge present in the large language models.
Speaker 1:In our work, the generalization mainly comes from the language model. In that case, it's kind of natural for Inner Monologue to generalize to novel tasks where we see some other methods struggle.
Speaker 3:And does Inner Monologue consider the whole text as the prompt? You mentioned a scratchpad; I actually didn't follow that. Do you use the whole thing as the prompt for the language model?
Speaker 1:Right. It uses the whole thing as the prompt. I mentioned scratchpad because there is some relevant work in the NLP community that kind of inspired Inner Monologue. Two of the papers: one is chain-of-thought prompting, where they allow the language model to generate a chain of thought before actually decoding the answer. Another is called Scratchpad, where the language model can call different modules and keep some notes in the scratchpad before decoding an answer.
Speaker 1:In Inner Monologue, we use the inner monologue itself as a scratchpad. Every actor can write to that scratchpad, and then we decode some actions. Every time we decode, for example, a robot action, it is consuming all the previous history steps as a prompt.
Speaker 3:I see. Okay. Can we talk about this set of pre-trained skills? How do you expand this set of skills? Karol, you said that a large part of the work was training these skills.
Speaker 3:Do you see a way around that? Are you envisioning a way to automatically acquire skills unsupervised? How do you see that scaling up?
Speaker 2:Yeah. This is a really important question, and I can't emphasize enough how difficult it is to actually work on the low-level skills and how important it is. This is definitely the bottleneck for the entire system. So I think one thing that is quite exciting about SayCan is that before, when we were thinking about what skills to add, we would usually just sit down as engineers and researchers, think about it, vote or something like that, and then add that skill. Now we are at the level where SayCan starts to be useful and can be used in an office, for instance, where the robot can maybe bring you a snack from a kitchen or something like that.
Speaker 2:So it can start interacting with real users. I think this is probably a much better way of coming up with new tasks: we can just see what things users ask for quite often, and then see what skills would enable that, so we can more automatically decide which things are missing. In terms of how to add the skills, there are many options there. SayCan is quite modular in that way: you can add a skill that was trained with behavior cloning or with reinforcement learning, or it could be a scripted skill.
Speaker 2:Anything works as long as it has an affordance function associated with it. So it allows us to consider all these options separately when we are thinking about these skills. We're also thinking about potentially having the language model come up with the skills that could be useful in these settings, which would automate it even further. Overall, yeah, there's a lot of work that goes into this, and I hope that we'll have some answers or some reports on this soon. So stay tuned.
Speaker 3:It seems to me the use of these large language models in this context is maybe a bit of a double-edged sword. You showed in the Inner Monologue paper that you had a request in Chinese. And even though you didn't necessarily design it to understand Chinese, because the language model had seen Chinese before, it was able to understand it zero-shot and do the right thing, which is pretty amazing. And then on the other hand, the language models would have all these things in them that you don't really need for this setting, or maybe, I'm not sure, that you'd even necessarily want. Like, do you want your kitchen robot to have read all of Reddit, or to understand irony, and all this stuff? I don't know. Maybe you do.
Speaker 3:Can you talk about the idea of using these general-purpose language models for very specialized purposes? Do you think in the future you'd want to have very specialized language models that were kind of pared down? It seems to me there's a tension here: a good old-fashioned AI system just doesn't know enough things, and you have to keep working hard to add facts and knowledge, and here you have the opposite problem, where you have an LLM which actually, in some ways, maybe knows too much. Is that a concern at all, or not so much?
Speaker 1:First of all, we're using the general-purpose large language model mainly because of its scale, its emergent capabilities, and the knowledge built into it. So it will be difficult to shrink the model down while still keeping that knowledge; that would be one key challenge for us. However, we do have motivation to bring those models down, to kind of distill those models. One main reason is efficiency. Currently, the inference is quite heavy, and we definitely want to make it smaller so that we can do inference faster. In terms of unwanted behavior, I would say the current SayCan decoding is quite safe, because we only allow it to output certain actions, using the scoring mode.
Speaker 1:So we don't get a lot of undesired behavior. For us, if we want to shrink a model down, it's mainly for efficiency purposes, not because of unwanted behavior.
Speaker 2:Yeah. I think, in terms of specializing these general-purpose models, right now the main tool that we have for this, other than affordance scoring and so on, is prompting. Right? So you can think of prompting as some way of specifying the task and specializing the model to the specific thing that you want it to do. I think as we gather more data for the tasks that we actually care about, we could also think of other ways, such as fine-tuning the model, fine-tuning a set of parameters.
Speaker 2:And I think there are many options that we could consider there to make the model a little bit more specialized, that go beyond just prompting it.
Speaker 3:So there's a line in the SayCan paper, in the conclusion, that says it is also interesting to examine whether natural language is the right ontology to use to program robots. And I'm just observing that most language models seem pretty generic; they are only conditioned on the previous text, and so it's maybe not clear how to condition them on other things. Do you see wanting to have language models that can be conditioned on other things? Or do you think the vanilla language models, whether they're distilled or not, are the right paradigm here? Any comments on that?
Speaker 2:There may be 2 aspects to this, and this may be a little more philosophical. I think the first aspect is that language just seems to be a really nice interface that is very interpretable for all of us, but it also captures the compositionality and the relationships between all the different tasks that we might want the robots to do. So I think it's just a really nice representation that can potentially make robot learning easier. Because, as we mentioned earlier, if you have 2 tasks that look very similar, they will probably be described by the same set of words. And I think that's really useful.
Speaker 2:And kind of for free on top of that, you also get the interpretability of it. And then separately, and I think this is what your question is pointing towards, I think we should be considering other modalities in these large models and how they can influence the planners and robot learning in general. I think something like Inner Monologue or Socratic Models is just one way of doing this that is more practical, because a lot of multimodal models have a language component.
Speaker 2:So you can just ask a vision-language model to describe what it sees in language, and that's the way you can incorporate it into your big language model. But as these multimodal models get better and better, I would hope that we can incorporate much more into our prompt. We can incorporate what we currently see, what our confidence is in the actions that we are about to take, and so on. This would be a much richer way of specifying, or kind of metaprogramming, the robot.
Speaker 2:Right? So not only can you specify, I want you to help me clean something up, but maybe you can also demonstrate something, and that's also part of the prompt that the robot can then understand: understand that you wanted this thing to be picked up in a certain way, or something like that. So I think there's much more work to be done on these interesting multimodal prompting mechanisms that would allow us to teach robots better.
Speaker 3:So I get that SayCan is lab work. It's not meant to be deployed in its current state. But when we eventually get to these types of robots being deployed, do you think they may have something in common with SayCan? Do you think there are any parts of these systems that might be long-term advances, versus stepping stones? Or is it more of a stepping-stone situation?
Speaker 2:Yeah. That's a good question. I think, if we think of language models as these reasoning engines that can tell us a lot about the semantics and about the world, in general, probably some form of this is here to stay. These seem to be really, really powerful models that can understand common sense to a certain extent. And that, I think, is very, very helpful for robot learning, and I think we'll see this going forward.
Speaker 2:Maybe it will be a slightly different kind of model that can also incorporate other modalities, as we mentioned. But I can imagine that some form of this distilled knowledge would stay.
Speaker 3:Can you talk a bit about how you think about your future work? To what extent do you plan far in advance, or are you taking things more step by step? Do you replan all the time? How do you plan your future work?
Speaker 2:Yeah. That's a good question. I think it depends on the individual. For this project, I tend to split it into 3 main aspects. The first is data generation: we need to be able to just generate a lot of data with robots.
Speaker 2:Then the other aspect is data-sponge algorithms: finding algorithms that are able to absorb all of that data. That's often very, very tricky, and we spend a lot of time there. And then the third aspect is just things such as modeling. How do you get the models to be better?
Speaker 2:And I think for a long time, the bottleneck was actually the algorithms themselves, how well they can absorb all the data. We saw, for instance, in language, that once transformers came out, they were just really, really good data sponges, and you can throw a lot of data at them. And then you can observe these fascinating scaling laws, and the performance continues to improve. And we've been trying to find an equivalent of that in robotics, whether it's an offline RL algorithm, or some imitation algorithm, or something else: something that can absorb as much data, and as diverse data, as possible. I think now we are slowly getting to the point where this is no longer a bottleneck.
Speaker 2:There are a lot of algorithms that can actually absorb quite a lot of data. So I think we'll then look at the state of things and see what the bottleneck is now. And I suspect that it will be data generation itself. So how can we develop algorithms, or even just processes, for collecting very diverse data for very diverse tasks, either on the real robots, or by incorporating human data? How can we scale up our data collection significantly?
Speaker 3:Are you surprised by some of the fast progress in AI lately? And do you think it's gonna keep accelerating?
Speaker 2:For me, I personally am really surprised that the scaling laws continue to hold. I find it absolutely fascinating, and I think we maybe take it for granted a little bit: we saw it a few times, and now it's considered maybe boring, or some people refer to it as just pure engineering, that there aren't any novel ideas, it's just about scaling things up. And I think, first, scaling things up is extremely hard, and I haven't really subscribed to the notion that it's just engineering.
Speaker 2:I think it's really, really hard, and there are as many novelties there as in any novel research idea. And, yeah, it's mind-blowing to me that we can make so much progress by pushing this one direction.
Speaker 3:How do you see the kind of competition slash cooperation between different labs? And are other labs doing cool work too?
Speaker 2:Yeah. There are plenty of other labs that do really cool work. I think we pay a lot of attention to what's happening in academia and in other industrial labs. I'm particularly interested in algorithms that address problems we start noticing at scale. I think we get a lot of inspiration from different works that come out of different labs, where sometimes maybe they don't even realize that this is a problem that is really apparent when you scale things to many robots, or many robots doing many different tasks.
Speaker 2:And, yeah, these are super, super useful things. We also tend to work with interns and student researchers, and it's always refreshing when they come in and bring all kinds of new ideas and ways to use our system. So, yeah, I think we draw a lot of inspiration from those.
Speaker 3:What do you think of of the concept of AGI? Do you find that that idea useful to talk about, or is it a distraction?
Speaker 2:Maybe, on a more personal level, it's a little hard to think about AGI when your day-to-day work is looking at the robot struggling to grasp an apple on a countertop. When you see how hard these things are, and how much work it takes to actually get it to do the simplest things, it's quite difficult to imagine all the steps that would need to be taken, and how at some point it will just progress exponentially.
Speaker 1:From my side, I like to be more grounded and just make solid progress on the robot's capabilities. So I haven't been thinking about AGI too much. However, I do think that when people discuss AGI, they also think about ethics and safety. And I think those are good for us to think about early on. When we start to build these methods, we should also take safety and ethics into consideration.
Speaker 1:And I think, down the road, when we have more powerful models, we will be safe in that regard.
Speaker 3:Makes sense. And it seems that there's been such great progress in terms of the language models being able to write these big essays, the image models being able to generate incredible art, and then there's kind of a gap between that and what we see in robotics. Are we waiting for something? Maybe it's the data sponge that you were talking about, or the data generation, Karol, but are we waiting for some advance that can lead to some sort of ImageNet moment for robotics? Is that ahead of us, or is that behind us?
Speaker 2:There have been a few moments that were significant, I think, in robot learning, but I don't think we've had the ImageNet moment yet. I think one of the underlying hopes behind something like SayCan was to attach ourselves a little bit more to the progress that is happening in other fields. Right? So if we find some way of having language models improve robotics, then as they improve, robotics will improve as well; or the same with multimodal models and so on, as shown in Inner Monologue. But in terms of the low-level skills, I think these are still the early days.
Speaker 2:We are, I think, quite bottlenecked by the data available to us. There isn't that much data out there of robots doing many different things, you know, nice datasets of real robots doing diverse sets of tasks. So that's another struggle that we have to incorporate in all of this. But I think we're making decent progress there. Still, I think the bigger breakthroughs are in front of us.
Speaker 3:Is there anything else I should have asked you about today, or anything else you want to share with our audience?
Speaker 1:I guess I would just briefly mention that it's really inspiring to see the progress of natural language processing trickle down into robotics and start to solve some of the robotics problems for us. In general, I think this more interdisciplinary research in AI is super exciting, and we cannot wait to see more of that coming into robotics.
Speaker 2:Yeah. I fully agree. I think this unification, it was really hard to imagine anything like this even a few years back: that some improvement you can make to an architecture can improve robotics and vision and language and all of these things. So on one hand, it's super exciting to see something like this, where we are all pushing in one direction, and we can all benefit from each other. And even for us specifically at Google, we are closely collaborating with language researchers.
Speaker 2:And it's just very cool to have this interdisciplinary team, where we can all push in a single direction. On the other hand, it's also important, especially for academic labs, not to jump on the hype train. If there's something that you're really passionate about, something that you believe will improve robotics or robot learning or whatever you're interested in, I think it's important to keep pushing on that as well. I'm a little worried that we'll lose a little bit of this diversity of research ideas because of this unification. So I think it's important to keep both. But, yeah, it's super exciting.
Speaker 3:Well, I want to thank you both so much for joining us today and taking the time to share your insights with the TalkRL audience. Thank you so much, Fei Xia.
Speaker 1:Thanks. Thanks for the invitation.
Speaker 3:And thank you, Karol Hausman.
Speaker 2:Thank you. Thanks for having us.