
PyTorch at Tesla - Andrej Karpathy, Tesla

Transcript

[00:00:05] Andrej Karpathy: Good. Hello, everyone. I am Andrej, the Director of AI at Tesla, and I'm very excited to be here to tell you a little bit about PyTorch and how we use PyTorch to train neural networks for the Autopilot. Now, I'm curious to do a bit of a show of hands: how many of you actually own a Tesla? Okay, a few. And how many of you have used or experienced the Autopilot, the product? Okay, yes. So, for those of you who may not be familiar with Autopilot, its basic functionality is that it keeps the car in the lane and it keeps the car away from the vehicle ahead of you. Some of the more advanced functionality that we've been building for the Autopilot includes Navigate on Autopilot, which allows you to set down a pin somewhere on the map, and then, as long as you stick to the highway, the car will do all of the lane changes automatically and will take all the right forks to get you there.

[00:01:03] Andrej Karpathy: So that's Navigate on Autopilot. With Smart Summon, which we only released about two weeks ago, you can summon the car to you in the parking lot. You hold down "come to me," the car comes out of its parking spot, and it will come find you in the parking lot. You get in like royalty, and it's an amazing, magical feature. More broadly, the team is very interested in developing full self-driving capability, so that's what everyone on the team is focused on. Now, famously perhaps, we don't use lidar and we don't use high-definition maps. So everything that we build for the Autopilot is based on computer vision and machine learning on the raw video streams that come from the eight cameras that surround the vehicle. This is an example of what we might see in one single instant, and we process this, as you might imagine, with a lot of convolutional neural networks. Now, Tesla is a fairly vertically integrated company, and that's also true when it comes to the intelligence of the Autopilot. In particular, of course, we build our own cars and we arrange the sensors around the vehicle, but we also collect all of our data, we label all of the data, and we train on on-premises GPU clusters. And then we take it through the entire stack: we run these networks on our own custom hardware that we develop in house, and we are in charge of the full lifecycle of these features. We deploy them to our fleet of almost three quarters of a million cars right now, we look at telemetry, and we try to improve the features over time, so we kind of close the loop on this. Now I would like to dive a little into some of the distributed training that we employ in the team.

[00:02:34] Andrej Karpathy: The bread and butter for us is, of course, analyzing images. So here's an image. In order to drive in this environment, you actually have to understand a lot about it: we have to understand the traffic lights, the lane markings, the cars, and so on. So you end up in this massively multi-task setting very quickly, where you just have to know a lot about the scene. A lot of our networks therefore take on this outline, where you have a shared backbone with a number of task heads hanging off of it. Just to give you an idea of the workloads, these are typically ResNet-50-like backbones running on roughly 1000 by 1000 images, and then they have these heads and other structures on top. We're doing this partly because we can't afford to have a separate neural network for every single task, because there are many, many tasks, almost 100 tasks, and so we have to amortize some of that computation into the shared backbones. Here are some examples of what these networks, which we call HydraNets because of the shared backbone and multiple heads, might look like.
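To make the shared-backbone idea concrete, here is a minimal PyTorch sketch of a network with one trunk and several task heads. It only illustrates the pattern described above, not Tesla's actual code: the task names, output channel counts, and the 960x960 input size are invented for the example.

```python
# A minimal, hypothetical sketch of the shared-backbone / multi-head
# ("HydraNet") pattern: one trunk, many lightweight task heads.
import torch
import torch.nn as nn
import torchvision

class HydraNet(nn.Module):
    def __init__(self, head_channels):
        super().__init__()
        # Shared ResNet-50-like feature extractor (drop avgpool and fc).
        trunk = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])
        # One head per task, e.g. lane markings, traffic lights, objects.
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(2048, out_ch, kernel_size=1)
            for name, out_ch in head_channels.items()
        })

    def forward(self, images, tasks=None):
        features = self.backbone(images)        # shared computation
        tasks = tasks or self.heads.keys()      # optionally run only a subset of heads
        return {name: self.heads[name](features) for name in tasks}

# Example: three of the ~100 tasks, with made-up channel counts.
model = HydraNet({"lane_markings": 4, "traffic_lights": 8, "objects": 16})
outputs = model(torch.randn(2, 3, 960, 960), tasks=["objects"])
```

The point of the `tasks` argument is that a given training job can exercise only the heads it cares about while still sharing the trunk's computation and parameters across all tasks.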

[00:03:30] Andrej Karpathy: Is this video playing? It's not, okay. I'm just going to go to the next video; that one was going to show you some lane markings and so on. This is a video showing some of the road edges that we are interested in for the purposes of Smart Summon, because we have to understand where we can drive in this environment, and we want to avoid the curbs in this case. Here we are making predictions in the image, and then we are casting them out and stitching them up across space and time to understand the layout of the scene around us. Here's an example of this occupancy grid: we're showing just the road edges and how they get projected, and the car winds its path around this parking lot while the person is summoning it. It's just trying to find its way towards the goal through this parking lot. So here's how things get stitched up. Now, so far I've only talked about neural networks that run on independent images, but of course you very quickly run across tasks that actually have to be a function of multiple images at the same time. For example, if you're trying to estimate depth in any of these images, it might actually be very helpful to have access to the other views of that same scene in order to predict the depth of every individual pixel. Or if you're trying to predict the road layout, or if you're trying to steer the wheel or something like that, you might actually need to borrow features from multiple other HydraNets. What this looks like is: we have all of these different HydraNets for the different cameras, but then you might want to pull in some of the features from these HydraNets into a second round of processing, optionally recurrent, and actually produce something like a road layout prediction. So here's an example of what a road layout prediction might look like for the Autopilot. We are plugging three cameras simultaneously into a neural network, and the network's predictions are no longer in the image space; they are in the top-down space. So we're looking at the predictions of this network. In particular, here we are showing some of the predictions related to the corridors that are available in this parking lot, where the intersections are, and what the orientations of all of these things are. And so the stitching up now doesn't happen in a C++ codebase; the stitching up across space and time happens inside the recurrent neural network. More generally, what our networks are starting to look like, and what we're converging on for all of these different tasks, is something like this: we have eight HydraNets for the eight cameras, and they all produce all kinds of intermediate predictions. But in addition, the features from these HydraNets go into a second round of processing, potentially recurrent, and then we have more outputs that are in a top-down view. What's special about this is that this is, of course, a pretty large single network, and every single task subsamples parts of this network and trains just that small piece. So, for example, we can be training the object detector on all of the cameras, or we can be training a depth network, or we can be training our layout network. All these tasks subsample the graph and train only that portion. And if you've trained recurrent neural networks on videos, you'll quickly notice that these are not trivial training workloads.
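To illustrate where that "second round of processing" sits, here is a rough PyTorch sketch that pools per-camera backbone features, fuses them, and runs a recurrent module that predicts a coarse top-down grid. Everything here is an assumption made for the example: the module names, feature sizes, the GRU-based fusion, and the 32x32 output grid are not from the talk.

```python
# A rough, hypothetical sketch of cross-camera, cross-time fusion:
# per-camera trunk features -> pooled vectors -> recurrent fusion -> top-down grid.
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    def __init__(self, num_cameras=8, feat_dim=256, hidden_dim=512, bev_channels=16):
        super().__init__()
        self.bev_channels = bev_channels
        self.reduce = nn.Conv2d(2048, feat_dim, kernel_size=1)      # per-camera reduction
        self.rnn = nn.GRU(num_cameras * feat_dim, hidden_dim, batch_first=True)
        self.bev_head = nn.Linear(hidden_dim, bev_channels * 32 * 32)  # coarse top-down grid

    def forward(self, camera_features):
        # camera_features: list over timesteps of lists over cameras of [B, 2048, H, W] maps
        steps = []
        for per_camera in camera_features:
            pooled = [self.reduce(f).mean(dim=(2, 3)) for f in per_camera]  # [B, feat_dim] each
            steps.append(torch.cat(pooled, dim=1))
        sequence = torch.stack(steps, dim=1)        # [B, T, num_cameras * feat_dim]
        hidden, _ = self.rnn(sequence)
        bev = self.bev_head(hidden[:, -1])           # predict from the last timestep
        return bev.view(-1, self.bev_channels, 32, 32)
```

Pooling each camera's feature map down to a single vector is the crudest possible fusion; the sketch only shows that the cross-camera, cross-time stitching happens inside the network rather than in downstream C++ code.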
As an example, if I want to unroll this graph in time and backpropagate through time: say we have eight cameras, we unroll for 16 timesteps, and we use a batch size of 32. Then we are holding in memory 4,096 images and all of their activations in a single forward pass. So very quickly your typical distributed data parallel setup will break, because you can't hold this amount of activations in the memory of a single GPU, or even a single node. A lot of our training therefore has to combine elements of distributed data parallelism with model parallelism and so on.
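The memory figure above is simple arithmetic, and one common way to cope is to split a model across devices within a process while still using data parallelism across processes. The sketch below shows both; the two-GPU split and the module names are hypothetical, not a description of the actual training setup.

```python
# The memory pressure, as arithmetic, plus a hypothetical hybrid of
# model parallelism (within a process) and DistributedDataParallel (across processes).
cameras, timesteps, batch_size = 8, 16, 32
images_in_flight = cameras * timesteps * batch_size
print(images_in_flight)  # 4096 images (and their activations) in a single forward pass

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class SplitModel(nn.Module):
    """Put the per-camera backbones on one GPU and the fusion part on another."""
    def __init__(self, backbone, fusion):
        super().__init__()
        self.backbone = backbone.to("cuda:0")
        self.fusion = fusion.to("cuda:1")

    def forward(self, x):
        feats = self.backbone(x.to("cuda:0"))
        return self.fusion(feats.to("cuda:1"))

# model = DDP(SplitModel(backbone, fusion))  # multi-device module: leave device_ids unset
```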

[00:06:50] Andrej Karpathy: It also gets kind of tricky in terms of training these networks, because the simplest approach might be round-robin training of the different tasks: all of the workers in the pool train task one, then task two, task three, et cetera. That gets out of hand when you have 100 tasks. So instead, what might make sense is to have a pool of tasks: some of the workers are doing objects, some of the workers are doing road layout, some of the workers might be doing depth, and so on. These are all very heterogeneous workloads, but they coexist and they're training different pieces of the network at the same time. And then you can arrange them in a synchronous or asynchronous way, or play with this to really squeeze all the juice out of it. All in all, training all of the neural networks for the Autopilot is actually a fairly expensive task. In particular, today we would train 48 different networks that make 1,000 different predictions, if you just count the number of output tensors, and it takes 70,000 GPU hours to compile the Autopilot, at least the neural network stack. So if you had a single node with eight GPUs, you would be training for a year. It's a lot of networks and a lot of predictions, and all of them must work, and none of them can ever regress. And you're not just training this once; you have to iterate on it. Of course, there are researchers and engineers on the team who actually have to improve on these things. So, as you can imagine, we do a lot of neural network training at scale to get this to actually work.
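As a toy illustration of the "pool of tasks" idea, the sketch below assigns each distributed worker one task from a pool and trains only that task's subgraph. It reuses the hypothetical HydraNet from the earlier sketch; the task list, assignment rule, and loss are placeholders, not the real training system.

```python
# Hypothetical task-pool assignment: each worker (rank) trains one task's subgraph.
import torch.nn.functional as F

TASK_POOL = ["objects", "road_layout", "depth", "traffic_lights"]

def assign_task(rank: int) -> str:
    # Ranks cycle through the pool; a real scheduler would weight by task cost.
    return TASK_POOL[rank % len(TASK_POOL)]

def train_step(model, optimizer, images, targets, task):
    # Run only this task's head; gradients flow through the shared backbone
    # and that head, so each worker trains just its piece of the larger graph.
    prediction = model(images, tasks=[task])[task]
    loss = F.mse_loss(prediction, targets)   # placeholder loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```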

[00:08:09] Andrej Karpathy: We are also automating a lot of the workflows. It's not just about the neural network training itself, but everything surrounding it. In particular, we have to calibrate all the different thresholds, and we have a process for that. We have a lot of in-the-loop validation, other types of validation, and evaluation to make sure that none of these 1,000 different predictions that we make can regress, and so on. The North Star for the team is that all of this can actually be automated quite well: starting with the dataset, you can train all the neural networks, do all the calibration and the evaluation, and you can really start to see the continuous integration of this. And so the North Star for the team is something we internally, somewhat jokingly, refer to as Operation Vacation. The idea is that as long as the data labeling team is around, curating and improving our datasets, everything else can in principle be automated, and so we could actually go on vacation and the Autopilot improves by default. That's something that we really try to work towards in the team.
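As a minimal sketch of the kind of per-prediction regression gate implied by "none of these 1,000 predictions can regress," one could compare a candidate model's evaluation metrics against the currently shipped ones and block promotion if any task got worse. The metric names, numbers, and tolerance below are invented for illustration.

```python
# Toy regression gate: refuse to promote a model if any task's metric dropped.
def passes_regression_gate(current_metrics: dict, candidate_metrics: dict,
                           tolerance: float = 0.0) -> bool:
    for task, current_score in current_metrics.items():
        if candidate_metrics.get(task, float("-inf")) < current_score - tolerance:
            return False   # this task regressed; block the release
    return True

current = {"objects_mAP": 0.71, "road_layout_iou": 0.63, "depth_accuracy": 0.88}
candidate = {"objects_mAP": 0.72, "road_layout_iou": 0.64, "depth_accuracy": 0.87}
print(passes_regression_gate(current, candidate))  # False: the depth metric dropped
```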

[00:09:01] Andrej Karpathy: I would like to also talk a little bit about the inference aspect of this, because so far I've talked quite a bit about training. As I mentioned, we have our own backend that our hardware team has developed; we call it the FSD computer. It offers about 144 INT8 TOPS of capability. Compared to the GPUs that we were using before we introduced this chip, this is roughly an order of magnitude improvement at lower cost. So we use this in all the latest cars that are now coming out of the production line, and we target all of the neural networks to these chips. The last thing I wanted to briefly allude to: as you'll notice here on the bottom, we have a GPU cluster. The hardware team is also working on a project we call Dojo, and Dojo is a neural network training computer and chip. We hope to do the exact same thing for training as we did for inference: improve the efficiency by roughly an order of magnitude at lower cost. But I'm not ready to talk about more details on that project just yet.

[00:09:57] Andrej Karpathy: So in summary, I talked about the full lifecycle of developing these neural networks for the Autopilot and how we own everything in house. The neural network is actually fairly complicated and large, and we deal with a lot of problems if we actually want to train this beast. But it's giving us some very interesting results, and the nice thing is that not only do we get to train really awesome large networks, we also get to ship them. So, for example, Navigate on Autopilot has now accumulated one billion miles, we've confirmed 200,000 lane changes, and this is a global product across 50 countries or more now. That's a lot of forward passes of neural networks out there. And with Smart Summon, this is actually a bit of an outdated number, we've now had 800,000 sessions of people trying to call their car to them. So it's incredible to work on such an interesting product. Finally, I would like to thank the PyTorch team for being incredibly responsive and helpful, and for allowing us to develop all these networks, train them at scale, and deploy them in the real world. It's been a really interesting collaboration.

[00:11:01] Andrej Karpathy: Thank you.