26 January 2014

Interview: Linus Torvalds - "I don't read code any more"


(This was originally published in The H Open in November 2012.)

I was lucky enough to interview Linus quite early in the history of Linux – back in 1996, when he was still living in Helsinki (you can read the fruits of that meeting in this old Wired feature). It was at an important moment for him, both personally – his first child was born at this time – and in terms of his career. He was about to join the chip design company Transmeta, a move that didn't really work out, but led to him relocating to America, where he remains today.

That makes his trips to Europe somewhat rare, and I took advantage of the fact that he was speaking at the recent LinuxCon Europe 2012 in Barcelona to interview him again, reviewing the key moments for the Linux kernel and its community since we last spoke.

Glyn Moody: Looking back over the last decade and a half, what do you see as the key events in the development of the kernel?

Linus Torvalds: One big thing for me is all the scalability work that we did. We've gone from being OK on 2 or 4 CPUs to the point where basically you can throw 4000 [at it] – you won't scale perfectly, but most of the time it's not the kernel that's the bottleneck. If your workload is somewhat sane we actually scale really well. And that took a lot of effort.

SGI in particular worked a lot on scaling past a few hundred CPUs. Their initial patches could just not be merged. There was no way we could take the work they did and use it on a regular PC because they added all this infrastructure to work on thousands of CPUs. That was way too expensive to do when you had only a couple.

I was afraid for the longest time that we would have the high-performance kernel for the big machines, and the source code would be separate from the normal kernel. People worked a lot on just making sure that we had a clean code base where you can say at compile time that, hey, I want the kernel that works for 4000 CPUs, and it generates the code for that, and at the same time, if you say no, I want the kernel that works on 2 CPUs, the same source code compiles.

It was something that in retrospect is really important because it actually made the source code much better. All the effort that SGI and others spent on unifying the source code, actually a lot of it was clean-up – this doesn't work for a hundred CPUs, so we need to clean it up so that it works. And it actually made the kernel more maintainable. Now on the desktop 8 and 16 CPUs are almost common; it used to be that we had trouble scaling to an 8, now it's like child's play.

But there's been other things too. We spent years again at the other end, where the phone people were so power conscious that they had ugly hacks, especially on the ARM side, to try to save power. We spent years doing power management in general, doing the kind of same thing - instead of having these specialised power management hacks for ARM, and the few devices that cellphone people cared about, we tried to make it across the kernel. And that took like five years to get our power management working, because it's across the whole spectrum.

Quite often when you add one device, that doesn't impact any of the rest of the kernel, but power management was one of those things that impacts all the thousands of device drivers that we have. It impacts core functionality, like shutting down CPUs, it impacts schedulers, it impacts the VM, it impacts everything.

It not only affects everything, it has the potential to break everything which makes it very painful. We spent so much time just taking two steps forward, one step back because we made an improvement that was a clear improvement, but it broke machines. And so we had to take the one step back just to fix the machines that we broke.

Realistically, every single release, most of it is just driver work. Which is kind of boring in the sense there is nothing fundamentally interesting in a driver, it's just support for yet another chipset or something, and at the same time that's kind of the bread and butter of the kernel. More than half of the kernel is just drivers, and so all the big exciting smart things we do, in the end it pales when compared to all the work we just do to support new hardware.

Glyn Moody: What major architecture changes have there been to support new hardware?

Linus Torvalds: The USB stack has basically been re-written a couple of times just because some new use-case comes up and you realise that hey, the original USB stack just never took that into account, and it just doesn't work. So USB 3 needs new host controller support and it turns out it's different enough that you want to change the core stack so that it can work across different versions. And it's not just USB, it's PCI, and PCI becomes PCIe, and hotplug comes in.

That's another thing that's a huge difference between traditional Linux and traditional Unix. You have a [Unix] workstation and you boot it up, and it doesn't change afterwards - you don't add devices. Now people are taking adding a USB device for granted, but realistically that did not use to be the case. That whole being able to hotplug devices, we've had all these fundamental infrastructure changes that we've had to keep up with.

Glyn Moody: What about the kernel community – how has that evolved?

Linus Torvalds: It used to be way flatter. I don't know when the change happened, but it used to be me and maybe 50 developers - it was not a deep hierarchy of people. These days, patches that reach me sometimes go through four levels of people. We do releases every three months; in every release we have like 1000 people involved. And 500 of the 1000 people basically send in a single line change for something really trivial – that's how some people work, and some of them never do anything else, and that's fine. But when you have a thousand people involved, especially when some of them are just these drive-by shooting people, you can't have me just taking patches from everybody individually. I wouldn't have time to interact with people.

Some people just specialise in drivers, they have other people who they know who specialise in that particular driver area, and they interact with the people who actually write the individual drivers or send patches. By the time I see the patch, it's gone through these layers, it's seldom four, but it's quite often two people in between.

Glyn Moody: So what impact does that have on your role?

Linus Torvalds: Well, the big thing is I don't read code any more. When a patch has already gone through two people, at that point, I can either look at the patch and say: no, all your work was wasted, and micromanage at that level – and quite frankly I don't want to do that, and I don't have the capacity to do that.

So most of the time, when it comes to the major subsystem maintainers, I trust them because I've been working with them for 5, 10, 15 years, so I don't even look at the code. They tell me these are the changes and they give me a very high-level overview. Depending on the person, it might be five lines of text saying this is roughly what has changed, and then they give me a diffstat, which just says 15 lines have changed in that file, and 25 lines have changed in that file and diffstat might be a few hundred lines because there's a few hundred files that have changed. But I don't even see the code itself, I just say: OK, the changes happen in these files, and by the way, I trust you to change those files, so that's fine. And then I just say: I'll take it.

Glyn Moody: So what's your role now?

Linus Torvalds: Largely I'm managing people. Not in the logistical sense – I obviously don't pay anybody, but I also don't have to worry about them having access to hardware and stuff like that. Largely what happens is I get involved when people start arguing and there's friction between people, or when bugs happen.

Bugs happen all the time, but quite often people don't know who to send the bug report to. So they will send the bug report to the Linux Kernel mailing list – nobody really is able to read it much. After people don't figure it out on the kernel mailing list, they often start bombarding me, saying: hey, this machine doesn't work for me any more. And since I didn't even read the code in the first place, but I know who is in charge, I end up being a connection point for bug reports and for the actual change requests. That's all I do, day in and day out, is I read email. And that's fine, I enjoy doing it, but it's very different from what I did.

Glyn Moody: So does that mean there might be scope for you to write another tool like Git, but for managing people, not code?

Linus Torvalds: I don't think we will. There might be some tooling, but realistically most of the things I do tend to be about human interaction. So we do have tools to figure out who's in charge. We do have tools to say: hey, we know the problem happens in this area of the code, so who touched that code last, and who's the maintainer of that subsystem, just because there are so many people involved that trying to keep track of them any other way than having some automation just doesn't work. But at the same time most of the work is interaction, and different people work in different ways, so having too much automation is actually painful for people.

We're doing really well. The kind of pain points we had ten years ago just don't exist any more. And that's largely because we used to be this flat hierarchy, and we just fixed our tools, we fixed our workflows. And it's not just me, it's across the whole kernel there's no single person who's in the way of any particular workflow.

I get a fair amount of email, but I don't even get overwhelmed by email. I love reading email on my cellphone when I travel, for example. Even during breaks, I'll read email on my cellphone because 90% of them I can just read for my information that I can archive. I don't need to do anything, I was cc'd because there was some issue going on, I need to be aware of it, but I don't need to do anything about that. So I can do 90% of my work while travelling, even without having a computer. In the evening, when I go back to the hotel room, I'll go through [the other 10%].

Glyn Moody: 16 years ago, you said you were mostly driven by what the outside world was asking for; given the huge interest in mobiles and tablets, what has been their impact on kernel development?

Linus Torvalds: In the tablet space, the biggest issue tends to be power management, largely because they're bigger than phones. They have bigger batteries, but on the other hand people expect them to have longer battery life and they also have bigger displays, which use more battery. So on the kernel side, a tablet from the hardware perspective and a usage perspective is largely the same thing as a phone, and that's something we know how to do, largely because of Android.

The user interface side of a tablet ends up being where the pain points have been – but that's far enough removed from the kernel. On a phone, the browser is not a full browser - they used to have the mobile browsers; on the tablets, people really expect to have a full browser – you have to be able to click that small link thing. So most of the tablet issues have been in the user space. We did have a lot of issues in the kernel over the phones, but tablets kind of we got for free.

Glyn Moody: What about cloud computing: what impact has that had on the kernel?

Linus Torvalds: The biggest impact has been that even on the server side, but especially when it comes to cloud computing, people have become much more aware [of power consumption]. It used to be that all the power work originally happened for embedded people and cellphones, and just in the last three-four years it's the server people who have become very power aware. Because they have lots of them together; quite often they have high peak usage. If you look at someone like Amazon, their peak usage is orders of magnitude higher than their regular idle usage. For example, just the selling side of Amazon, late November, December, the one month before Christmas, they do as much business as they do the rest of the year. The point is they have to scale all their hardware infrastructure for the peak usage that most of the rest of the year they only use a tenth of that capacity. So being able to not use power all the time [is important] because it turns out electricity is a big cost of these big server providers.

Glyn Moody: Do Amazon people get involved directly with kernel work?

Linus Torvalds: Amazon is not the greatest example, Google is probably better because they actually have a lot of kernel engineers working for them. Most of the time the work gets done by Google themselves. I think Amazon has had a more standard components thing. Actually, they've changed the way they've built hardware - they now have their own hardware reference design. They used to buy hardware from HP and Dell, but it turns out that when you buy 10,000 machines at some point it's just easier to design the machines yourself, and to go directly to the original equipment manufacturers and say: I want this machine, like this. But they only started doing that fairly recently.

I don't know whether [Amazon] is behind the curve, or whether Google is just more technology oriented. Amazon has worked more on the user space, and they've used a fairly standard kernel. Google has worked more on the kernel side, they've done their own file systems. They used to do their own drivers for their hard discs because they had some special requirements.

Glyn Moody: How useful has Google's work on the kernel been for you?

Linus Torvalds: For a few years - this is five or ten years ago - Google used to be this black hole. They would hire kernel engineers and they would completely disappear from the face of the earth. They would work inside Google, and nobody would ever hear from them again, because they'd do this Google-specific stuff, and Google didn't really feed back much.

That has improved enormously, probably because Google stayed a long time on our previous 2.4 releases. They stayed on that for years, because they had done so many internal modifications for their specialised hardware for everything, that just upgrading their kernel was a big issue for them. And partly because of the whole Android project they actually wanted to be much more active upstream.

Now they're way more active, people don't disappear there any more. It turns out the kernel got better, to the point where a lot of their issues just became details instead of being huge gaping holes. They were like, OK, we can actually use the standard kernel and then we do these small tweaks on top instead of doing these big surgeries to just make it work on their infrastructure.

Glyn Moody: Finally, you say that you spend most of your time answering email: as someone who has always seemed a quintessential hacker, does that worry you?

Linus Torvalds: I wouldn't say that worries me. I end up not doing as much programming as sometimes I'd like. On the other hand, it's like some kinds of programming I don't want to do any more. When I was twenty I liked doing device drivers. If I never have to do a single device driver in my life again, I will be happy. Some kind of headaches I can do without.

I really enjoyed doing Git, it was so much fun. When I started the whole design, started doing programming in user space, which I had not done for 15 years, it was like, wow, this is so easy. I don't need to worry about all these things, I have infinite stack, malloc just works. But in the kernel space, you have to worry about locking, you have to worry about security, you have to worry about the hardware. Doing Git, that was such a relief. But it got boring.

The other project I still am involved in is the dive computer thing. We had a break-in on the kernel.org site. It was really painful for the maintainers, and the FBI got involved just figuring out what the hell happened. For two months we had almost no kernel development – well, people were still doing kernel development, but the main site where everybody got together was down, and a lot of the core kernel developers spent a lot of time checking that nobody had actually broken into their machines. People got a bit paranoid.

So for a couple of months my main job, which was to integrate work from other people, basically went away, because our main integration site went away. And I did my divelog software, because I got bored, and that was fun. So I still do end up doing programming, but I always come back to the kernel in the end.
