14 November 2007

Yahoo! Goes Whoop! About Hadoop! (and Pig!)

Now why on earth would Yahoo be doing this?

Yahoo! Inc., a leading global Internet company, today announced that it will be the first in the industry to launch an open source program aimed at advancing the research and development of systems software for distributed computing. Yahoo!'s program is intended to leverage its leadership in Hadoop, an open source distributed computing sub-project of the Apache Software Foundation, to enable researchers to modify and evaluate the systems software running on a 4,000 processor supercomputer provided by Yahoo!. Unlike other companies and traditional supercomputing centers, which focus on providing users with computers for running applications and for coursework, Yahoo!'s program focuses on pushing the boundaries of large-scale systems software research.

Currently, academic researchers lack the hardware and software infrastructure to support Internet-scale systems software research. To date, Yahoo! has been the primary contributor to Hadoop, an open source distributed file system and parallel execution environment that enables its users to process massive amounts of data. Hadoop has been adopted by many groups and is the software of choice for supporting university coursework in Internet-scale computing. Researchers have been eager to collaborate with Yahoo! and tap the company's technical leadership in Hadoop-related systems software research and development.

As a key part of the program, Yahoo! intends to make Hadoop available in a supercomputing-class data center to the academic community for systems software research. Called the M45, Yahoo!'s supercomputing cluster, named after one of the best known open star clusters, has approximately 4,000 processors, three terabytes of memory, 1.5 petabytes of disks, and a peak performance of more than 27 trillion calculations per second (27 teraflops), placing it among the top 50 fastest supercomputers in the world.

M45 is expected to run the latest version of Hadoop and other state-of-the-art, Yahoo!-supported, open-source distributed computing software such as the Pig parallel programming language developed by Yahoo! Research, the central advanced research organization of Yahoo! Inc.

It's cool that Yahoo's backing the open source Hadoop, and doubly cool that one of the projects is called Pig. But it's also shrewd. It's becoming abundantly clear that open beats closed; Google, for all its use of open source software, is remarkably closed at its core. Enter Hadoop, running on a 4,000 processor supercomputer provided by Yahoo, with the real possibility of spawning a truly open rival to Google.... (Via Matt Asay.)


Anonymous said...

Google's already working with IBM to teach courses using Hadoop, which after all is based on work published by Google. Yahoo is not doing it because open beats closed. They're doing it just to stay in Google's rear-view mirror. MapReduce is a remarkably elegant concept (and in the public domain), so in my book, it's still Google that's pushing the field here.

Justin Mason said...

whoa. Maybe they'll let us use it in Apache SpamAssassin -- sounds like I need to rewrite mass-check to run on Hadoop... ;)

fwiw, I can see the point -- right now, in many fields, it seems you're either first and biggest (MapReduce), or you go for open and hope that brings additional momentum from outside (Hadoop).

Glyn Moody said...


Anonymous said...

Very true. Actually Yahoo does a great job of recognizing the value of Open Source projects (thanks to folks like Jeremy Zawodny).

There are tons of great Hadoop deployment examples out there, especially on Amazon's EC2 (if you happen to need a few hundred processors).