Yahoo champions open source project Hadoop
Tuesday, November 13th, 2007
With the advent of more and more open source technology, Yahoo has gotten into the game with their sponsorship of Hadoop, a project which mimics some of Google’s techniques for storing and processing vast amounts of data across thousands of commodity PCs. While Google has understandably shied away from releasing their code as open source, Yahoo and Hadoop are setting themselves up to replicate Google’s success without access to Google’s code.
For the less tech-savvy, open source technology or software is that in which the original code is made public. This allows programmers to manipulate and customize software for their needs. Some of the best-known open source programs include Mozilla Firefox (a fully-customizable web browser), WordPress (a popular blogging tool), and OpenOffice (an office suite with word processing, spreadsheet, and multimedia capabilities). These tools are often free, and their flexibility makes them useful to programmers and casual users alike.
Hadoop is an open source platform which gives programmers the ability to write and run applications that replicate the technology which has made Google a success. It consists of a distributed file system known as the “Hadoop Distributed File System”, which is similar to the Google File System. It also implements MapReduce, a programming model in which a job is divided into small pieces of work, each of which can be processed by any node in the cluster of commodity PCs. This distributed computing environment allows compute-intensive tasks to be divided and assigned to individual computers in a cluster. In the case of a search index, this would mean thousands of computers are each assigned to index a smaller piece of data, and then the results are sorted and merged to create a usable data set.
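To make the search index example concrete, here is a minimal sketch of the divide, sort, and merge idea, run in a single Python process. It is not Hadoop's API (Hadoop jobs are written against its own Java interfaces); the function names `map_phase`, `shuffle`, `reduce_phase`, and `build_index`, as well as the sample pages, are made up for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word in a single document."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Merge the pages seen for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def build_index(documents):
    # On a real cluster each document (or chunk) would be mapped on a
    # different machine; here the map calls simply run one after another.
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    groups = shuffle(pairs)
    # Sorting the keys mirrors the "sorted and merged" step in the text.
    return dict(reduce_phase(w, ids) for w, ids in sorted(groups.items()))

# Hypothetical sample data, standing in for crawled pages.
documents = {"page1": "open source search", "page2": "open data"}
index = build_index(documents)
```

In this sketch, `index["open"]` lists both pages, while `index["data"]` lists only `page2`. The point is that each map call looks at one document in isolation, which is what lets the work be spread over many machines.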
While Google has not released code for their file system or MapReduce tools, they have published academic papers on their technology, and they must have realized that someone would make an open source product available based on their innovations. The fact that Yahoo is involved is no surprise, since they have the most to gain from piggy-backing on this kind of technology. What is surprising, however, is that Yahoo has not yet applied the system to their web crawl data. They are said to be using Hadoop for other purposes, such as market research and product development. For example, Hadoop might be used to look for snippets of code (like that used to display a Flickr or Digg badge) instead of locating keywords and links.
Hadoop’s website makes the following claims for its usefulness:
- Scalable: Hadoop can reliably store and process petabytes.
- Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
- Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
- Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
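The "Efficient" and "Reliable" claims can be illustrated with a toy model: keep several copies of each block of data, run each task on a node that already holds the block, and move the task to another copy's node when a machine fails. The sketch below is only an illustration under simple assumptions; the round-robin placement, the node names, and the functions `place_replicas` and `schedule` are hypothetical and do not describe Hadoop's actual placement or scheduling policy.

```python
def place_replicas(block_ids, nodes, copies=3):
    """Assign each block to `copies` nodes, round-robin.

    Assumes copies <= len(nodes), so the holders of a block are distinct.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(copies)]
    return placement

def schedule(placement, failed):
    """For each block, pick the first replica holder that is still alive."""
    assignment = {}
    for block, holders in placement.items():
        alive = [n for n in holders if n not in failed]
        if not alive:
            raise RuntimeError(f"all replicas of {block} are lost")
        assignment[block] = alive[0]
    return assignment

# Hypothetical four-node cluster holding three blocks, two copies each.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["b0", "b1", "b2"], nodes, copies=2)

before = schedule(placement, failed=set())       # every node healthy
after = schedule(placement, failed={"node-b"})   # node-b has gone down
```

With all nodes healthy, the task for block `b1` runs on `node-b`; once `node-b` is marked as failed, the same task is assigned to `node-c`, which holds the second copy. No data has to be moved to recover.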
It seems that Yahoo agrees with the principles of the project, and the way in which they are implementing these techniques could be called innovative. While Yahoo and Hadoop certainly have a reason to “copycat” the systems pioneered at Google, they are doing so in a way that makes the technology available to programmers everywhere; it’s certainly a little less distasteful than if they were plagiarizing without sharing. Open source software has proven its competitiveness in the market, and it’s only a matter of time before users begin to demand a higher level of customization in every application they employ. Hadoop is making the first move towards an open source model in data distribution and storage, and Yahoo is smart to take advantage of it.
By Haley January Eckels




