Hadoop: A Framework for Data-Intensive Distributed Computing – part I

April 23, 2013


Understanding what "Big Data" is

Dealing with Big Data requires inexpensive, reliable storage and new tools for analyzing structured and unstructured data. Today, we are surrounded by big data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. We live in the data age. It is not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the digital universe at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one million petabytes, or one billion terabytes; that is roughly the same order of magnitude as one disk drive for every person in the world.

Big Data, then, is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.

The exponential growth of data presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate for processing such large data sets. Google was the first to publicize MapReduce, a system it had used to scale its data processing needs. This system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn't feasible for everyone to reinvent their own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system, called Hadoop. Soon after, Yahoo rallied around to support the effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter, and many more traditional businesses, such as media and telecom, are beginning to adopt it too. Hadoop is thus an open source framework for writing and running distributed applications that process large amounts of data. It distributes and parallelizes data processing across many nodes in a compute cluster, speeding up large computations and hiding I/O latency through increased concurrency. It is well suited to large-scale data processing tasks such as searching and indexing huge data sets.


Hadoop includes the Hadoop Distributed File System (HDFS) and MapReduce. Since it is not possible to store such large amounts of data on a single node, Hadoop uses a file system called HDFS that splits data into many smaller parts and distributes each part redundantly across multiple nodes. MapReduce is a software framework for the analysis and transformation of very large data sets, and Hadoop uses it for distributed computation. MapReduce programs are inherently parallel: Hadoop takes advantage of the data distribution by pushing the work involved in an analysis out to many different servers, each of which runs the analysis on its own blocks of the file. The results are then combined into a single result after each piece has been analyzed. The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
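To make this concrete, below is the classic introductory MapReduce example, word count, written against Hadoop's Java MapReduce API. It is only a minimal sketch: the class names follow the conventional tutorial example, and the input and output are assumed to be HDFS directories passed on the command line. The mapper runs where the data blocks live and emits (word, 1) pairs; the reducer combines all the counts for each word into a single total.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the HDFS blocks and emits (word, 1)
  // for every word found in its split of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives every count emitted for a given word, from all mappers,
  // and combines them into a single total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; the framework itself handles
  // scheduling, data locality, monitoring, and re-running failed tasks.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged into a JAR, a job like this is typically launched with the hadoop jar command; from that point on the framework takes care of scheduling the map and reduce tasks across the cluster, monitoring them, and re-executing any that fail.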


This article takes a detailed look at the Hadoop architecture and the working of its components. It then focuses on how data replicas are managed in the Hadoop Distributed File System for better performance and high data availability for highly parallel distributed Hadoop applications. It also takes into account the different failures that can affect a Hadoop system and the various failover mechanisms for handling those failures.


 1.1. What is Hadoop?

Consider the example of Facebook. By 2011 Facebook's data had grown to 15 TB per day, and in the future it will produce data of a much higher magnitude. Facebook has many web servers and huge MySQL servers (profiles, friends, etc.) to hold its user data.

It then needs to run various reports on this huge volume of data: for example, the ratio of men to women users over a period, or the number of users who commented on a particular day. The solution to this requirement was a set of scripts written in Python that used ETL processes, but as the data grew to this size those scripts no longer worked.

Hence their main aim at this point was to handle data warehousing, and their home-grown solutions were not working. This is when Hadoop came into the picture.

 Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. 
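As a sketch of how one of the Facebook-style reports above could be expressed in this framework, the mapper below counts users by gender. The record layout it assumes (a tab-separated line whose third field is the gender) is purely hypothetical and only for illustration, as is the class name; the reduce step would be the same sum-the-ones pattern as the IntSumReducer in the earlier word-count sketch.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the "men vs. women users" report. It assumes each
// input line is a tab-separated user record whose third field is the gender;
// the field position and file layout are illustrative, not Facebook's schema.
public class GenderCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text gender = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    if (fields.length > 2) {
      gender.set(fields[2]);        // e.g. "male" or "female"
      context.write(gender, one);   // emit (gender, 1) for each user record
    }
  }
}

A sum reducer then totals the counts per gender across the whole data set, and the ratio can be computed from the two totals.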

To be continued...
