Hadoop: A Framework for Data-Intensive Distributed Computing – part II


1.1. What is Hadoop?
Consider the example of Facebook. Facebook's data had grown to about 15 TB per day by 2011 and was expected to reach a much higher magnitude in the future. The company ran many web servers and large MySQL servers (profiles, friends and so on) to hold the user data.
Various reports then had to be run over this huge data set, for example the ratio of men to women users over a period, or the number of users who commented on a particular day. Originally these requirements were met with Python scripts implementing ETL processes, but as the data grew to this size the scripts no longer worked.

Hence their main aim at this point was data warehousing at this scale, and their homegrown solutions were not working. This is when Hadoop came into the picture.

Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:

• Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
• Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
• Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple: Hadoop allows users to quickly write efficient parallel code.

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. Hadoop's accessibility and simplicity give it an edge over writing and running large distributed programs. On the other hand, its robustness and scalability make it suitable for even the most demanding jobs at Yahoo! and Facebook. These features make Hadoop popular in both academia and industry.

2. Hadoop System Architecture:
The Hadoop system architecture consists of the following components:

HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. HBase is not a direct replacement for a classic SQL database, although its performance has improved recently, and it now serves several data-driven websites, including Facebook's Messaging Platform.
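
As a quick illustration of the point-query (random read) style of access, here is a minimal sketch using the third-party happybase Python client, which talks to HBase through its Thrift gateway; the host, table name, and column family below are assumptions made up for the example.

```python
import happybase

# Connect to an HBase Thrift gateway (assumed to be running on localhost:9090).
connection = happybase.Connection("localhost", port=9090)

# `users` and the column family `info` are hypothetical names for this sketch.
table = connection.table("users")

# Write a single row, then read it back with a point query (random read).
table.put(b"user:1001", {b"info:name": b"Alice", b"info:gender": b"f"})
row = table.row(b"user:1001")
print(row[b"info:name"])  # b'Alice'

connection.close()
```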

Figure 1. Hadoop System Architecture
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
MapReduce: A functional programming paradigm that is well suited to parallel processing of huge data sets distributed across a large number of computers; in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps: a map step that transforms each input record into intermediate key/value pairs, and a reduce step that aggregates all values sharing the same key.
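
To make the two steps concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets the mapper and reducer be plain scripts reading key/value pairs from standard input; the file and directory names in the comment are illustrative.

```python
#!/usr/bin/env python
"""Word count for Hadoop Streaming: run with `map` or `reduce` as the argument.

Illustrative invocation (paths are placeholders):
  hadoop jar hadoop-streaming.jar \
      -input /data/books -output /data/wordcount \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
"""
import sys


def do_map():
    # Map step: emit one (word, 1) pair per word on each input line.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))


def do_reduce():
    # Reduce step: Hadoop sorts and groups by key, so identical words arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))


if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()
```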
Avro: The serialization framework created by Doug Cutting, the creator of Hadoop. With Avro we can store data and read it back easily from various programming languages. It is optimized to minimize the disk space needed by our data, and it is flexible: after adding or removing fields from our schema, we can still read files written before the change.
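
As an illustrative sketch of writing and reading Avro data (using the third-party fastavro Python package; the schema and file name are made up for the example):

```python
from fastavro import writer, reader, parse_schema

# Hypothetical record schema; Avro schemas are defined in JSON.
schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "gender", "type": "string"},
        # A default value lets older files written without this field still be read.
        {"name": "age", "type": "int", "default": -1},
    ],
})

records = [{"name": "Alice", "gender": "f", "age": 30},
           {"name": "Bob", "gender": "m", "age": 25}]

# Write a compact binary Avro file, then read the records back.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```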

Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides an SQL-based query language (HiveQL) for querying the data. Hive code looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. The second is that Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
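
As a sketch of what querying Hive looks like from application code (using the third-party PyHive package and assuming a HiveServer2 instance on localhost; the table and columns are invented for the example), a query might be issued like this:

```python
from pyhive import hive

# Connect to a HiveServer2 instance (host, port, and database are assumptions).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL reads like SQL, but is compiled into MapReduce jobs, so expect
# minutes of latency rather than the milliseconds of an OLTP database.
cursor.execute("SELECT gender, COUNT(*) AS n FROM users GROUP BY gender")
for gender, n in cursor.fetchall():
    print(gender, n)

conn.close()
```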

Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. Pig was initially developed at Yahoo! to let people using Hadoop focus more on analyzing large data sets and spend less time writing mapper and reducer programs. Pig is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.
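
To give a feel for Pig Latin, here is a minimal sketch that writes a small script to a temporary file and runs it in Pig's local mode; the input file and its schema are hypothetical, and the wrapper is in Python only to stay consistent with the other examples.

```python
import subprocess
import tempfile
import textwrap

# Hypothetical Pig Latin script: count users per gender in a local TSV file.
script = textwrap.dedent("""\
    users     = LOAD 'users.tsv' AS (name:chararray, gender:chararray);
    by_gender = GROUP users BY gender;
    counts    = FOREACH by_gender GENERATE group, COUNT(users);
    DUMP counts;
""")

with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as f:
    f.write(script)
    script_path = f.name

# `-x local` runs Pig against the local filesystem instead of an HDFS cluster.
subprocess.run(["pig", "-x", "local", script_path], check=True)
```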

Chukwa: A data collection and analysis framework that works with Hadoop to process and analyze the huge logs it generates. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is a highly flexible tool that makes log analysis, processing, and monitoring easier, especially when handling distributed systems like Hadoop.

ZooKeeper: An open source Apache project that provides a centralized service for maintaining configuration and synchronization information across a cluster, storing this information in local log files. A very large Hadoop cluster can be supported by multiple ZooKeeper servers (in this case, a master server synchronizes the top-level servers). Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information.
Within ZooKeeper, an application can create what is called a znode. The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode. Using this znode infrastructure, applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode, which then informs the rest of the cluster of a specific node's status change. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
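As a sketch of this znode pattern (using the third-party kazoo Python client and assuming a ZooKeeper server on localhost:2181; the paths and payloads are made up for the example), a node can publish its status and other nodes can watch it:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (the address is an assumption for this sketch).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Publish this worker's status in a znode; an ephemeral znode disappears if the
# session dies, which is handy for liveness tracking.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-1", b"ready", ephemeral=True)

# Any machine in the cluster can register a watch to be told about changes.
@zk.DataWatch("/app/workers/worker-1")
def on_status_change(data, stat):
    if data is not None:
        print("worker-1 status:", data.decode(), "version:", stat.version)

# Updating the znode notifies every registered watcher.
zk.set("/app/workers/worker-1", b"busy")

zk.stop()
```
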
Sqoop: Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. This is where Apache Sqoop fits in. Sqoop allows easy import and export of data between Hadoop and structured data stores such as relational databases and enterprise data warehouses.

What happens under the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced into partitions, and a map-only job is launched with individual mappers responsible for transferring one slice of the dataset. Each record is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.
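
As an illustrative sketch of a bulk import (the JDBC URL, credentials, table, and target directory below are placeholders, and the command is wrapped in Python only to stay consistent with the other examples):

```python
import subprocess

# Hypothetical import: pull a MySQL `users` table into HDFS with 4 parallel mappers,
# i.e. the map-only job described above with the dataset sliced into 4 partitions.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/profiles",
    "--username", "etl", "--password-file", "/user/etl/.pw",
    "--table", "users",
    "--target-dir", "/warehouse/users",
    "--num-mappers", "4",
], check=True)
```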

To be continued...

