TECHNICAL PAPERS

All You Need to Know About Apache Hive

Apache Hive is a data warehouse infrastructure that facilitates querying and managing large data sets residing in distributed storage. Hive is built on top of Hadoop and was originally developed at Facebook.

Hadoop includes the Hadoop Distributed File System (HDFS) and MapReduce. Because it is not practical to store a very large data set on a single node, Hadoop uses a distributed file system, HDFS, which splits the data into many smaller parts and distributes each part redundantly across multiple nodes. MapReduce is a software framework for the analysis and transformation of very large data sets; Hadoop uses MapReduce for distributed computation.
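The MapReduce model described above can be sketched in plain Python. This toy word count only illustrates the programming model (map, shuffle, reduce over independent input splits); it is not the Hadoop API:

```python
from collections import defaultdict

# Map phase: each input split is processed independently and
# emits (key, value) pairs -- here, (word, 1).
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: the framework groups all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each reducer aggregates the values for one key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" standing in for blocks of a file stored in HDFS.
splits = ["big data on hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(splits)))
print(counts["hadoop"])  # 2 -- each split is counted independently, then merged
```

In real Hadoop the map tasks run on the nodes that hold the data blocks, and the shuffle moves intermediate pairs across the network to the reducers.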

Hive manages data stored in HDFS and provides an SQL-based language for querying it, so it looks very much like a traditional database with SQL access. That query language is called HiveQL (Hive Query Language). Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
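To make that translation concrete, here is a rough Python sketch of what happens when Hive compiles a query like `SELECT dept, COUNT(*) FROM emp GROUP BY dept` into a MapReduce job. The table name, column names, and rows are invented for illustration:

```python
from collections import defaultdict

# A tiny "table" standing in for rows of a Hive table stored in HDFS.
emp = [
    {"name": "asha",  "dept": "eng"},
    {"name": "ravi",  "dept": "eng"},
    {"name": "meena", "dept": "hr"},
]

# GROUP BY dept becomes: the map phase emits (dept, 1) for each row...
mapped = [(row["dept"], 1) for row in emp]

# ...the framework shuffles the pairs by key...
shuffled = defaultdict(list)
for dept, one in mapped:
    shuffled[dept].append(one)

# ...and the reduce phase sums the ones, which is exactly COUNT(*).
result = {dept: sum(ones) for dept, ones in shuffled.items()}
print(result)  # {'eng': 2, 'hr': 1}
```

Joins, filters, and aggregations are each mapped to map-side or reduce-side work in a similar way, which is why a single HiveQL statement can expand into one or more full MapReduce jobs.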

Because Hive is based on Hadoop and MapReduce operations, there are several key differences between HiveQL and SQL. First, Hadoop is intended for long sequential scans, so you can expect Hive queries to have very high latency (many minutes). This means that Hive is not appropriate for applications that need very fast response times, as you would expect from a database such as DB2. Second, Hive is read-oriented and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

Suggested Read:
3 Things you Didn’t know Big Data could Do
Apache Hadoop – Introduction and architecture

[Figure: Architectural view of Hadoop. Hive comes under the Data Access layer of Hadoop.]

Welcome to the Apache Hive installation tutorial:

Step 1: Download and extract Hive from the official site

  • Create a directory named hive and download the Hive tar file from https://archive.apache.org/dist/hive/
  • I am using the stable version hive-2.1.0.
  • Right-click and extract, or run the command tar -xvf apache-hive-2.1.0-bin.tar.gz in a terminal inside the hive folder.

Step 2:

  • Set the path in your .bashrc file. Open it with:
gedit ~/.bashrc
# Set Hive install location
export HIVE_INSTALL=/home/soft/hive
export PATH=$PATH:$HIVE_INSTALL/bin
# HIVE conf ends

Reload the .bashrc file with: source ~/.bashrc

Step 3:

  • Hive is built on top of Hadoop, so for Hive to interact with Hadoop HDFS, it must know the path to the Hadoop installation directory.
  • Open hive-config.sh (found in the bin directory of your Hive installation) and add the line below:

                export HADOOP_HOME=/home/soft/hadoop

Step 4:

  • Hive stores its data in HDFS, so create a warehouse directory within HDFS. Hive will keep all table data under this directory.
    hdfs dfs -mkdir -p /user/hive/warehouse
  • Modify the folder permissions (note this is an HDFS path, so use the HDFS shell rather than the local chmod):
    hdfs dfs -chmod -R 775 /user/hive/warehouse


Step 5:

  • Very important step: Hive will start with the steps above, but each time you start it you would have to remove or rename metastore_db. The metastore is the central repository of Apache Hive metadata: it stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and provides client access to this information through the metastore service API.
  • To avoid this, we need the configuration below.
  • In the conf directory, create a copy of hive-default.xml.template and rename the copy to hive-site.xml.
  • Next, edit hive-site.xml.
  • Give the full path for the metastore_db location as below:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/soft/hive/metastore_db;create=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for a postgres database.
  </description>
</property>
Then add the lines below near the start of hive-site.xml:

<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>

Step 6:

  • Before starting Hive, make sure your Hadoop NameNode and DataNode are also started.
  • Start Hive from the terminal:
  • /home/soft/hive/bin/hive (or simply hive, since its bin directory is on your PATH after Step 2)

Enjoy querying with Hive. Let me know if you face any issues.

Rajesh Bandakkanavar

A software engineer with 6+ years of expertise in the field of Big Data and Hadoop. Rajesh loves sports and yoga and spends most of his free time around new technologies.

