Monday, September 16, 2013

Setting up and Running Hive on Hadoop


Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally high (minutes) even when the data sets involved are very small (say a few hundred megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses are conducted on significantly smaller amounts of data but proceed far more iteratively, with response times between iterations of less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).

Below we discuss the process to set up and run Hive on our Hadoop cluster. Please note that this blog does not discuss the concepts and architecture of Hive. For that you should go through Hadoop: The Definitive Guide by Tom White or Programming Hive from O'Reilly.

For this tutorial the prerequisite is that we already have a Hadoop cluster up and running. If you don't have one, don't worry; you can visit my previous blog posts about setting up Hadoop.

Installing Hive:

Step 1: Download a stable Hive release from the Apache Hive download site: http://www.apache.org/dyn/closer.cgi/hive/. I downloaded Hive 0.11 for my setup.

Step 2: Place the tar.gz file in your bigdata folder and then untar it. (Remember the folder structure I had created for my Hadoop install in my previous blog?)

cd bigdata
tar -xvf hive-0.11.0.tar.gz

Step 3: Rename the extracted folder to hive :
mv hive-0.11.0 hive

Step 4: Add an export command for HIVE_HOME and add Hive's bin directory to the system PATH, as we had done earlier for Hadoop and Pig.

export HIVE_HOME=/home/hduser/bigdata/hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin:$HIVE_HOME/bin
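After saving .bashrc, reload it and do a quick sanity check that Hive is on the PATH (just a verification step; the path assumes the layout above):
source ~/.bashrc
echo $HIVE_HOME
which hive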

Step 5: Create the warehouse folder on HDFS, which is /user/hive/warehouse by default. You can see this default in the hive-default.xml.template file in $HIVE_HOME/conf.
hadoop fs -mkdir /user/hive
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod -R 755 /user/hive/warehouse
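You can confirm that the directories exist with:
hadoop fs -ls /user/hive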

Hive needs an RDBMS to store its metadata (information about databases, tables, indexes, etc.).
Derby: Hive ships with an embedded database called Apache Derby (Java DB). It is useful for development activities, but it has limitations: it allows only one connection at a time and there is no central database management. It uses the current directory as the database location.
MySQL: For production deployments, where many developers connect to the Hadoop cluster concurrently, the metadata can be stored in a full-featured RDBMS such as MySQL or Oracle. Here MySQL is the preferred option for our metastore database.
We have the following options for running the MySQL server:
 Run MySQL on the local machine

 Run MySQL on a dedicated (remote) machine
In either case MySQL runs on one machine and all the Hive clients connect to the shared metadata stored in that MySQL installation. This communication happens over a protocol called Thrift, which is why we run the Hive Thrift services (Steps 12 and 13).
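For contrast, the embedded Derby setup corresponds to the default connection URL shipped in hive-default.xml.template (we will replace it with MySQL settings in Step 11):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>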

Step 6: Installing and configuring MySQL.
sudo apt-get install mysql-server
While installing, it will ask you to set a password for the "root" user. Set a password that you won't forget.
After the install, to make sure your MySQL server is running, issue the command below:
sudo netstat -tap | grep mysql

Step 7: Download the MySQL JDBC connector (Connector/J) from http://dev.mysql.com/downloads/connector/j/#downloads, extract the tar.gz file, and place the connector JAR in $HIVE_HOME/lib so that it gets added to Hive's classpath.
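For example (the Connector/J version number below is only a placeholder; substitute whichever release you downloaded):
tar -xzf mysql-connector-java-5.1.26.tar.gz
cp mysql-connector-java-5.1.26/mysql-connector-java-5.1.26-bin.jar $HIVE_HOME/lib/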

Step 8: Create a MySql user for Hive.
Enter mysql using the command:
mysql -u root -p
mysql> create user 'hive'@'hostname' identified by 'password';
I used localhost as my hostname here.

Step 9: Create a database for hive metastore:
mysql> Create database hivemetastore;

Step 10: Grant all privileges to hive user on the database hivemetastore:
mysql> Grant all on hivemetastore.* to 'hive'@'hostname';
mysql> flush privileges;
mysql> exit;
Again, the hostname is localhost here.
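You can optionally verify the new account and its privileges by logging in as the hive user:
mysql -u hive -p
mysql> show databases;
mysql> exit;
The hivemetastore database should appear in the list.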

Step 11: Copy hive-default.xml.template to hive-site.xml in the $HIVE_HOME/conf directory. Now edit the hive-site.xml file to set the following properties:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hostname:10000</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host. I used hostname as localhost here.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/hivemetastore</value>
  <description>I used hostname as localhost here.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>YourPassword</value>
</property>
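Optionally, you can also set the warehouse location explicitly in hive-site.xml; the property below already exists in the template with this default, and it matches the directory we created in Step 5:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>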

Step 12: Start the Hive Thrift server by issuing the command below:
hive --service hiveserver &

Step 13: Start the Hive metastore service with the command below:
hive --service metastore &
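As a rough check that both services came up, look for listening Java processes (the ports will be whatever you configured above):
sudo netstat -tlnp | grep java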

Now Hive is up and running on your client. You can enter the Hive shell by issuing the command hive. Let's start testing and discovering Hive features by issuing the create database command.
hive> create database nyse; 
Whenever a database is created, Hive creates a directory for it in the warehouse location with a .db extension.
We can override the location (and add a description of the database) using the following syntax:
hive> create database databaseName comment 'some comments describing the database' location '/our own path';
hive>describe database nyse;
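As an illustration, a hypothetical scratch database with an explicit comment and location could be created like this (the name and path are just examples):
hive> create database testdb comment 'scratch database for experiments' location '/user/hduser/testdb';
hive> describe database testdb;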
Now let's create our stocks table to hold the daily prices data. Let's make this an external table partitioned by the first character of the stock symbol.
For this, let's create the stocks table as below:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks ( exchange STRING, symbol STRING, state STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT)  PARTITIONED BY(symbol_char STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Now let's create a temp table to temporarily hold the stock price data from the file that we already have in HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS stockstemp ( exchange STRING, symbol STRING, state STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT, price_adj_close FLOAT)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
LOAD DATA INPATH 'NYSE/prices/' OVERWRITE INTO TABLE stockstemp;
Now insert the data into the main stocks table from the stockstemp table, loading it partition by partition based on the first character of the symbol. Start with the 'A' partition:
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='A') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'A'; 
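To check that the partition was created and populated, you can run:
hive> show partitions stocks;
hive> select count(*) from stocks where symbol_char = 'A';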
Now let's put the queries that load the remaining partitions, B through Z, into a separate file with a .q extension (a dynamic-partitioning alternative is sketched just after the script).
cd $HIVE_HOME/scripts
vim loadStocks.q
The Hive script loadStocks.q holds the HQL statements below:
use nyse;
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='B') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'B';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='L') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'L';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='C') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'C';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='D') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'D';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='E') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'E';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='F') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'F';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='G') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'G';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='H') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'H';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='I') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'I';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='J') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'J';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='K') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'K';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='M') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'M';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='N') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'N';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='O') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'O';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='P') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'P';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Q') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Q';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='R') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'R';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='S') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'S';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='T') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'T';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='U') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'U';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='V') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'V';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='W') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'W';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='X') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'X';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Y') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Y';
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char='Z') SELECT * FROM stockstemp WHERE SUBSTR(symbol,1,1) = 'Z';
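As an aside, instead of writing one INSERT per letter, Hive's dynamic partitioning feature can load all the partitions in a single statement. The sketch below is not part of the original walkthrough; it assumes dynamic partitioning is enabled and that the column order matches the table definitions above:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE stocks PARTITION(symbol_char)
SELECT exchange, symbol, state, price_open, price_high, price_low, price_close, volume, price_adj_close, SUBSTR(symbol,1,1)
FROM stockstemp;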

Now run the loadStocks.q script:
 hive -f loadStocks.q
We see that a directory is created for each of the above partitions, with the data stored in files inside those directories. You can now run select queries on the whole stocks table and explore the more advanced features supported by Hive. Programming Hive from O'Reilly is a good book to continue learning Hive.
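For instance, a simple aggregate over one of the partitions we just loaded might look like this (just an illustrative query):
hive> select symbol, max(price_close) as max_close from stocks where symbol_char = 'A' group by symbol;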

Tuesday, September 3, 2013

Installing and Running Pig on Your Hadoop Cluster


Apache Pig provides an engine for executing data flows in parallel on Hadoop.

  1. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). 
  2. Pig's language layer currently consists of a textual language called Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, group etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. 

Key properties of Pig Latin that make it so popular, especially in the Hadoop ecosystem:

  • Ease of programming 
  • Optimization opportunities
  • Extensibility

This post will talk about installing Pig on top of your Hadoop cluster (you can also run it in local mode) and running a Pig script. The prerequisite for this post is that you already have your Hadoop cluster set up and a fair idea of the MapReduce programming model. For this you can visit my earlier blog posts.

Installing Pig:

Step 1: Download the latest stable release of Pig from the Apache website. I have downloaded pig-0.11.0 from http://pig.apache.org/releases.html#Download. Place the tar file in your bigdata folder. I have placed it under /home/hduser/bigdata.
Step 2: Untar the tar.gz file and then rename the extracted folder to pig.
tar -xvf pig-0.11.0.tar.gz
mv pig-0.11.0 pig
Step 3: Edit your .bashrc file to include PIG_HOME and PIG_CLASSPATH, and add Pig's bin folder to the PATH. This sets the Pig-related environment variables. You should remember how we included the Hadoop-related paths in .bashrc while installing and configuring Hadoop.
cd
vim .bashrc
Add the export PIG_HOME and export PIG_CLASSPATH lines at the end, after the Hadoop-related exports that we added while configuring Hadoop.
export PIG_HOME=/home/hduser/bigdata/pig
export PIG_CLASSPATH=$HADOOP_HOME/conf
Add $PIG_HOME/bin to the system PATH and keep this export command after all the Hadoop-related exports.
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin
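After saving the file, reload .bashrc and verify that Pig resolves from the new location:
source ~/.bashrc
which pig
pig -version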

The Grunt Shell:

Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a shell for interacting with HDFS. To enter Grunt, invoke Pig without any command options or enter pig -x mapreduce; this will run Pig on your Hadoop cluster. If we give pig -x local, Pig is invoked against your local filesystem.
So now we enter the Grunt Shell.
Grunt provides command-line history and editing, as well as Tab completion (though not filename completion via the Tab key). Even though it is useful to work interactively, it is not a full-featured shell. We can also write these data flows in an external file with a .pig extension.
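For example, a short Grunt session to browse HDFS and define a relation might look like this (the paths are the ones used later in this post):
grunt> fs -ls /user/hduser/NYSE
grunt> prices = load '/user/hduser/NYSE/prices' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
grunt> describe prices;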

Example Pig Script:

Now let's write a simple Pig script. We take the same example that we worked on in our MapReduce blog: http://atomkpadhy.blogspot.in/2013/08/an-example-java-map-reduce-program.html. You can download the data files from the link provided in the MR blog and upload them to our HDFS cluster if not already done.

cd $PIG_HOME/scripts
vim NYSEYearlyAnalysis.pig

Now we type the Pig script below, which joins the prices and dividends files and generates, for each stock and year, the maximum price, the minimum price, and the average dividend.

-- Load the daily prices file and project symbol, year, high and low
prices = load '/user/hduser/NYSE/prices' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
proj = foreach prices generate symbol,SUBSTRING(date,0,4) as year,high,low;
-- Group by (symbol, year) and compute the max high and min low per group
pricesgrpd = group proj by (symbol,year);
pricesmaxmin = foreach pricesgrpd generate group,MAX(proj.high) as maxhigh,MIN(proj.low) as minlow;
-- Load the dividends file and compute the average dividend per (symbol, year)
dividends = load '/user/hduser/NYSE/dividends' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, dividends:double);
proj = foreach dividends generate symbol,SUBSTRING(date,0,4) as year,dividends;
dividendsgrpd = group proj by (symbol,year);
dividendsavg = foreach dividendsgrpd generate group,AVG(proj.dividends);
-- Join the two results on the (symbol, year) group key and store the output in HDFS
joind = join pricesmaxmin by group, dividendsavg by group;
store joind into '/user/hduser/NYSE/pigjoin';

Now run the script from the Grunt shell with the run NYSEYearlyAnalysis.pig or exec NYSEYearlyAnalysis.pig command; this generates the MapReduce jobs and submits them to your Hadoop cluster. The actual generation and execution of MR jobs starts when the Pig engine encounters a dump or store command.
Note that you could have run these statements interactively in the Grunt shell one by one, and that would have done the same thing.
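Alternatively, you could run the script non-interactively from the command line and then inspect the output directory (the exact part-file names can vary with your Hadoop version):
pig NYSEYearlyAnalysis.pig
hadoop fs -cat /user/hduser/NYSE/pigjoin/part* | head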
The above script was just a simple one for testing our Pig install. You can refer to Programming Pig by Alan Gates to learn Pig and its features.
                                   
