Tuesday, September 3, 2013

Installing and Running Pig on Your Hadoop Cluster


Apache Pig provides an engine for executing data flows in parallel on Hadoop.

  1. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). 
  2. Pig's language layer currently consists of a textual language called Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, group etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. 

Key Properties of Pig Latin because of which it is so popular esp on Hadoop ecosystem:

  • Ease of programming 
  • Optimization opportunities
  • Extensibility

This post will talk about installing Pig on top of Your hadoop cluster(You can also run it in local mode) and running a pig script. The pre requisite for this post is that You already have Your hadoop cluster set up and You have fair idea of Map Reduce Programming Model. For this You can visit the below blog posts:

Installing Pig:

Step 1: Download the latest stable release of Pig from the Apache website. I have downloaded pig-0.11.0 from : http://pig.apache.org/releases.html#Download. Place the tar file in Your big data folder. I have placed it under /home/hduser/bigdata
Step 2: Untar the tar.gz file and then rename the extracted folder to pig.
tar -xvf pig-0.11.0.tar.gz mv pig-0.11.0 pig
Step 3: Edit Your bashrc file to include PIG_HOME, PIG_CLASSPATH and add the downloaded pig's bin folder to the PATH. This sets the pig related environment variables. You should remember how we had included the hadoop related paths to the bashrc while installing and configuring hadoop.
cd vim .bashrc
Add the export PIG_HOME and export PIG_CLASSPATH at the end after Hadoop related exports that we had added while configuring Hadoop.
export PIG_HOME=/home/hduser/bigdata/pig
export PIG_CLASSPATH=$HADOOP_HOME/conf
Add the PIG_HOME/bin to the system path and bring this export command to the end of all hadoop related exports.
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin

The Grunt Shell:

Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS. To enter Grunt, invoke Pig without any command options or enter pig -x mapreduce. This will run pig on Your hadoop cluster. If we give pig –x local, Pig is invoked on Your local FileSystem.
So now we enter the Grunt Shell.
Grunt provides command-line history and editing, as well as Tab completion. It does not provide filename completion via the Tab key. Even though it is useful to work interactively but it is not a full-featured shell. We can write all these data flows in external file which having .pig extension

Example Pig Script:

Now lets write a simple pig script. We take the same example as we worked on in our Map Reduce Blog http://atomkpadhy.blogspot.in/2013/08/an-example-java-map-reduce-program.html. We can download the data files from the link provided in te MR blog and upload it to our hdfs cluster if not done so.

cd $PIG_HOME/scripts
vim NYSEYearlyAnalysis.pig

Now we type the below pig script which will join the prices and dividends file and generate the joined report of MaxPrice of the stock for the Year, Min Price of the stock for the Year and Average Dividends of the stock for the year.

prices = load '/user/hduser/NYSE/prices' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj:double);
proj = foreach prices generate symbol,SUBSTRING(date,0,4) as year,high,low;
pricesgrpd = group proj by (symbol,year);
pricesmaxmin = foreach pricesgrpd generate group,MAX(proj.high) as maxhigh,MIN(proj.low) as minlow;
dividends = load '/user/hduser/NYSE/dividends' using PigStorage(',') as (exchange:chararray, symbol:chararray, date:chararray, dividends:double);
proj = foreach dividends generate symbol,SUBSTRING(date,0,4) as year,dividends;
dividendsgrpd = group proj by (symbol,year);
dividendsavg = foreach dividendsgrpd generate group,AVG(proj.dividends);
joind = join pricesmaxmin by group, dividendsavg by group;
store joind into '/user/hduser/NYSE/pigjoin';

Now run the script from the grunt shell with the help of run NYSEYearlyAnalysis.pig or exec NYSEYearlyAnalysis.pig command and this generates the MapReduce jobs and submits on your Hadoop cluster. The actual generation and execution of MR jobs starts when the Pig Engine encounters a dump or store command.
Note that You could have run these commands interactively in the grunt shell one by one and that would have done the same thing.
The above script was just a simple script for testing our pig install. You can refer to the Programming Pig by Alan Gates for learning Pig and its features.
                                   

No comments:

Post a Comment

Popular

Featured

Three Months of Chadhei