Wednesday, July 31, 2013

Configuring Eclipse to Run Map/Reduce Programs on Your Hadoop Cluster

After setting up your Hadoop cluster, which we learnt in the earlier blogs (Pseudo Distributed Cluster, Fully Distributed Cluster), we now set up Eclipse on Ubuntu to develop Map Reduce programs for our Hadoop cluster. Below are the steps to do so:

  • Install Eclipse on Ubuntu using $ sudo apt-get install eclipse.
  • The next step is to download the Hadoop Eclipse plugin. Many tutorials suggest building the plugin jar file yourself using "ant". Trust me, this method is quite messy and you may mess up your Hadoop install while changing the various build.xml and build-properties.xml files. The easier way is to download the jar file directly by clicking here.
  • Next, copy the jar file to the eclipse/plugins directory. It is /usr/lib/eclipse/plugins in my case.
  • Start Eclipse and go to "Window >> Open Perspective >> Other". In the perspectives window you should see "Map/Reduce"; select it and click "OK".
  • You will now see the "Map/Reduce" perspective icon at the top right-hand corner of the main Eclipse panel, as highlighted below.

  • You will also see the Map/Reduce Locations tab at the bottom. Go to that tab and add a new location as shown below.

  • In the details form, give any Location Name. Under the Map/Reduce Master section, fill in Host with your cluster's name node host name; in our case it is "master". The port is 9001. Under DFS Master, check the "Use M/R Master host" option and give the port as 9000. Under User name, give your Hadoop install user name; it is "hduser" in our case. Now click Finish.


  • Now you should see your newly set up DFS Location in the Project Explorer.

Now your environment is all set for development.
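Before writing any code, it can help to double-check that the ports you entered in the location form match the cluster's own configuration. A minimal sketch, run on the cluster's master node and assuming the conf layout from the earlier set-up posts:
On Master: $ grep -A1 "fs.default.name" $HOME/bigdata/hadoop/conf/core-site.xml
 - The value line should show hdfs://master:9000, matching the DFS Master port.
On Master: $ grep -A1 "mapred.job.tracker" $HOME/bigdata/hadoop/conf/mapred-site.xml
 - The value line should show master:9001, matching the Map/Reduce Master port.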

Thursday, July 25, 2013

Installing and Configuring Hadoop in Fully Distributed Mode

For setting up a multi-node cluster, you first need to learn and understand setting up a pseudo distributed cluster. You can go through my earlier blog by clicking here to learn this process.

Now that you know how to set up a pseudo distributed Hadoop cluster, you can follow the steps below to build your multi-node cluster. I will describe the process of setting up a two node cluster here, with a "master" and a "slave" machine. The master is where the name node and job tracker run, and it is the single point of interaction (and of failure) in the cluster. The slaves run the data node and task tracker and act as per the direction of the HDFS and Map Reduce masters. You can use this process to scale up to as many nodes as you want.

1. Pick your second computer, which you want to make "the slave" in the cluster. We will call it "slave". Find out its IP address by issuing the ifconfig command in the terminal. If you are setting up your cluster on a Wi-Fi LAN and your IP address is assigned by DHCP and changes every time you log in, then you may consider setting up a static IP address, as done for the master node in my previous tutorial. Click here to go to the blog on setting up a static IP address.

2. Do steps 1, 2, 3, 4, 5 and 6 of the pseudo-distributed cluster set-up blog on the slave machine. The hostname on this machine should be set to "slave".

3. Define the slave machine in the /etc/hosts file of the master and vice versa, for example:
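A minimal sketch of the entries on both machines (the IP addresses here are only examples; use the inet addresses you got from ifconfig):
192.168.1.10    master
192.168.1.11    slave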

4. scp the slave machine's id_rsa.pub key, generated in step 5, to the master machine.
On Slave: $ scp .ssh/id_rsa.pub hduser@master:/home/hduser/
It will ask you for the password of the master machine's hduser account at this point.
Now concatenate this id_rsa.pub of the slave to the authorized_keys file of the master.
On Master: $ cat $HOME/id_rsa.pub >> $HOME/.ssh/authorized_keys
On Master: $ rm $HOME/id_rsa.pub
Now scp the authorized_keys file from the master to the slave.
On Master: $ scp .ssh/authorized_keys hduser@slave:/home/hduser/.ssh/
It will ask you for the password of the slave's hduser account at this point.
After this step, passwordless SSH communication is set up between your master and slave machines using the hduser account. You can test it out using the following commands:
On Slave (logged in as hduser): $ ssh master
On Master (logged in as hduser): $ ssh slave
SSH login to the other system won't need a password now and the two systems can talk to each other.

5. Now on the master machine where you had set up your pseudo cluster, make the below changes:
  • Add "slave" to the hadoop/conf/slaves file. Let master continue to be listed there, so that a data node and a task tracker run on the master machine too. The slave nodes have to be mentioned in the slaves file, one slave per line (see the check after this list).
  • Change the dfs.replication property in the hdfs-site.xml file from 1 to 2, so that two copies of each block are stored redundantly in the cluster.
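A quick check of the slaves file after this change (a sketch, assuming the Hadoop home used in these posts):
On Master: $ cat $HOME/bigdata/hadoop/conf/slaves
master
slave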

6. Now scp the entire /home/hduser/bigdata/hadoop/ directory from the master to the slave computer.
On Master: $ scp -r bigdata/hadoop/* hduser@slave:/home/hduser/bigdata/hadoop/ 
Note that the absolute path of the Hadoop home should remain the same on both machines. Also, the directory structure below should be present on the slave just as on the master (create the hadoopdata directories on the slave as in step 7 of the pseudo distributed set-up):

- bigdata (our main bigdata projects related directory)
      - hadoop (Hadoop application folder. This one is scp'ed from the master.)
      - hadoopdata (data directory)
         - name (dfs.name.dir points to this directory)
         - data (dfs.data.dir points to this directory)
         - tmp  (hadoop.tmp.dir points to this directory)

7. Clear the masters and slaves files on the slave machine, for example:
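One way to empty both files on the slave (a sketch; these files are only read by the start/stop scripts, which we will run from the master, and a bare > redirection in bash truncates a file to zero length):
On Slave: $ > $HOME/bigdata/hadoop/conf/masters
On Slave: $ > $HOME/bigdata/hadoop/conf/slaves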

8. Now clear the $HOME/bigdata/hadoopdata/data directory on both machines.
$ rm -rf $HOME/bigdata/hadoopdata/data/*

9. Format the namenode on the master machine:
On Master: $ hadoop namenode -format

Now your cluster is ready to run Hadoop in fully distributed mode. You can run start-all.sh on your master machine. It will start the name node, data node, job tracker, secondary name node and task tracker daemons on the master, and start a data node and a task tracker on the slave machine. You can check the URL master:50070 for name node administration and master:50030 for map reduce administration. After you are done with your Hadoop work, don't forget to issue stop-all.sh on the master node to stop the daemons running on the cluster.
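A quick sanity check of the start-up, run from the hduser account (a minimal sketch based on the daemon list above):
On Master: $ start-all.sh
On Master: $ jps
 - Expect NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker (plus Jps itself).
On Slave: $ jps
 - Expect DataNode and TaskTracker.
On Master: $ stop-all.sh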

Saturday, July 20, 2013

Installing and Configuring Hadoop on Ubuntu as a Pseudo Distributed Single Node Cluster


Briefly, what is a pseudo-distributed single node Hadoop cluster:

  • It is a Hadoop distributed cluster having a single node.
  • All five daemons (NN, DN, JT, TT, SNN) run as individual JVM (Java process) instances.
  • We get all the distributed features of Hadoop.
  • This mode is used to develop, test and stage (before production) Map Reduce programs quickly.
  • It writes all the system, daemon and user logs to different files on the cluster.
  • To install Hadoop in this mode, we have to modify hadoop-env.sh, core-site.xml, hdfs-site.xml and mapred-site.xml.
  • hadoop-env.sh contains all the environment variables required for running Hadoop. All the default values are fine; we only have to set JAVA_HOME to the location of the Java home directory.
  • core-site.xml contains all the cluster specific information like IO settings, the temp directory, sorting, the file system URI, etc.
  • hdfs-site.xml contains all the parameters specific to HDFS, like the name node directory, data node directory, replication factor, etc.
  • mapred-site.xml contains all the parameters specific to Map Reduce, like the Job Tracker URI, the intermediate data location, concurrent map/reduce tasks, etc.
  • Note (optional): the masters and slaves files can be modified if you change the DNS name of the machine; the default value of localhost works, so there is no need to change the machine's DNS name.

Below is the complete step by step process of setting up a pseudo distributed single node Hadoop cluster on top of Ubuntu 12.04. Click here to visit my blog on setting up a fully distributed two node Hadoop cluster.

1. Installing Ubuntu:

  • Download the Ubuntu 12.xx or 13.xx ISO image from http://www.ubuntu.com/download/desktop.
  • Using PowerISO or a USB installer, convert the ISO image to a bootable USB for installing Ubuntu.
  • Reboot the system with the USB drive having higher priority in the boot order and install Ubuntu alongside your local OS or in a separate partition.
  • Now log in to Ubuntu and proceed with the next steps.

2. Installing Java:

You can install either OpenJDK or Oracle JDK in your Ubuntu installation. I have installed the Oracle JDK here. The commands below show the process of installing Java on Ubuntu.
$ sudo apt-get update 
- This updates the local package index from the repositories.
$ sudo apt-cache search jdk 
- This is needed if you don't know the exact package name to be installed and want to search for all related packages.
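Note that the oracle-java7-installer package is not part of the stock Ubuntu repositories; it comes from a third-party PPA (the webupd8team/java PPA at the time of writing), so you may need to add that first:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update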
$ sudo apt-get install oracle-java7-installer
 - This installs java-7 on the system.
$ sudo update-java-alternatives -s java-7-oracle
The full JDK will be placed in /usr/lib/jvm/java-7-oracle (well, this directory is actually a symlink on Ubuntu). After installation, make a quick check whether Oracle's JDK is correctly set up:
$ java -version

3. Create a user called hduser, added to a hadoop user group:

We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

4. Edit the hostname and hosts files to define the DNS names of this computer and the other nodes.

$ sudo gedit /etc/hostname
 - Change the host name to “master” in this file. This will name your computer as “master” in the network.
$ ifconfig
 - Do this and get the inet address
$ sudo gedit /etc/hosts
Add the inet address that you got from ifconfig and map it to master, for example as shown below. This defines the DNS name for your node in the cluster. Restart your machine after this step.
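A minimal sketch of the resulting /etc/hosts line (the IP address here is only an example; use the inet address reported by ifconfig):
192.168.1.10    master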
Note that if the inet address of the machine changes every time you log into your network because of DHCP settings, then you may consider assigning a static IP to your computer in your network. Click here to visit my blog on setting up a static IP address in Ubuntu.


5. Install ssh and run ssh keygen:

For running Hadoop in pseudo distributed and fully distributed mode, we need passwordless SSH communication to be active and working, as Hadoop does an SSH into localhost or the other nodes for execution.
$ sudo apt-get install openssh-server
 - This installs the OpenSSH server.
$ su - hduser
 - Switch the terminal session to the hduser account.
$ ssh-keygen -t rsa -P ""
 - This generates the public/private key pair, and you can see two files, id_rsa and id_rsa.pub, in the .ssh folder in the home directory of hduser. Note that the passphrase is not set and is suppressed by giving -P "".
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
 - This concatenates the newly generated public key to the authorized_keys file.
The final step in the SSH install and configuration process is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file. If you have any special SSH configuration for your local machine, like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
$ su hduser
$ ssh master

6. Edit the .bashrc file

Every user has their own .bashrc file, where environment variables need to be defined so that they are set automatically when a session starts. Edit the .bashrc file, which is in the home directory.
$ su hduser
$ gedit .bashrc
Add the following three export commands to define the environment variables. The three exports should be added at the end.
# Set Hadoop-related environment variables 
export HADOOP_HOME=/home/hduser/bigdata/hadoop 
export JAVA_HOME=/usr/lib/jvm/java-7-oracle 
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

7. Download and extract Hadoop from the apache.org website.

Create a directory in hduser's home directory called bigdata.
$ su hduser
$ cd $HOME
$ mkdir -p bigdata
$ cd bigdata
Download the tar.gz file from the Apache Hadoop download page http://hadoop.apache.org/releases.html#Download. I downloaded the hadoop-1.1.2.tar.gz file and saved it in the bigdata folder. Download a stable version/release from the mirror suggested on the website.
Now extract the contents in the tar file by issuing the command:
$ tar xvf hadoop-1.1.2.tar.gz
Now rename the hadoop-1.1.2 folder to just hadoop:
$ mv hadoop-1.1.2 hadoop
Now set the permissions on the hadoop folder and its subdirectories and files recursively using the commands below:
$ sudo chown -R hduser $HOME/bigdata/hadoop
$ sudo chmod -R 755 $HOME/bigdata/hadoop
Now we need to create, and set the permissions of, the name, tmp and data directories for our Hadoop install. These three folders reside in a directory called hadoopdata inside our bigdata folder. For this, run the commands below:
$ mkdir -p hadoopdata
$ cd hadoopdata
$ mkdir name
$ mkdir data
$ mkdir tmp
$ sudo chmod -R 755 $HOME/bigdata/hadoopdata

So at the end of step 7 we have built the following directory structure for our hadoop installation and configuration.

$HOME (hduser's home directory)
   - bigdata (our main bigdata projects related directory)
      - hadoop (Apache Hadoop Application)
      - hadoopdata (data directory)
         - name (dfs.name.dir points to this directory)
         - data (dfs.data.dir points to this directory)
         - tmp  (hadoop.tmp.dir points to this directory)

8. Configuring Hadoop:

Hadoop has 1000+ configuration parameters for changing its runtime behaviour. All these parameters have default values. For setting up our Hadoop installation to work in various modes and environments, we override the default configuration parameters in any of the 6 configuration files that are located in the $HOME/bigdata/hadoop/conf directory.
For setting up a single node cluster, or a pseudo distributed environment, we have to make the following changes in the configuration files:

hadoop-env.sh --> For setting Hadoop environment variables

In this file, set your JAVA_HOME just as we set it in our .bashrc file.

$ vim $HOME/bigdata/hadoop/conf/hadoop-env.sh
Make sure this line is present and uncommented to set the java home variable.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

core-site.xml --> For setting Hadoop cluster information related configuration properties

Add the lines below to set two properties – hadoop.tmp.dir and fs.default.name – in the core-site.xml file. This code is added between the <configuration> and </configuration> tags in the XML file.
 <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/bigdata/hadoopdata/tmp</value>
    <description>A base for other temporary directories.</description>
</property>

<property>

     <name>fs.default.name</name>
     <value>hdfs://master:9000</value>
     <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>


hdfs-site.xml --> For setting HDFS related configuration properties

Here you set three properties – dfs.replication, dfs.name.dir and dfs.data.dir. Since we are working on a single node pseudo distributed cluster, set the replication factor to 1. Note that the default replication factor in Hadoop is 3.

Also, the name and data directories by default live under the tmp folder, which is not recommended, mainly for two reasons:
  • The tmp folder is not permanent and gets cleared on restart.
  • There is a storage space limitation to the tmp folder.
So we need to override the name and data directory properties in the hdfs-site.xml file.
<property>

     <name>dfs.replication</name>
     <value>1</value>
     <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description>
</property>

<property>

   <name>dfs.name.dir</name> 
   <value>/home/hduser/bigdata/hadoopdata/name</value>
   <description> Namenode meta information directory</description>
</property>

<property>

   <name>dfs.data.dir</name>
   <value>/home/hduser/bigdata/hadoopdata/data</value>
   <description>Actual Data Location Directory</description>
</property>


mapred-site.xml --> For setting Map Reduce related configuration properties

We need to set the mapred.job.tracker property here.

<property>
     <name>mapred.job.tracker</name>
     <value>master:9001</value>
     <description>The host and port that the MapReduce job tracker runs  at.  If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>

Now edit the masters and slaves files to list the host name in them.

slaves --> all domain names (IP info) of the slave nodes (Data Node + Task Tracker)
masters --> domain name of the Secondary Name Node
So I change both files to contain the hostname "master" in this case of running Hadoop in pseudo distributed mode, as checked below.
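A quick check of both files after the change (a sketch, assuming the Hadoop home used in this post):
$ cat $HOME/bigdata/hadoop/conf/masters
master
$ cat $HOME/bigdata/hadoop/conf/slaves
master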

9. Formatting the namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ hadoop namenode -format
This will create the directory structure of the name folder, which you can browse through to check.

10. Starting Hadoop cluster

  • Run the command start-all.sh and you should see all 5 daemons starting, i.e. the Name Node, Secondary Name Node, Job Tracker, Task Tracker and Data Node.
  • You can run the $ jps command to see these daemons running.
  • Open a browser and go to http://master:50070. This is your name node administration web interface.
  • Go to http://master:50030. This is your map reduce admin web interface.
  • Now stop all these daemons by issuing the stop-all.sh command. The whole sequence is sketched below.
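A minimal run-through of this step from the hduser account (a sketch based on the list above; start-all.sh and stop-all.sh are on the PATH via the HADOOP_HOME/bin entry added to .bashrc in step 6):
$ start-all.sh
$ jps
 - Expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker in the output (plus Jps itself).
$ stop-all.sh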

Click here to visit my blog on setting up a fully distributed two node Hadoop cluster.

Sunday, July 14, 2013

Shared Folder Between Your Ubuntu VM and Native OS

Very frequently you will need files to be shared between your local OS and your Ubuntu VM running under Oracle VirtualBox. Here are the steps to set up a shared folder on an Ubuntu VirtualBox VM.

  • Start your Ubuntu VM in Oracle VirtualBox.
  • Go to Devices >> Install Guest Additions. This will install the Guest Additions in a terminal session.
  • Next, reboot your VM.
  • Create a folder in your native OS which you want to share. I have named it "Shared".
  • Now in your VM, go to Devices >> Shared Folders and then hit the "Add" button.
  • Browse for the folder in your native OS that you recently created to be shared, and check the Make Permanent option while doing so.
  • Now in your VM system, create a folder in your home directory to act as the mount point and get its full path. The one that I have created is /home/atom/Shared.
  • Now open a terminal in your VM and issue the command: sudo mount -t vboxsf Shared /home/atom/Shared. Note that you will have to run this command to mount the shared drive every time you restart your VM, so save it in a text file or as a shell script to run after rebooting, for example the small script below.
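A minimal sketch of such a script (the share name Shared and the mount point /home/atom/Shared come from the steps above; the file name remount-shared.sh is just an example):
#!/bin/bash
# remount-shared.sh: remount the VirtualBox shared folder after a reboot
sudo mount -t vboxsf Shared /home/atom/Shared
Make it executable with chmod +x remount-shared.sh and run it after each reboot.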
