Now that You know how to set up a pseudo distributed Hadoop cluster, You can do the below steps to build your multimode cluster. I will mention the process of setting up a two node cluster here with a "master" and a "slave" machine. The master is where the name node and job tracker runs and is the single point of interaction and failure in the cluster. The slaves run data node and task tracker and act as per the direction of the hdfs and map reduce master. You can use this process to scale up to as many nodes as You want.
1. Pick Your 2nd computer which You want to make as "the slave" in the cluster. We will call it "slave". Find out its ip address issuing the ifconfig command in the terminal. If you are in a wifi LAN to set up your cluster and if Your ip address is set to DHCP and changes frequently everytime You login, then You may consider setting up a static ip address as done for the master node in my previous tutorial. Click here to go to the blog to set up static ip address.
2. Do steps 1, 2, 3, 4, 5, 6 of pseudo-distributed cluster set-up process blog on the slave machine. The hostname on this machine should be set as "slave".
3. Define the slave machine in the /etc/hosts file of the master and vice versa.
4. scp the slave machine's id_rsa.pub key generated in step 5, to the master machine.
On Slave: $ scp -r .ssh/id_rsa.pub hduser@master:/home/hduser/
It will ask you for the password of master machine's hduser account at this point.
Now Concatenate this id_rsa.pub of slave to the authorized_keys file of master.
On Master: cat $HOME/id_rsa.pub >> $HOME/.ssh/authorized_keys
On Master: $ rm -rf $HOME/id_rsa.pub
Now SCP the authorized_keys file from master to slave.
On Master: $ scp -r .ssh/authorized_keys hduser@slave:/home/hduser/.ssh/
It will ask you for a password of the slave hduser account at this point.
After this step, the passwordless ssh communication is set up between your master and slave machine using hduser account. You can test it out by using the following commands
On slave(logging into hduser account): $ ssh master
On master(logging onto hduser account): $ssh slave
SSH login to the other system won't need a password now and the two systems can talk to each other.
5. Now on the master machine where you had set up your pseudo cluster, make the below changes:
- Add "slave" to the hadoop/conf/slaves file. Let master continue to be listed there, so that a data node and a task tracker runs on the master machine too. The slave nodes have to mentioned in the slaves file, one slave per line.
- Change the dfs.replication property in hdfs-site.xml file from 1 to 2. So 2 copies of each block is going to be stored in the cluster redundantly.
6. Now scp the entire /home/hduser/bigdata/hadoop/ directory from master to slave computer.
On Master: $ scp -r bigdata/hadoop/* hduser@slave:/home/hduser/bigdata/hadoop/
Note that the absolute path of the hadoop home should remain same in both machines. Also the below directory structure should be present in the slave as in the master:
- bigdata (our main bigdata projects related directory)
- hadoop (Hadoop App Folder. This one is scp 'ed from the master.)
- hadoopdata (data directory)
- name (dfs.name.dir points to this directory)
- data (dfs.data.dir points to this directory)
- tmp (hadoop.tmp.dir points to this directory)
7. Clear the masters and slaves file in the slave machine.
8. Now clear the $HOME/bigdata/hadoopdata/data directory in both the machines.
$ rm -rf $HOME/bigdata/hadoop/data
9. Format the namenode on master machine:
On Master: $ hadoop namenode -format
Now Your cluster is ready to hadoop to run in fully distributed mode. You can run the start-all.sh in your master machine. It will start the name node, data node, job tracker, secondary name node, task tracker daemons in the master and start a data node and a task tracker in the slave machine. You can check the url master:50070 for name node administration and master:50030 for map reduce administration. After You are done with your hadoop work, dont forget to issue the stop-all.sh on the master node to stop the daemons running on the cluster.