Friday, November 1, 2013

Setting up and Running Sqoop on our Hadoop Installation.

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you can use Sqoop to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
While it is sometimes necessary to move data in real time, it is most often necessary to load or unload data in bulk. Like Pig, Sqoop is driven from the command line: you type Sqoop commands and they are executed one at a time.

Four key features are found in Sqoop:
  • Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files in the HDFS file system.
  • Direct input: Sqoop can import and map SQL (relational) databases directly into Hive and HBase.
  • Data interaction: Sqoop can generate Java classes so that you can interact with the data programmatically.
  • Data export: Sqoop can export data directly from HDFS into a relational database using a target table definition based on the specifics of the target database.

Sqoop works by looking at the database you want to import and selecting an appropriate import function for the source data. After it recognizes the input, it then reads the metadata for the table (or database) and creates a class definition of your input requirements.
You can make Sqoop selective, so that only the columns and rows you are looking for are imported rather than importing everything and then filtering afterwards; this can save considerable time. The actual import from the external database to HDFS is performed by a MapReduce job that Sqoop creates behind the scenes.
Sqoop is an effective tool for non-programmers. The other important thing to note is its reliance on underlying technologies like HDFS and MapReduce; you see this pattern repeatedly across the Hadoop ecosystem.
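For example, a selective import of just two columns with a row filter might look roughly like this (a sketch only; the connection string, table and column names are illustrative, while --columns, --where, -m and --target-dir are standard Sqoop import options):

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --table stocks --columns "symbol,volume" --where "volume > 1000000" -m 1 --target-dir /sqoop/stocks_selected

Only the listed columns and the rows matching the --where filter are transferred, which is where the time saving comes from.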

Installation:

Prerequisites:
The prerequisite for installing Sqoop is a Hadoop cluster set up in pseudo-distributed or fully distributed mode. For setting this up, you can refer to my earlier Hadoop tutorials.
You also need MySQL (or any traditional RDBMS) installed on your client machine. You can refer to the MySQL server installation steps in Step 6 of my Hive Installation and Configuration Tutorial: Hive Installation and Configuration Tutorial

Once you are done with the above two prerequisites, you can move on to the Sqoop installation steps below.
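Before moving on, a quick sanity check that the Hadoop daemons are running never hurts (assuming the jps tool from your JDK is on the path):

jps

You should see the Hadoop processes listed (for example NameNode, DataNode, JobTracker and TaskTracker in a Hadoop 1.x pseudo-distributed setup).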

  • Download the latest stable release from http://sqoop.apache.org/. I downloaded sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz and placed it in my bigdata directory.
  • Extract the contents of the tar.gz file and rename the extracted directory to sqoop inside the bigdata folder.

cd bigdata
tar -xvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
mv sqoop-1.4.4.bin__hadoop-1.0.0 sqoop

  • Modify the .bashrc file to add SQOOP_HOME the same way we added HADOOP_HOME, PIG_HOME and HIVE_HOME there. Also add the sqoop/bin directory to the system PATH.

vim ~/.bashrc
export HADOOP_HOME=/home/hduser/bigdata/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PIG_HOME=/home/hduser/bigdata/pig
export PIG_CLASSPATH=$HADOOP_HOME/conf
export HIVE_HOME=/home/hduser/bigdata/hive
export SQOOP_HOME=/home/hduser/bigdata/sqoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin:$HIVE_HOME/bin:$SQOOP_HOME/bin
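To make the new variables take effect in the current shell, reload the file:

source ~/.bashrc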

  • Sqoop does not ship with third-party JDBC drivers; you must download them separately and save them to the $SQOOP_HOME/lib/ directory. Download the archives from the links below, untar them and place the jars in the $SQOOP_HOME/lib directory (see the example after this list).

To install the MySQL JDBC driver: download it from http://www.mysql.com/downloads/connector/j/5.1.html
To install the Oracle JDBC driver: download it from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
To install the Microsoft SQL Server JDBC driver: download it from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
To install the PostgreSQL JDBC driver: download it from http://jdbc.postgresql.org/download.html
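For example, for the MySQL driver the steps look roughly like this (the connector version shown here is only illustrative; use the file name of the version you actually downloaded):

tar -xvf mysql-connector-java-5.1.26.tar.gz
cp mysql-connector-java-5.1.26/mysql-connector-java-5.1.26-bin.jar $SQOOP_HOME/lib/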

Running Sqoop on your Hadoop Installation

To test the installation, enter the command "sqoop help". You should see output like:
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

To test migrating a sample MySQL table to HDFS through Sqoop, let's start by creating a sample MySQL table with some data:
cmd> mysql -u username -p
password: "enter your password"
(I am using "root" as both the username and the password.)
mysql> create database sqoop;
mysql> use sqoop;
mysql> create table stocks (exchange varchar(50), symbol varchar(50), stock varchar(50), volume bigint(11));
Let's insert 4 rows into this table called "stocks".
mysql> insert into stocks (exchange, symbol, stock, volume) values ('NYSE', 'GOOG', 'Google', 12000000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'SBI', 'State Bank of India', 100000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'INFY', 'Infosys Technologies Ltd', 1000000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'TCS', 'Tata Consultancy Services', 1500000);
mysql> quit;
Now let's look at some of the Sqoop commands one by one:
version
sqoop version
Sample Output:    Sqoop 1.4.4
git commit id 050a2015514533bc25f3134a33401470ee9353ad
Compiled by vasanthkumar on Mon Jul 22 20:01:26 IST 2013
codegen
sqoop codegen --connect jdbc:mysql://localhost:3306/sqoop --table stocks --username root -P
(the -P flag will prompt you to enter the database password for that username)
The above command generates a class file called stocks.java in the directory where the command was issued. The generated code is used to interact with the database records.
 
list-databases
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P
The above command displays all the databases in the MySQL server.
 
list-tables 
 
sqoop list-tables --connect jdbc:mysql://localhost:3306/sqoop --username root -P
 
The above command displays all the tables in the database called sqoop.
 

eval 
sqoop eval --connect jdbc:mysql://localhost:3306/sqoop --username root -P -e "select * from stocks limit 10"
The above command is useful for evaluating a SQL statement against the MySQL database and previewing its results. It is always better to check a query with eval before doing any bulk import.

import
 
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --table stocks -m 1 --target-dir /sqoop/stocks
 
The above command imports the stocks table data into an HDFS directory called /sqoop/stocks, using a single map task (-m 1).

To check the output of the import command, issue the hadoop fs -ls command to view the file.
hadoop fs -ls /sqoop/stocks/
Now issue the -tail command to view the contents of the imported part-m file.
hadoop fs -tail /sqoop/stocks/part-m-00000
You will see that the data we inserted into the MySQL table is present in the file as comma-separated values.
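Since Sqoop can also load imported data straight into Hive (the direct-input feature mentioned at the start), a variant of the same import might look roughly like this (a sketch; it assumes Hive is configured as in the earlier tutorial, and uses the standard --hive-import and --hive-table options):

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --table stocks -m 1 --hive-import --hive-table stocks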

export
sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username root -P --export-dir /sqoop/stocks/ --table stocks
The above command exports the data from HDFS back into the MySQL table called stocks. Sqoop assumes the fields in the HDFS files are delimited by ','; since our imported data is comma-delimited, no changes are needed. If the data in HDFS were delimited by '\t' instead, we would have to add an extra delimiter flag such as --input-fields-terminated-by '\t'.
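For instance, if the stocks data in HDFS had been written with tab-separated fields, the export would look roughly like this (a sketch; --input-fields-terminated-by is Sqoop's input-parsing delimiter option for export):

sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username root -P --table stocks --export-dir /sqoop/stocks/ --input-fields-terminated-by '\t'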

Now when we go to MySQL and do a select * on the stocks table of the sqoop database, we see that we have 8 rows: two sets of the 4 original rows. The export command has loaded 4 new rows into the table.
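You can verify this from the mysql client, for example:

mysql> select count(*) from stocks;

The count should now be 8 instead of the original 4.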

import-all-tables
sqoop import-all-tables --connect jdbc:mysql://localhost:3306/sqoop --username root -P -m 1  --warehouse-dir /sqoop/
The above command imports all the tables of the MySQL sqoop database into the HDFS directory /sqoop.
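As with the single-table import, you can confirm the result by listing the warehouse directory; each imported table gets its own sub-directory under it:

hadoop fs -ls /sqoop/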
