Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you can use Sqoop to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
While it is sometimes necessary to move data in real time, it is most often necessary to load or unload it in bulk. Like Pig, Sqoop is a command-line interpreter: you type Sqoop commands into the interpreter and they are executed one at a time.
Four key features are found in Sqoop:
- Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files in the HDFS file system.
- Direct input: Sqoop can import and map SQL (relational) databases directly into Hive and HBase.
- Data interaction: Sqoop can generate Java classes so that you can interact with the data programmatically.
- Data export: Sqoop can export data directly from HDFS into a relational database using a target table definition based on the specifics of the target database.
Sqoop works by looking at the database you want to import and selecting an appropriate import function for the source data. After it recognizes the input, it then reads the metadata for the table (or database) and creates a class definition of your input requirements.
You can force Sqoop to be very selective, so that you import just the columns you are looking for rather than importing everything and then searching for your data. This can save considerable time. The actual import from the external database to HDFS is performed by a MapReduce job created behind the scenes by Sqoop.
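As a sketch of that selectivity, the `--columns` and `--where` options of the import tool restrict what gets pulled from the source. The command below is illustrative only; it assumes the MySQL `sqoop` database and `stocks` table created later in this tutorial:

```shell
# Illustrative sketch: import only two columns, and only the BSE rows,
# instead of pulling the entire stocks table into HDFS.
# Assumes the "sqoop" database and "stocks" table created later in this tutorial.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root -P \
  --table stocks \
  --columns "symbol,volume" \
  --where "exchange = 'BSE'" \
  -m 1 \
  --target-dir /sqoop/stocks_bse
```

Because the filtering happens in the generated SQL on the database side, only the selected rows and columns ever cross the network.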
Sqoop is an effective tool for non-programmers. The other important item to note is the reliance on underlying technologies like HDFS and MapReduce. You see this repeatedly throughout the elements of the Hadoop ecosystem.
To install Sqoop, the prerequisite is a Hadoop cluster set up in pseudo-distributed or fully distributed mode. For setting this up, you can refer to my earlier tutorials:
Also, you need MySQL (or any traditional RDBMS server) installed on your client. You can refer to the installation steps for the MySQL server on your client in Step 6 of my Hive Installation and Configuration Tutorial at: Hive Installation and Configuration Tutorial
Once you are done with the above two prerequisites, you can move on to the Sqoop installation steps below.
- Download the latest stable release from http://sqoop.apache.org/. I downloaded sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz and placed it in my bigdata directory.
- Extract the contents of the tar.gz file into the sqoop directory inside the bigdata folder.
tar -xvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
mv sqoop-1.4.4.bin__hadoop-1.0.0 sqoop
- Modify the .bashrc file to add the Sqoop home the same way we added the Hadoop/Pig/Hive homes there. Also add the sqoop/bin directory to the system path.
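For reference, assuming Sqoop was extracted to ~/bigdata/sqoop (the path is an assumption; adjust it to your own layout), the .bashrc additions might look like:

```shell
# Assumed install location; adjust to wherever you extracted Sqoop.
export SQOOP_HOME=$HOME/bigdata/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
```

Run `source ~/.bashrc` (or open a new terminal) for the changes to take effect.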
- Sqoop does not ship with third-party JDBC drivers. You must download them separately and save them to the $SQOOP_HOME/lib/ directory. Download the tar files from the links given below, untar them, and place the jars in the $SQOOP_HOME/lib directory.
To install the MySQL JDBC driver: Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html
To install the Oracle JDBC driver: You can download the JDBC driver from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
To install the Microsoft SQL Server JDBC driver: Download the Microsoft SQL Server JDBC driver from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
To install the PostgreSQL JDBC driver: Download the PostgreSQL JDBC driver from http://jdbc.postgresql.org/download.html
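Taking MySQL as an example, installing the driver comes down to untarring the download and copying the jar into Sqoop's lib directory. The exact file names below depend on the connector version you downloaded, so treat them as placeholders:

```shell
# File names are version-dependent; substitute the version you actually downloaded.
tar -xzf mysql-connector-java-5.1.31.tar.gz
cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar $SQOOP_HOME/lib/
```

The same untar-and-copy pattern applies to the Oracle, SQL Server, and PostgreSQL drivers.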
Running Sqoop on your Hadoop Installation
To test the install, enter the command "sqoop help". You get output like:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
To test the migration of a sample MySQL table to HDFS through Sqoop, let's start by creating a sample MySQL table with some data:
cmd> mysql -u username -p
password: "enter your password"
(I have username "root" and password "root".)
mysql> create database sqoop;
mysql> use sqoop;
mysql> create table stocks (exchange varchar(50), symbol varchar(50), stock varchar(50), volume bigint(11));
Let's insert 4 rows into this table called "stocks":
mysql> insert into stocks (exchange, symbol, stock, volume) values ('NYSE', 'GOOG', 'Google', 12000000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'SBI', 'State Bank of India', 100000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'INFY', 'Infosys Technologies Ltd', 1000000);
mysql> insert into stocks (exchange, symbol, stock, volume) values ('BSE', 'TCS', 'Tata Consultancy Services', 1500000);
Now let's see some of the Sqoop commands one by one:
codegen
sqoop codegen --connect jdbc:mysql://localhost:3306/sqoop --table stocks --username root -P
password: "enter your db password for username"
Sample Output:
Sqoop 1.4.4
git commit id 050a2015514533bc25f3134a33401470ee9353ad
Compiled by vasanthkumar on Mon Jul 22 20:01:26 IST 2013
The above command generates a class called stocks.java in the current directory where the command was issued. The generated code is used to interact with the database records.
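The codegen tool can also be steered with a couple of options, for example to control where the generated source lands and what the class is named. The paths below are illustrative:

```shell
# Write the generated code to a chosen directory with a chosen class name.
# /tmp/sqoop-codegen is an illustrative path, not a Sqoop default.
sqoop codegen \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root -P \
  --table stocks \
  --outdir /tmp/sqoop-codegen \
  --class-name Stocks
```

This keeps generated .java files out of your working directory when you run codegen repeatedly.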
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P
The above command displays all the databases in the MySQL server.
sqoop list-tables --connect jdbc:mysql://localhost:3306/sqoop --username root -P
The above command displays all the tables under the database named sqoop.
sqoop eval --connect jdbc:mysql://localhost:3306/sqoop --username root -P -e "select * from stocks limit 10"
The above command lets you evaluate a SQL statement against the MySQL database tables and preview the results. It is always better to check a query with eval before doing any bulk imports.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --table stocks -m 1 --target-dir /sqoop/stocks
The above command will import the stocks table data into an HDFS directory called /sqoop/stocks.
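For repeated loads, Sqoop also supports incremental imports that pick up only new rows since the last run. The sketch below is hedged: --check-column must name a column whose values only grow, and the stocks table above has no such column, so the id column here is hypothetical:

```shell
# Sketch of an incremental append import. "id" is a hypothetical
# auto-increment column; the stocks table created above has none.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root -P \
  --table stocks \
  --incremental append \
  --check-column id \
  --last-value 0 \
  -m 1 \
  --target-dir /sqoop/stocks
```

After each run Sqoop reports the new last value, which you feed into --last-value next time (or let a saved sqoop job remember it for you).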
To check the output of the import command, issue the hadoop fs -ls command to view the file.
hadoop fs -ls /sqoop/stocks/
Now issue the -tail command to view the contents of the imported part-m file.
hadoop fs -tail /sqoop/stocks/part-m-00000
You see that the data we inserted into the MySQL table is present in the file as comma-separated values.
sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username root -P --export-dir /sqoop/stocks/ --table stocks
The above command exports data from HDFS into the MySQL table called stocks. Sqoop assumes the field delimiter is ','; since the data in the HDFS file is delimited by ',', no changes are needed. If the data in HDFS were delimited by '\t', we would have to add another flag: --fields-terminated-by '\t'.
Now when we go to MySQL and do a select * on the stocks table of the sqoop database, we see that we have 8 rows: 2 sets of 4 duplicate rows, because the export command loaded 4 new rows into the table.
sqoop import-all-tables --connect jdbc:mysql://localhost:3306/sqoop --username root -P -m 1 --warehouse-dir /sqoop/
The above command imports all the tables of the MySQL sqoop database into the HDFS directory called /sqoop.
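Finally, recalling the direct-input feature listed at the start, a table can also be imported straight into Hive rather than into plain HDFS files. This sketch assumes a working Hive installation alongside the setup above:

```shell
# Import the MySQL "stocks" table directly into a Hive table.
# Assumes Hive is installed and configured on this cluster.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root -P \
  --table stocks \
  --hive-import \
  --create-hive-table \
  -m 1
```

Sqoop first stages the data in HDFS, then generates and runs the Hive DDL and LOAD statements for you, so the table shows up ready to query in the Hive shell.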