Hi team,

I'm building an ETL tool that needs to pull a bunch of tables from a database into HDFS, and I'm currently doing this sequentially with Sqoop. I figured it might be faster to submit the Sqoop jobs in parallel with a predefined thread pool (currently trying 8), because ingesting 150 tables of various sizes took about two hours, and frankly they're not very big tables since this is a POC.

Sequentially this works fine, but as soon as I add parallelism, roughly 75% of my Sqoop jobs fail. It's not that they don't ingest any data; rather, the data gets stuck in the staging area (i.e. /user/username) instead of landing in the proper Hive table location (i.e. /user/username/Hive/Lab). Has anyone experienced this before? I figure I could spawn a separate process that moves the tables from the staging area into the Hive table area, but I'm not sure whether that process would simply be a matter of moving the files or whether there is more involved.
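For reference, here's roughly how I'm kicking off the parallel imports: a minimal sketch, with the connection string, credentials, and table list as placeholders for what the real job reads from config.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSqoopIngest {

    // Placeholder values -- the real tool pulls these from configuration.
    private static final String JDBC_URL = "jdbc:mysql://dbhost:3306/sourcedb";
    private static final List<String> TABLES = Arrays.asList("table_a", "table_b", "table_c");

    public static void main(String[] args) throws InterruptedException {
        // The thread pool size I'm currently experimenting with.
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (String table : TABLES) {
            pool.submit(() -> {
                try {
                    // Each task shells out to the sqoop CLI for one table.
                    Process p = new ProcessBuilder(
                            "sqoop", "import",
                            "--connect", JDBC_URL,
                            "--username", "etl_user",
                            "--password", "secret",
                            "--table", table,
                            "--hive-import",
                            "--num-mappers", "4")
                        .inheritIO()
                        .start();
                    int exitCode = p.waitFor();
                    System.out.println(table + " finished with exit code " + exitCode);
                } catch (Exception e) {
                    System.err.println(table + " failed: " + e.getMessage());
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(4, TimeUnit.HOURS);
    }
}
```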
Thanks!

Specs: HDP 2.1, Sqoop 1.4.4.2

Cheers,
Jack
