Well, I am left to use Spark for importing data from an RDBMS table into Hadoop. You may ask why: because Spark does it in one process and without errors.
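For reference, this is roughly what I mean by a one-process import with the Spark JDBC data source. It is only a minimal sketch, not my actual job: the connection URL, credentials and output path are placeholders, while the sh.sales table comes from the Sqoop log further down. The Oracle JDBC driver jar would need to be on the classpath (e.g. passed with --jars to spark-submit).

import org.apache.spark.sql.SparkSession

object OracleToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OracleToHdfs")
      .getOrCreate()

    // Read the Oracle table through the generic JDBC data source.
    // URL, user and password below are placeholders -- substitute your own.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "sh.sales")
      .option("user", "scott")
      .option("password", "********")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()

    // Land the data on HDFS in the same job; the target path is a placeholder.
    df.write.mode("overwrite").parquet("hdfs:///data/sh_sales")

    spark.stop()
  }
}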
With Sqoop I get the error below: the import lands the RDBMS table data in an HDFS file but then stops there.

2016-09-21 21:00:15,084 [myid:] - INFO [main:OraOopLog@103] - Data Connector for Oracle and Hadoop is disabled.
2016-09-21 21:00:15,095 [myid:] - INFO [main:SqlManager@98] - Using default fetchSize of 1000
2016-09-21 21:00:15,095 [myid:] - INFO [main:CodeGenTool@92] - Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2016-09-21 21:00:15,681 [myid:] - INFO [main:OracleManager@417] - Time zone has been set to GMT
2016-09-21 21:00:15,717 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
2016-09-21 21:00:15,727 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
2016-09-21 21:00:15,748 [myid:] - INFO [main:CompilationManager@94] - HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
Note: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2016-09-21 21:00:17,354 [myid:] - INFO [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
2016-09-21 21:00:17,366 [myid:] - INFO [main:ImportJobBase@237] - Beginning query import.
2016-09-21 21:00:17,511 [myid:] - WARN [main:NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-21 21:00:17,516 [myid:] - INFO [main:Configuration@840] - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2016-09-21 21:00:17,993 [myid:] - INFO [main:Configuration@840] - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2016-09-21 21:00:18,094 [myid:] - INFO [main:RMProxy@56] - Connecting to ResourceManager at rhes564/50.140.197.217:8032
2016-09-21 21:00:23,441 [myid:] - INFO [main:DBInputFormat@192] - Using read commited transaction isolation
2016-09-21 21:00:23,442 [myid:] - INFO [main:DataDrivenDBInputFormat@147] - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from sh.sales where (1 = 1) ) t1
2016-09-21 21:00:23,540 [myid:] - INFO [main:JobSubmitter@394] - number of splits:4
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.job.name is deprecated. Instead, use mapreduce.job.name
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - user.name is deprecated. Instead, use mapreduce.job.user.name
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
2016-09-21 21:00:23,656 [myid:] - INFO [main:JobSubmitter@477] - Submitting tokens for job: job_1474455325627_0045
2016-09-21 21:00:23,955 [myid:] - INFO [main:YarnClientImpl@174] - Submitted application application_1474455325627_0045 to ResourceManager at rhes564/50.140.197.217:8032
2016-09-21 21:00:23,980 [myid:] - INFO [main:Job@1272] - The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
2016-09-21 21:00:23,981 [myid:] - INFO [main:Job@1317] - Running job: job_1474455325627_0045
2016-09-21 21:00:31,180 [myid:] - INFO [main:Job@1338] - Job job_1474455325627_0045 running in uber mode : false
2016-09-21 21:00:31,182 [myid:] - INFO [main:Job@1345] - map 0% reduce 0%
2016-09-21 21:00:40,260 [myid:] - INFO [main:Job@1345] - map 25% reduce 0%
2016-09-21 21:00:44,283 [myid:] - INFO [main:Job@1345] - map 50% reduce 0%
2016-09-21 21:00:48,308 [myid:] - INFO [main:Job@1345] - map 75% reduce 0%
2016-09-21 21:00:55,346 [myid:] - INFO [main:Job@1345] - map 100% reduce 0%
2016-09-21 21:00:56,359 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0045 completed successfully
2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
On 21 September 2016 at 20:56, Michael Segel <[email protected]> wrote:

> Uhmmm…
>
> A bit of a longer-ish answer…
>
> Spark may or may not be faster than Sqoop. The standard caveats apply… YMMV.
>
> The reason I say this… you have a couple of limiting factors. The main one being the number of connections allowed with the target RDBMS.
>
> Then there’s the data distribution within the partitions / ranges in the database.
> By this, I mean that with any parallel solution, you need to run copies of your query in parallel over different ranges within the database. Most of the time you may run the query over a database where there is even distribution… if not, then you will have one thread run longer than the others. Note that this is a problem that both solutions would face.
>
> Then there’s the cluster itself.
> Again YMMV on your Spark job vs a Map/Reduce job.
>
> In terms of launching the job, setup, etc.… the Spark job could take longer to set up. But on long-running queries, that becomes noise.
>
> The issue is what makes the most sense to you, where you have the most experience, and what you feel the most comfortable using.
>
> The other issue is what you do with the data (RDDs, DataSets, Frames, etc.) once you have read it.
>
> HTH
>
> -Mike
>
> PS. I know that I’m responding to an earlier message in the thread, but this is something that I’ve heard lots of questions about… and it’s not a simple thing to answer… Since this is a batch process, the performance issues are moot.
>
> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <[email protected]> wrote:
>
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG + in-memory offering.
>
> By default Sqoop will use MapReduce, which is pretty slow.
>
> Remember, for Spark you will need to have sufficient memory.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 24 August 2016 at 22:39, Venkata Penikalapati <[email protected]> wrote:
>
>> Team,
>> Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC also have those?
>>
>> I'm performing a few analytics using Spark, for which the data resides in an RDBMS.
>>
>> Please guide me with this.
>>
>> Thanks
>> Venkata Karthik P
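To make the parallel-ranges point Michael raises above concrete, here is a rough sketch of a partitioned JDBC read in Spark. It is an illustration only: the URL, credentials and the lower/upper bounds are assumptions, while sh.sales and prod_id mirror the split column Sqoop chose in the log earlier in this message. Each partition becomes a separate query and a separate connection against the source database, which is exactly where the connection-limit and data-skew caveats apply.

import org.apache.spark.sql.SparkSession

object PartitionedJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionedJdbcRead")
      .getOrCreate()

    // Spark issues one query per partition, each covering a slice of prod_id
    // between lowerBound and upperBound -- the same idea as Sqoop's
    // BoundingValsQuery / --split-by. Each partition holds its own connection
    // to the RDBMS while it is being read.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder
      .option("dbtable", "sh.sales")
      .option("user", "scott")                                // placeholder
      .option("password", "********")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "prod_id")
      .option("lowerBound", "1")    // illustrative bounds; in practice take
      .option("upperBound", "200")  // MIN/MAX of prod_id from the table
      .option("numPartitions", "4") // keep within the RDBMS connection limit
      .load()

    // Skewed prod_id ranges mean uneven partitions, so some tasks will run
    // longer than others -- the distribution issue mentioned above.
    println(s"partitions: ${df.rdd.getNumPartitions}, rows: ${df.count()}")

    spark.stop()
  }
}

Note that the read itself is lazy; memory only becomes a concern once you cache, shuffle or otherwise process the resulting DataFrame.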
