Well, I am left to use Spark for importing data from an RDBMS table into Hadoop. You may ask why: because Spark does it in one process and without errors.
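For reference, this is roughly what I mean by a one-process import with the Spark JDBC data source. It is only a minimal sketch, not my actual job: the connection URL, credentials and output path are placeholders, while the sh.sales table comes from the Sqoop log further down. The Oracle JDBC driver jar would need to be on the classpath (e.g. passed with --jars to spark-submit).

import org.apache.spark.sql.SparkSession

object OracleToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OracleToHdfs")
      .getOrCreate()

    // Read the Oracle table through the generic JDBC data source.
    // URL, user and password below are placeholders -- substitute your own.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "sh.sales")
      .option("user", "scott")
      .option("password", "********")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()

    // Land the data on HDFS in the same job; the target path is a placeholder.
    df.write.mode("overwrite").parquet("hdfs:///data/sh_sales")

    spark.stop()
  }
}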
With Sqoop I get the error below: the import lands the RDBMS table data in an HDFS file but then stops there.

2016-09-21 21:00:15,084 [myid:] - INFO [main:OraOopLog@103] - Data Connector for Oracle and Hadoop is disabled.
2016-09-21 21:00:15,095 [myid:] - INFO [main:SqlManager@98] - Using default fetchSize of 1000
2016-09-21 21:00:15,095 [myid:] - INFO [main:CodeGenTool@92] - Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2016-09-21 21:00:15,681 [myid:] - INFO [main:OracleManager@417] - Time zone has been set to GMT
2016-09-21 21:00:15,717 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
2016-09-21 21:00:15,727 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
2016-09-21 21:00:15,748 [myid:] - INFO [main:CompilationManager@94] - HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
Note: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2016-09-21 21:00:17,354 [myid:] - INFO [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
2016-09-21 21:00:17,366 [myid:] - INFO [main:ImportJobBase@237] - Beginning query import.
2016-09-21 21:00:17,511 [myid:] - WARN [main:NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-21 21:00:17,516 [myid:] - INFO [main:Configuration@840] - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2016-09-21 21:00:17,993 [myid:] - INFO [main:Configuration@840] - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2016-09-21 21:00:18,094 [myid:] - INFO [main:RMProxy@56] - Connecting to ResourceManager at rhes564/50.140.197.217:8032
2016-09-21 21:00:23,441 [myid:] - INFO [main:DBInputFormat@192] - Using read commited transaction isolation
2016-09-21 21:00:23,442 [myid:] - INFO [main:DataDrivenDBInputFormat@147] - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from sh.sales where (1 = 1) ) t1
2016-09-21 21:00:23,540 [myid:] - INFO [main:JobSubmitter@394] - number of splits:4
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.job.name is deprecated. Instead, use mapreduce.job.name
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - user.name is deprecated. Instead, use mapreduce.job.user.name
2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
2016-09-21 21:00:23,656 [myid:] - INFO [main:JobSubmitter@477] - Submitting tokens for job: job_1474455325627_0045
2016-09-21 21:00:23,955 [myid:] - INFO [main:YarnClientImpl@174] - Submitted application application_1474455325627_0045 to ResourceManager at rhes564/50.140.197.217:8032
2016-09-21 21:00:23,980 [myid:] - INFO [main:Job@1272] - The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
2016-09-21 21:00:23,981 [myid:] - INFO [main:Job@1317] - Running job: job_1474455325627_0045
2016-09-21 21:00:31,180 [myid:] - INFO [main:Job@1338] - Job job_1474455325627_0045 running in uber mode : false
2016-09-21 21:00:31,182 [myid:] - INFO [main:Job@1345] - map 0% reduce 0%
2016-09-21 21:00:40,260 [myid:] - INFO [main:Job@1345] - map 25% reduce 0%
2016-09-21 21:00:44,283 [myid:] - INFO [main:Job@1345] - map 50% reduce 0%
2016-09-21 21:00:48,308 [myid:] - INFO [main:Job@1345] - map 75% reduce 0%
2016-09-21 21:00:55,346 [myid:] - INFO [main:Job@1345] - map 100% reduce 0%
2016-09-21 21:00:56,359 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0045 completed successfully
2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
On 21 September 2016 at 20:56, Michael Segel <[email protected]> wrote:

> Uhmmm…
>
> A bit of a longer-ish answer…
>
> Spark may or may not be faster than Sqoop. The standard caveats apply… YMMV.
>
> The reason I say this… you have a couple of limiting factors. The main one being the number of connections allowed with the target RDBMS.
>
> Then there’s the data distribution within the partitions / ranges in the database.
> By this, I mean that with any parallel solution, you need to run copies of your query in parallel over different ranges within the database. Most of the time you may run the query over a database where there is even distribution… if not, then you will have one thread run longer than the others. Note that this is a problem that both solutions would face.
>
> Then there’s the cluster itself.
> Again YMMV on your Spark job vs a Map/Reduce job.
>
> In terms of launching the job, setup, etc.… the Spark job could take longer to set up. But on long-running queries, that becomes noise.
>
> The issue is what makes the most sense to you, where you have the most experience, and what you feel the most comfortable using.
>
> The other issue is what you do with the data (RDDs, DataSets, Frames, etc.) once you have read it.
>
> HTH
>
> -Mike
>
> PS. I know that I’m responding to an earlier message in the thread, but this is something that I’ve heard lots of questions about… and it’s not a simple thing to answer… Since this is a batch process, the performance issues are moot.
>
> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <[email protected]> wrote:
>
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG + in-memory offering.
>
> By default Sqoop will use MapReduce, which is pretty slow.
>
> Remember, for Spark you will need to have sufficient memory.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 24 August 2016 at 22:39, Venkata Penikalapati <[email protected]> wrote:
>
>> Team,
>> Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC also have those?
>>
>> I'm performing a few analytics using Spark, for which the data resides in an RDBMS.
>>
>> Please guide me with this.
>>
>> Thanks
>> Venkata Karthik P
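To make the parallel-ranges point Michael raises above concrete, here is a rough sketch of a partitioned JDBC read in Spark. It is an illustration only: the URL, credentials and the lower/upper bounds are assumptions, while sh.sales and prod_id mirror the split column Sqoop chose in the log earlier in this message. Each partition becomes a separate query and a separate connection against the source database, which is exactly where the connection-limit and data-skew caveats apply.

import org.apache.spark.sql.SparkSession

object PartitionedJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionedJdbcRead")
      .getOrCreate()

    // Spark issues one query per partition, each covering a slice of prod_id
    // between lowerBound and upperBound -- the same idea as Sqoop's
    // BoundingValsQuery / --split-by. Each partition holds its own connection
    // to the RDBMS while it is being read.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder
      .option("dbtable", "sh.sales")
      .option("user", "scott")                                // placeholder
      .option("password", "********")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "prod_id")
      .option("lowerBound", "1")    // illustrative bounds; in practice take
      .option("upperBound", "200")  // MIN/MAX of prod_id from the table
      .option("numPartitions", "4") // keep within the RDBMS connection limit
      .load()

    // Skewed prod_id ranges mean uneven partitions, so some tasks will run
    // longer than others -- the distribution issue mentioned above.
    println(s"partitions: ${df.rdd.getNumPartitions}, rows: ${df.count()}")

    spark.stop()
  }
}

Note that the read itself is lazy; memory only becomes a concern once you cache, shuffle or otherwise process the resulting DataFrame.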
