We just had this conversation at work today. We have a long Sqoop pipeline, and I argued to keep it in Sqoop since we can take advantage of OraOop (direct mode) for performance; as far as I know, Spark can't match that. Sqoop also lets us write directly into Parquet format, which Spark can then read optimally.
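For reference, a minimal sketch of the kind of import described above, assuming Sqoop 1.4.5+ (where OraOop ships as the built-in "Data Connector for Oracle and Hadoop" and is activated by --direct); the connection string, credentials and paths are placeholders:

    # hypothetical Oracle host/service, credentials and HDFS paths
    sqoop import \
      --connect jdbc:oracle:thin:@//oradb:1521/ORCL \
      --username scott \
      --password-file /user/hduser/ora.pwd \
      --table SH.SALES \
      --split-by PROD_ID \
      --num-mappers 4 \
      --direct \
      --as-parquetfile \
      --target-dir /data/sh_sales

Whether --direct actually engages depends on the connector's preconditions being met; the "Data Connector for Oracle and Hadoop is disabled" line in the Sqoop log further down this thread is what it looks like when they are not.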
In regards to the MB_MILLIS_MAPS exception: I ran into this same exception a while ago running Titan DB map/reduce jobs. The cause was a version mismatch among the jars in the distribution; your CLASSPATH probably contains some old jar files.

-Don
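A quick way to check for the mismatch Don describes (a sketch, assuming the Hadoop 2.7.3 and HBase 0.98.21 install paths that appear in the logs quoted below; MB_MILLIS_MAPS exists only in newer hadoop-mapreduce-client jars, so a JobCounter class that lacks it marks the stale copy):

    # check every MapReduce client-core jar the two installs ship:
    # a jar whose JobCounter lacks MB_MILLIS_MAPS (count 0) is the stale one
    for j in $(find /home/hduser/hadoop-2.7.3 /data6/hduser/hbase-0.98.21-hadoop2/lib \
                    -name 'hadoop-mapreduce-client-core-*.jar'); do
      echo "$j: $(unzip -p "$j" org/apache/hadoop/mapreduce/JobCounter.class | strings | grep -c MB_MILLIS_MAPS)"
    done

If jars from two different Hadoop versions turn up, removing the older ones from the CLASSPATH (most likely the copies bundled under the HBase/Phoenix lib directory) is the usual fix.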
On Wed, Sep 21, 2016 at 6:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I do not know why this is happening.
>
> Trying to load an HBase table at the command line:
>
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>   -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,c1,c2" \
>   t2 hdfs://rhes564:9000/tmp/crap.txt
>
> comes back with this error:
>
> 2016-09-22 00:12:46,576 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1474455325627_0052
> 2016-09-22 00:12:46,755 INFO [main] impl.YarnClientImpl: Submitted application application_1474455325627_0052 to ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-22 00:12:46,783 INFO [main] mapreduce.Job: The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0052/
> 2016-09-22 00:12:46,783 INFO [main] mapreduce.Job: Running job: job_1474455325627_0052
> 2016-09-22 00:12:55,913 INFO [main] mapreduce.Job: Job job_1474455325627_0052 running in uber mode : false
> 2016-09-22 00:12:55,915 INFO [main] mapreduce.Job: map 0% reduce 0%
> 2016-09-22 00:13:01,994 INFO [main] mapreduce.Job: map 100% reduce 0%
> 2016-09-22 00:13:03,008 INFO [main] mapreduce.Job: Job job_1474455325627_0052 completed successfully
> Exception in thread "main" java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>   at java.lang.Enum.valueOf(Enum.java:238)
>   at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
>   at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
>   at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
>   at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
>   at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
>   at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
>   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
>   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>   at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
>   at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
>   at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:680)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:684)
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 21 September 2016 at 21:47, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I think there might still be something messed up with the classpath. It complains in the logs about deprecated jars and deprecated configuration files.
>>
>> On 21 Sep 2016, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Well, I am left to use Spark for importing data from an RDBMS table into Hadoop.
>>
>> You may ask why, and it is because Spark does it in one process with no errors.
>>
>> With Sqoop I am getting this error message, which leaves the RDBMS table data in an HDFS file but stops there:
>>
>> 2016-09-21 21:00:15,084 [myid:] - INFO [main:OraOopLog@103] - Data Connector for Oracle and Hadoop is disabled.
>> 2016-09-21 21:00:15,095 [myid:] - INFO [main:SqlManager@98] - Using default fetchSize of 1000
>> 2016-09-21 21:00:15,095 [myid:] - INFO [main:CodeGenTool@92] - Beginning code generation
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> 2016-09-21 21:00:15,681 [myid:] - INFO [main:OracleManager@417] - Time zone has been set to GMT
>> 2016-09-21 21:00:15,717 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
>> 2016-09-21 21:00:15,727 [myid:] - INFO [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where (1 = 0)
>> 2016-09-21 21:00:15,748 [myid:] - INFO [main:CompilationManager@94] - HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
>> Note: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java uses or overrides a deprecated API.
>> Note: Recompile with -Xlint:deprecation for details.
>> 2016-09-21 21:00:17,354 [myid:] - INFO [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
>> 2016-09-21 21:00:17,366 [myid:] - INFO [main:ImportJobBase@237] - Beginning query import.
>> 2016-09-21 21:00:17,511 [myid:] - WARN [main:NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2016-09-21 21:00:17,516 [myid:] - INFO [main:Configuration@840] - mapred.jar is deprecated. Instead, use mapreduce.job.jar
>> 2016-09-21 21:00:17,993 [myid:] - INFO [main:Configuration@840] - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
>> 2016-09-21 21:00:18,094 [myid:] - INFO [main:RMProxy@56] - Connecting to ResourceManager at rhes564/50.140.197.217:8032
>> 2016-09-21 21:00:23,441 [myid:] - INFO [main:DBInputFormat@192] - Using read commited transaction isolation
>> 2016-09-21 21:00:23,442 [myid:] - INFO [main:DataDrivenDBInputFormat@147] - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from sh.sales where (1 = 1) ) t1
>> 2016-09-21 21:00:23,540 [myid:] - INFO [main:JobSubmitter@394] - number of splits:4
>> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.job.name is deprecated. Instead, use mapreduce.job.name
>> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
>> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
>> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
>> 2016-09-21 21:00:23,547 [myid:] - INFO [main:Configuration@840] - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - user.name is deprecated. Instead, use mapreduce.job.user.name
>> 2016-09-21 21:00:23,548 [myid:] - INFO [main:Configuration@840] - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
>> 2016-09-21 21:00:23,549 [myid:] - INFO [main:Configuration@840] - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
>> 2016-09-21 21:00:23,656 [myid:] - INFO [main:JobSubmitter@477] - Submitting tokens for job: job_1474455325627_0045
>> 2016-09-21 21:00:23,955 [myid:] - INFO [main:YarnClientImpl@174] - Submitted application application_1474455325627_0045 to ResourceManager at rhes564/50.140.197.217:8032
>> 2016-09-21 21:00:23,980 [myid:] - INFO [main:Job@1272] - The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
>> 2016-09-21 21:00:23,981 [myid:] - INFO [main:Job@1317] - Running job: job_1474455325627_0045
>> 2016-09-21 21:00:31,180 [myid:] - INFO [main:Job@1338] - Job job_1474455325627_0045 running in uber mode : false
>> 2016-09-21 21:00:31,182 [myid:] - INFO [main:Job@1345] - map 0% reduce 0%
>> 2016-09-21 21:00:40,260 [myid:] - INFO [main:Job@1345] - map 25% reduce 0%
>> 2016-09-21 21:00:44,283 [myid:] - INFO [main:Job@1345] - map 50% reduce 0%
>> 2016-09-21 21:00:48,308 [myid:] - INFO [main:Job@1345] - map 75% reduce 0%
>> 2016-09-21 21:00:55,346 [myid:] - INFO [main:Job@1345] - map 100% reduce 0%
>> 2016-09-21 21:00:56,359 [myid:] - INFO [main:Job@1356] - Job job_1474455325627_0045 completed successfully
>> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
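For comparison, the Spark JDBC route argued for in this thread would look roughly like the sketch below (Scala, Spark 2.x). It is an illustration only: the Oracle URL and credentials are placeholders, while the table (sh.sales), the split column (prod_id), the four-way parallelism and the HDFS host mirror the Sqoop run logged above; the bounds stand in for the MIN/MAX values that Sqoop's BoundingValsQuery fetched automatically.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sales-import").getOrCreate()

    // hypothetical connection details; table and split column taken from the logs above
    val df = spark.read.format("jdbc")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("url", "jdbc:oracle:thin:@//oradb:1521/ORCL")
      .option("user", "scott")
      .option("password", "tiger")
      .option("dbtable", "sh.sales")
      .option("partitionColumn", "prod_id")   // same split key Sqoop chose
      .option("lowerBound", "1")              // stand-ins for MIN/MAX(prod_id)
      .option("upperBound", "1000")
      .option("numPartitions", "4")           // mirrors Sqoop's 4 splits
      .load()

    // land the data as Parquet, as the Sqoop pipeline does with --as-parquetfile
    df.write.mode("overwrite").parquet("hdfs://rhes564:9000/data/sh_sales_parquet")

If prod_id is unevenly distributed, the read.jdbc overload that takes an explicit array of WHERE-clause predicates gives finer control over the ranges, which bears on Michael's point below about skewed partitions.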
>> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com> wrote:
>>
>>> Uhmmm…
>>>
>>> A bit of a longer-ish answer…
>>>
>>> Spark may or may not be faster than Sqoop. The standard caveats apply… YMMV.
>>>
>>> The reason I say this: you have a couple of limiting factors, the main one being the number of connections allowed by the target RDBMS.
>>>
>>> Then there's the data distribution within the partitions/ranges in the database. By this I mean that with any parallel solution, you need to run copies of your query in parallel over different ranges within the database. Most of the time you may run the query over a database where there is an even distribution… if not, you will have one thread run longer than the others. Note that this is a problem both solutions would face.
>>>
>>> Then there's the cluster itself. Again, YMMV on your Spark job vs a Map/Reduce job.
>>>
>>> In terms of launching the job, setup, etc., the Spark job could take longer to set up, but on long-running queries that becomes noise.
>>>
>>> The issue is what makes the most sense to you, where you have the most experience, and what you feel most comfortable using.
>>>
>>> The other issue is what you do with the data (RDDs, Datasets, DataFrames, etc.) once you have read it.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> PS. I know that I'm responding to an earlier message in the thread, but this is something that I've heard lots of questions about, and it's not a simple thing to answer. Since this is a batch process, the performance issues are moot.
>>>
>>> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Personally I prefer Spark JDBC.
>>>
>>> Both Sqoop and Spark rely on the same drivers.
>>>
>>> I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG + in-memory offering.
>>>
>>> By default Sqoop uses MapReduce, which is pretty slow.
>>>
>>> Remember, for Spark you will need sufficient memory.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 24 August 2016 at 22:39, Venkata Penikalapati <mail.venkatakart...@gmail.com> wrote:
>>>
>>>> Team,
>>>> Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC have those too?
>>>>
>>>> I'm performing some analytics using Spark on data that resides in an RDBMS.
>>>>
>>>> Please guide me with this.
>>>>
>>>> Thanks
>>>> Venkata Karthik P

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
http://www.MailLaunder.com/
800-733-2143