RE: Sqoop on Spark

Yong Zhang Wed, 06 Apr 2016 16:26:41 -0700

Good to know.
That is why Sqoop has its "direct" mode: to make use of vendor-specific 
features.
But for MPP, I still think it makes sense for vendors to provide some kind of 
InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate 
with them more natively.
Yong

Date: Wed, 6 Apr 2016 16:12:30 -0700
Subject: Re: Sqoop on Spark
From: mohaj...@gmail.com
To: java8...@hotmail.com
CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; 
guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org

It is using a JDBC driver; I know that's the case for Teradata: 
http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available

Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop 
is parallelized and works with ORC and probably other formats as well. It is 
using JDBC for each connection between data nodes and their AMP (compute) 
nodes. There is an additional layer that coordinates all of it. I know Oracle 
has a similar technology; I've used it and had to supply the JDBC driver.
Teradata Connector is for batch data copy; QueryGrid is for interactive data 
movement.
On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:

If they do that, they must provide a customized input format, instead of 
through JDBC.
Yong

Date: Wed, 6 Apr 2016 23:56:54 +0100
Subject: Re: Sqoop on Spark
From: mich.talebza...@gmail.com
To: mohaj...@gmail.com
CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; 
linguin....@gmail.com; user@spark.apache.org

SAP Sybase IQ does that, and I believe SAP HANA does as well.

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
For some MPP relational stores (not operational) it may be feasible to run 
Spark jobs and also have data locality. I know QueryGrid (Teradata) and 
PolyBase (Microsoft) use data locality to move data between their MPP and 
Hadoop. I would guess (I have no idea) that someone like IBM is already doing 
that for Spark. Maybe a bit off topic!
On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Well, I am not sure, but using a database as storage for Spark, such as a 
relational database or certain NoSQL databases (e.g. MongoDB), is generally a 
bad idea: there is no data locality, it cannot handle really big data volumes 
for compute, and you may overload an operational database. And if your job 
fails for whatever reason (e.g. scheduling), you have to pull everything out 
again. Sqoop and HDFS, together with Spark, seem to me the more elegant 
solution. These "assumptions" about parallelism have to be made with any 
solution anyway. Of course you can always redo things, but why? What benefit 
do you expect? A real big data platform has to support many different tools 
anyway; otherwise the people doing analytics will be limited.
On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:

I don’t think it’s necessarily a bad idea.
Sqoop is an ugly tool and it requires you to make some assumptions as a way to 
gain parallelism. (Not that most of the assumptions are not valid for most of 
the use cases…) 
Depending on what you want to do… your data may not be persisted on HDFS.  
There are use cases where your cluster is used for compute and not storage.
I’d say that spending time re-inventing the wheel can be a good thing. It would 
be a good idea for many to rethink their ingestion process so that they can 
have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean 
Wampler. ;-) 
Just saying. ;-) 
-Mike
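
For readers unfamiliar with the parallelism assumption discussed here: a 
parallel JDBC import typically assumes a numeric split column whose values are 
spread fairly evenly between a known minimum and maximum, and gives each task 
one slice of that range. The Python sketch below only illustrates that idea. 
The function name, the column name and the bounds are hypothetical, and this is 
not Sqoop's actual implementation.

```python
def split_predicates(column, lower, upper, num_splits):
    """Build WHERE-clause predicates that cover [lower, upper] with
    contiguous, non-overlapping integer ranges of near-equal width.

    Assumes `column` is an integer key (hypothetical name supplied by the
    caller). Rows where the column is NULL are not covered by any range.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if upper < lower:
        raise ValueError("upper must be >= lower")

    total = upper - lower + 1
    # Never create more splits than there are distinct key values.
    num_splits = min(num_splits, total)
    base, extra = divmod(total, num_splits)

    predicates = []
    start = lower
    for i in range(num_splits):
        # The first `extra` ranges take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
        start = end + 1
    return predicates


if __name__ == "__main__":
    for p in split_predicates("ID", 1, 10, 3):
        print(p)
```

Each predicate could be handed to one reader task, for example through the 
`predicates` argument of PySpark's `DataFrameReader.jdbc`. If the key values 
are skewed, the slices hold very different row counts, which is exactly the 
assumption being made.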
On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
I do not think you can be more resource efficient. In the end you have to store 
the data on HDFS anyway. Building something like Sqoop takes a lot of 
development effort, especially for error handling. You could create a ticket 
with the Sqoop guys to support Spark as an execution engine; maybe it is less 
effort to plug it in there. If your cluster is loaded, you may want to add 
more machines or improve the existing programs.
On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:

One of the reasons I have in mind is to avoid MapReduce applications completely 
during ingestion, if possible. Also, I could then use a Spark standalone 
cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you 
guys think?
On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Why do you want to reimplement something which is already there?
On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:

Hi
Thanks for the reply. My use case is to query ~40 tables from Oracle (using 
indexes and incremental loads only) and add the data to existing Hive tables. 
Also, it would be good to have an option to create the Hive table, driven by 
job-specific configuration. 
What do you think?
Best
Ayan
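
A minimal sketch of what a configuration-driven incremental pull like this 
could look like, written in Python. Everything named here is an assumption for 
illustration: the `TableJob` fields, the table and column names, and the idea 
of tracking a numeric high-water mark per table. It only builds the source 
query and does not connect to Oracle or Hive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableJob:
    """Per-table job configuration (all field names are hypothetical)."""
    source_table: str          # Oracle table to read from
    check_column: str          # indexed, increasing numeric column
    last_value: Optional[int]  # high-water mark of the previous run, None on first load
    hive_table: str            # existing Hive table to append to


def incremental_dbtable(job: TableJob) -> str:
    """Return a parenthesized subquery that selects only rows newer than
    the stored high-water mark, or the whole table on the first run."""
    if job.last_value is None:
        return f"(SELECT * FROM {job.source_table}) src"
    # int() guards against a non-numeric value leaking into the SQL text.
    return (
        f"(SELECT * FROM {job.source_table} "
        f"WHERE {job.check_column} > {int(job.last_value)}) src"
    )


if __name__ == "__main__":
    jobs = [
        TableJob("SALES.ORDERS", "ORDER_ID", 1000, "staging.orders"),
        TableJob("SALES.CUSTOMERS", "CUSTOMER_ID", None, "staging.customers"),
    ]
    for job in jobs:
        print(job.hive_table, "<-", incremental_dbtable(job))
```

In a Spark job, the returned string could be passed as the table argument of 
`DataFrameReader.jdbc`, and the resulting DataFrame appended to the Hive table 
with `DataFrameWriter.insertInto`. Storing and updating the high-water mark 
between runs is left out of this sketch.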
On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
Hi,
It depends on your Sqoop use case. What's it like?
// maropu
On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
Hi All
Asking for opinions: is it possible/advisable to use Spark to replace what 
Sqoop does? Are there any existing projects along similar lines?
-- 
Best Regards,
Ayan Guha

-- 
---
Takeshi Yamamuro
