For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and still get data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP engines and Hadoop. I would guess (though I have no idea) that someone like IBM is already doing the same for Spark. Maybe a bit off topic!
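Without a connector like those, the closest a plain Spark JDBC read gets is aligning its partitions with the store's distribution key so that each task pulls a disjoint slice. A minimal sketch, assuming a hypothetical Teradata-style table hash-distributed on CUSTOMER_ID and made-up connection details:

import java.util.Properties

import org.apache.spark.sql.SparkSession

object MppPredicateRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mpp-predicate-read")
      .getOrCreate()

    // Hypothetical connection details - replace with your own.
    val url = "jdbc:teradata://mpp-host/DATABASE=analytics"
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "***")
    props.setProperty("driver", "com.teradata.jdbc.TeraDriver")

    // One predicate per Spark partition. We assume the table is hash-
    // distributed on CUSTOMER_ID into 8 buckets, so each task reads a
    // disjoint slice of the table in parallel.
    val predicates = (0 until 8).map(b => s"MOD(CUSTOMER_ID, 8) = $b").toArray

    val df = spark.read.jdbc(url, "analytics.transactions", predicates, props)
    df.write.mode("overwrite").parquet("/staging/transactions")

    spark.stop()
  }
}

This only gives parallelism that matches the MPP's distribution, not true locality - for that you still need something like QueryGrid or PolyBase sitting between the two engines.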
On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Well, I am not sure, but using a database as storage for Spark, such as a
> relational database or certain NoSQL databases (e.g. MongoDB), is generally
> a bad idea - no data locality, it cannot handle really big data volumes for
> compute, and you may potentially overload an operational database.
> And if your job fails for whatever reason (e.g. scheduling), then you have
> to pull everything out again. Sqoop and HDFS seem to me the more elegant
> solution together with Spark. These assumptions about parallelism have to
> be made with any solution anyway.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support many different tools
> anyway, otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> I don't think it's necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I'd say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice 'data lake' and not a 'data sewer'. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I do not think you can be more resource-efficient. In the end you have to
> store the data on HDFS anyway. You have a lot of development effort to do
> something like Sqoop, especially the error handling.
> You could create a ticket with the Sqoop guys to support Spark as an
> execution engine; maybe it is less effort to plug it in there.
> If your cluster is loaded, then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>
> One of the reasons in my mind is to avoid MapReduce applications completely
> during ingestion, if possible. Also, I can then use a Spark standalone
> cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> Thanks for the reply. My use case is to query ~40 tables from Oracle
>> (using indexes and incremental loads only) and add the data to existing
>> Hive tables. Also, it would be good to have an option to create the Hive
>> tables, driven by job-specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case for Sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace
>>>> what Sqoop does? Are there any existing projects done along similar
>>>> lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Best Regards,
> Ayan Guha
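For what it's worth, here is a rough sketch of the Oracle-to-Hive pull described in the thread, using Spark's partitioned JDBC reader in place of Sqoop. Everything concrete in it is made up for illustration: the connection URL, credentials, table names, the LAST_UPDATED watermark column, and the split column with its bounds and numPartitions, which are exactly the kind of assumptions about parallelism Mike mentions.

import java.util.Properties

import org.apache.spark.sql.SparkSession

// Per-table settings; in practice these would come from the job-specific
// configuration mentioned in the thread (all values here are hypothetical).
case class TableConf(source: String, target: String, splitBy: String,
                     lower: Long, upper: Long, parts: Int, watermark: String)

object OracleToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-to-hive-ingest")
      .enableHiveSupport()
      .getOrCreate()

    val url = "jdbc:oracle:thin:@//oracle-host:1521/ORCL"  // hypothetical
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "***")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    val tables = Seq(
      TableConf("SALES.ORDERS", "lake.orders", "ORDER_ID",
                0L, 50000000L, 16, "2016-04-01"))

    tables.foreach { t =>
      // Push the incremental filter down as a subquery so Oracle can use its
      // index, then split the result across executors on a numeric column.
      val src = s"(SELECT * FROM ${t.source} WHERE LAST_UPDATED > DATE '${t.watermark}') q"
      val df = spark.read.jdbc(url, src, t.splitBy, t.lower, t.upper, t.parts, props)

      // Append into the existing Hive table (columns are matched by position).
      df.write.insertInto(t.target)
    }

    spark.stop()
  }
}

If the job should also create missing Hive tables, as Ayan asks, df.write.saveAsTable could be used for those instead of insertInto, driven by the same per-table configuration.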