For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and still get data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP engines and Hadoop. I would guess (though I have no idea) that someone like IBM is already doing the same for Spark. Maybe a bit off topic!
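Without a connector like those, the closest a plain Spark JDBC read gets is aligning its partitions with the store's distribution key so that each task pulls a disjoint slice. A minimal sketch, assuming a hypothetical Teradata-style table hash-distributed on CUSTOMER_ID and made-up connection details:

import java.util.Properties

import org.apache.spark.sql.SparkSession

object MppPredicateRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mpp-predicate-read")
      .getOrCreate()

    // Hypothetical connection details - replace with your own.
    val url = "jdbc:teradata://mpp-host/DATABASE=analytics"
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "***")
    props.setProperty("driver", "com.teradata.jdbc.TeraDriver")

    // One predicate per Spark partition. We assume the table is hash-
    // distributed on CUSTOMER_ID into 8 buckets, so each task reads a
    // disjoint slice of the table in parallel.
    val predicates = (0 until 8).map(b => s"MOD(CUSTOMER_ID, 8) = $b").toArray

    val df = spark.read.jdbc(url, "analytics.transactions", predicates, props)
    df.write.mode("overwrite").parquet("/staging/transactions")

    spark.stop()
  }
}

This only gives parallelism that matches the MPP's distribution, not true locality - for that you still need something like QueryGrid or PolyBase sitting between the two engines.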
On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Well, I am not sure, but using a database as storage for Spark, such as a
> relational database or certain NoSQL databases (e.g. MongoDB), is generally
> a bad idea - no data locality, it cannot handle really big data volumes for
> compute, and you may potentially overload an operational database.
> And if your job fails for whatever reason (e.g. scheduling), then you have
> to pull everything out again. Sqoop and HDFS seem to me the more elegant
> solution together with Spark. These assumptions about parallelism have to
> be made with any solution anyway.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support many different tools
> anyway, otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> I don't think it's necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I'd say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice 'data lake' and not a 'data sewer'. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I do not think you can be more resource-efficient. In the end you have to
> store the data on HDFS anyway. You have a lot of development effort to do
> something like Sqoop, especially the error handling.
> You could create a ticket with the Sqoop guys to support Spark as an
> execution engine; maybe it is less effort to plug it in there.
> If your cluster is loaded, then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>
> One of the reasons in my mind is to avoid MapReduce applications completely
> during ingestion, if possible. Also, I can then use a Spark standalone
> cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> Thanks for the reply. My use case is to query ~40 tables from Oracle
>> (using indexes and incremental loads only) and add the data to existing
>> Hive tables. Also, it would be good to have an option to create the Hive
>> tables, driven by job-specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case for Sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace
>>>> what Sqoop does? Are there any existing projects done along similar
>>>> lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Best Regards,
> Ayan Guha
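For what it's worth, here is a rough sketch of the Oracle-to-Hive pull described in the thread, using Spark's partitioned JDBC reader in place of Sqoop. Everything concrete in it is made up for illustration: the connection URL, credentials, table names, the LAST_UPDATED watermark column, and the split column with its bounds and numPartitions, which are exactly the kind of assumptions about parallelism Mike mentions.

import java.util.Properties

import org.apache.spark.sql.SparkSession

// Per-table settings; in practice these would come from the job-specific
// configuration mentioned in the thread (all values here are hypothetical).
case class TableConf(source: String, target: String, splitBy: String,
                     lower: Long, upper: Long, parts: Int, watermark: String)

object OracleToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-to-hive-ingest")
      .enableHiveSupport()
      .getOrCreate()

    val url = "jdbc:oracle:thin:@//oracle-host:1521/ORCL"  // hypothetical
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "***")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    val tables = Seq(
      TableConf("SALES.ORDERS", "lake.orders", "ORDER_ID",
                0L, 50000000L, 16, "2016-04-01"))

    tables.foreach { t =>
      // Push the incremental filter down as a subquery so Oracle can use its
      // index, then split the result across executors on a numeric column.
      val src = s"(SELECT * FROM ${t.source} WHERE LAST_UPDATED > DATE '${t.watermark}') q"
      val df = spark.read.jdbc(url, src, t.splitBy, t.lower, t.upper, t.parts, props)

      // Append into the existing Hive table (columns are matched by position).
      df.write.insertInto(t.target)
    }

    spark.stop()
  }
}

If the job should also create missing Hive tables, as Ayan asks, df.write.saveAsTable could be used for those instead of insertInto, driven by the same per-table configuration.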