I just created an example of how to use Spark JDBC to get Oracle data into a Hive table. Please see the thread below; a rough sketch of the approach is also included at the end of this message.
How to use Spark JDBC to read from RDBMS table, create Hive ORC table and save RDBMS data in it

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 6 April 2016 at 22:41, Ranadip Chatterjee <ranadi...@gmail.com> wrote:

> I know of projects that have done this but have never seen any advantage
> of "using Spark to do what Sqoop does" - at least in a YARN cluster. Both
> frameworks will have similar overheads of getting the containers allocated
> by YARN and creating new JVMs to do the work. Probably Spark will have a
> slightly higher overhead due to the creation of an RDD before writing the
> data to HDFS - something that the Sqoop mapper need not do. (So what am I
> overlooking here?)
>
> In cases where a data pipeline is being built with the sqooped data being
> the only trigger, there is a justification for using Spark instead of
> Sqoop to short-circuit the data directly into the transformation pipeline.
>
> Regards
> Ranadip
>
> On 6 Apr 2016 7:05 p.m., "Michael Segel" <msegel_had...@hotmail.com> wrote:
>
>> I don’t think it’s necessarily a bad idea.
>>
>> Sqoop is an ugly tool and it requires you to make some assumptions as a
>> way to gain parallelism. (Not that most of the assumptions are not valid
>> for most of the use cases…)
>>
>> Depending on what you want to do… your data may not be persisted on
>> HDFS. There are use cases where your cluster is used for compute and not
>> storage.
>>
>> I’d say that spending time re-inventing the wheel can be a good thing.
>> It would be a good idea for many to rethink their ingestion process so
>> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
>> that term from Dean Wampler. ;-)
>>
>> Just saying. ;-)
>>
>> -Mike
>>
>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> I do not think you can be more resource efficient. In the end you have
>> to store the data anyway on HDFS. You have a lot of development effort
>> for doing something like Sqoop, especially with error handling.
>> You may create a ticket with the Sqoop guys to support Spark as an
>> execution engine, and maybe it is less effort to plug it in there.
>> Maybe if your cluster is loaded then you may want to add more machines
>> or improve the existing programs.
>>
>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>
>> One of the reasons in my mind is to avoid a MapReduce application
>> completely during ingestion, if possible. Also, I can then use a Spark
>> standalone cluster to ingest, even if my Hadoop cluster is heavily
>> loaded. What do you guys think?
>>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Why do you want to reimplement something which is already there?
>>>
>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> Thanks for the reply. My use case is to query ~40 tables from Oracle
>>> (using index and incremental only) and add the data to existing Hive
>>> tables. Also, it would be good to have an option to create the Hive
>>> tables, driven by job-specific configuration.
>>>
>>> What do you think?
>>>
>>> Best
>>> Ayan
>>>
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It depends on your use case for Sqoop.
>>>> What's it like?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> Asking for opinions: is it possible/advisable to use Spark to replace
>>>>> what Sqoop does? Any existing projects done along similar lines?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
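
For anyone who wants to try the Spark JDBC route mentioned at the top of this thread, here is a minimal sketch (Scala, Spark 1.6-era API): read an Oracle table over JDBC in parallel and save it as an ORC-backed table registered in the Hive metastore. The JDBC URL, credentials, schema/table names and split column are hypothetical placeholders, not values from the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OracleToHiveOrc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OracleToHiveOrc"))
    val hc = new HiveContext(sc)

    // Read the source table in parallel by splitting on a numeric column,
    // much like Sqoop's --split-by / --num-mappers.
    val df = hc.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL") // hypothetical
      .option("dbtable", "SCRATCHPAD.SALES")                      // hypothetical
      .option("user", "hduser")                                   // hypothetical
      .option("password", "********")
      .option("partitionColumn", "SALE_ID") // numeric key to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")
      .load()

    // Save as an ORC-backed table registered in the Hive metastore.
    df.write.format("orc").mode("overwrite").saveAsTable("sales_orc")

    sc.stop()
  }
}

Submit with spark-submit and put the Oracle JDBC driver on the classpath (e.g. --jars ojdbc6.jar).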
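
A second sketch, reusing the same HiveContext (hc) and hypothetical connection details as above, covers the incremental pattern Ayan describes: pull only the rows newer than the current high-water mark of the target Hive table and append them. In a real job the table list, key columns and connection settings would come from the job-specific configuration he mentions.

// Current high-water mark of the target Hive table (0 if it is empty).
val maxId = hc.sql(
  "SELECT CAST(COALESCE(MAX(sale_id), 0) AS BIGINT) FROM sales_orc")
  .first().getLong(0)

// Push the filter down to Oracle as a derived table so only new rows move.
val delta = hc.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
  .option("dbtable", s"(SELECT * FROM SCRATCHPAD.SALES WHERE SALE_ID > $maxId) t")
  .option("user", "hduser")
  .option("password", "********")
  .load()

// Append only the new rows to the existing ORC table.
delta.write.format("orc").mode("append").saveAsTable("sales_orc")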