Well, I had to write Scala code and compile it with Maven to make it work. It is still running. The good news, as I expected, is that it is doing a Direct Path Read (as opposed to a Conventional Path Read) from the source Oracle database; the ASH output further below confirms this, and a sketch of the approach follows.
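A minimal sketch of the kind of Spark job described here, assuming Spark 1.6 with a HiveContext. The JDBC URL and the scratchpad.dummy / oraclehadoop.dummy table names come from the sqoop command quoted further down; the credentials, ID bounds and partition count are illustrative placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OracleToHiveORC {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OracleToHiveORC"))
    val hc = new HiveContext(sc)

    // Read the source table over JDBC. Partitioning on ID splits the scan
    // across executors; these parallel full scans are what surface in ASH
    // as direct path reads.
    val df = hc.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@rhes564:1521:mydb12",
      "driver"          -> "oracle.jdbc.OracleDriver",
      "dbtable"         -> "scratchpad.dummy",
      "user"            -> "scratchpad",
      "password"        -> "xxxx",           // placeholder
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",              // illustrative bounds
      "upperBound"      -> "1000000000",
      "numPartitions"   -> "10"
    )).load()

    // Stage the DataFrame as a temporary table, then create the ORC table
    // in the Hive database and populate it from the temp table.
    df.registerTempTable("tmp")
    hc.sql("CREATE TABLE IF NOT EXISTS oraclehadoop.dummy STORED AS ORC AS SELECT * FROM tmp")

    sc.stop()
  }
}

Build with Maven as described and submit with the Oracle JDBC driver on the classpath, e.g. spark-submit --jars ojdbc6.jar.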
+-----------------------------------------------------------------------------------------------+
| What object is causing the highest resource wait, from V$ACTIVE_SESSION_HISTORY and dba_objects |
+-----------------------------------------------------------------------------------------------+

Object Name                    Type       Event                                              Total Wait Time/ms
------------------------------ ---------- -------------------------------------------------- ------------------
DUMMY                          TABLE                                                                           3
DUMMY                          TABLE      direct path read                                                   56

Well, it is a billion-row table loaded from the DataFrame into a temp table. The code then creates the ORC table in the Hive database and populates it from the temp table. See how it goes.

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 30 April 2016 at 15:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Yes, I was thinking of that: use Spark to load JDBC data from Oracle and
> flush it into an ORC table in Hive.
>
> Now I am using Spark 1.6.1 and, as I recall, the JDBC driver is throwing
> an error (I raised a thread for it).
>
> This was working under Spark 1.5.2.
>
> Cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 30 April 2016 at 15:20, Marcin Tustin <mtus...@handybook.com> wrote:
>
>> No, the execution engines are not in general interchangeable. The Hive
>> project uses an abstraction layer to be able to plug in different
>> execution engines. I don't know whether Sqoop uses Hive code, or an old
>> version of it, or something else.
>>
>> As with many things in the Hadoop world, if you want to know whether
>> there is something undocumented, your best bet is to look at the source
>> code.
>>
>> My suggestion would be to (1) make sure you're executing somewhere close
>> to the data, i.e. on NodeManagers colocated with DataNodes; (2) profile to
>> make sure the slowness really is where you think it is; and (3) if you
>> really can't get the speed you need, try writing a small Spark job to do
>> the export. Newer versions of Spark seem faster.
>>
>> On Sat, Apr 30, 2016 at 10:05 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Marcin,
>>>
>>> It is the speed really - the speed with which data is digested into
>>> Hive.
>>>
>>> Sqoop is two-stage, as I understand it:
>>>
>>>    1. Take the data out of the RDBMS via JDBC and put it in an
>>>    external HDFS file
>>>    2. Read that file and insert it into a Hive table
>>>
>>> The issue is the second part. In general I use Hive 2 with the Spark
>>> 1.3.1 engine to put data into Hive tables. I wondered if there was such a
>>> parameter in Sqoop to use the Spark engine. Well, I gather this is easier
>>> said than done. I am importing a 1-billion-row table from Oracle:
>>>
>>> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" \
>>>     --username scratchpad -P \
>>>     --query "select * from scratchpad.dummy where \$CONDITIONS" \
>>>     --split-by ID \
>>>     --hive-import --hive-table "oraclehadoop.dummy" --target-dir "dummy"
>>>
>>> Now, the fact that in hive-site.xml I have set
>>> hive.execution.engine=spark does not matter: Sqoop seems to set
>>> hive.execution.engine=mr internally anyway.
>>>
>>> Maybe there should be an option --hive-execution-engine='mr/tez/spark'
>>> etc. in the above command?
>>>
>>> Cheers,
>>>
>>> Mich
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> On 30 April 2016 at 14:51, Marcin Tustin <mtus...@handybook.com> wrote:
>>>
>>>> They're not simply interchangeable: Sqoop is written to use MapReduce.
>>>>
>>>> I actually implemented my own replacement for sqoop-export in Spark,
>>>> which was extremely simple (a sketch of this appears at the end of the
>>>> thread). It wasn't any faster, because the bottleneck was the receiving
>>>> database.
>>>>
>>>> Is your motivation here speed? Or correctness?
>>>>
>>>> On Sat, Apr 30, 2016 at 8:45 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What is the simplest way of making sqoop import use the Spark engine,
>>>>> as opposed to the default MapReduce, when putting data into a Hive
>>>>> table? I did not see any parameter for this in the sqoop command-line
>>>>> documentation.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
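For reference, the Spark replacement for sqoop-export that Marcin mentions above can indeed be very small. This is a sketch only, not his code; the export target table, credentials and connection string below are illustrative placeholders reusing the Oracle URL from this thread:

// Sketch of a Spark stand-in for sqoop-export: write an existing
// DataFrame (df) out to an RDBMS table over JDBC. Assumes Spark 1.6
// and that df's schema matches the target table.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "scratchpad")   // placeholder credentials
props.setProperty("password", "xxxx")

df.write
  .mode("append")   // append rows rather than replace the table
  .jdbc("jdbc:oracle:thin:@rhes564:1521:mydb12", "scratchpad.dummy_export", props)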