I am not 100% sure, but you could export to CSV in Oracle using external tables.
Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.

> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yes, I meant MR.
>
> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any specific text-based export, only binary ones (Exp, Data Pump, etc.).
>
> In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy), which can be parallelised through either range partitioning or simple round-robin partitioning to get data out to files in parallel. One can then get the data into a Hive table through an import, etc.
>
> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod then there is likely a GoldenGate there already. For bulk loading of Hive tables and data migration, Replication Server is a good option.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>> Sqoop doesn’t use MapR… unless you meant to say M/R (MapReduce).
>>
>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may not be equally distributed across the ranges.
>>
>> If you’re bringing over the entire table, you may find dumping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but also stresses the database servers less.)
>>
>> Again, YMMV.
>>
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Well, unless you have plenty of memory, you are going to have certain issues with Spark.
>>>
>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>>
>>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table; either way it will import the data into the Hive table.
>>>
>>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>>
>>> There is of course another alternative.
>>>
>>> Assume that your Oracle table has a primary key, say "ID" (it would be easier if it was a monotonically increasing number), or is already partitioned.
>>>
>>> You can create views based on ranges of ID, or one view per partition. You can then SELECT columns col1, col2, ..., coln from each view and spool them to a text file on the OS (locally, say a backup directory, would be fastest).
>>> bzip2 those files and scp them to a local directory on the Hadoop side.
>>> You can then use Spark/Hive to load the target table from those files in parallel.
>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
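For the final step of the approach above, a minimal sketch of loading the compressed extracts into an existing Hive table with Spark (Scala, 2.x-style API) could look like the following; the staging path, column names and target table are assumptions made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object LoadSpooledExtracts {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-oracle-extracts")
          .enableHiveSupport()          // so insertInto targets the Hive metastore
          .getOrCreate()

        // Everything was exported as text (TO_CHAR / CAST ... AS VARCHAR2 above),
        // so an all-string schema is the safest starting point; cast later if needed.
        val schema = StructType(Seq(
          StructField("id",   StringType),
          StructField("col1", StringType),
          StructField("col2", StringType)
        ))

        // Spark reads .bz2 files transparently; the path is made up and could be
        // hdfs:// or file:// as long as it is visible to all executors.
        val df = spark.read
          .option("sep", ",")
          .schema(schema)
          .csv("hdfs:///staging/oracle_extracts/*.bz2")

        // Append into an existing Hive table with matching column order.
        df.write.mode("append").insertInto("mydb.target_table")

        spark.stop()
      }
    }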
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Some metrics thrown around in the discussion:
>>>>
>>>> SQOOP: extract 500 million rows (in a single thread): 20 mins (data size 21 GB)
>>>> SPARK: load the data into memory: 15 mins
>>>> SPARK: use JDBC (where, as with SQOOP, parallelization is difficult) to load 500 million records: manually killed after 8 hours.
>>>>
>>>> (Both of the above tests were done on a system of the same capacity, with 32 GB RAM, dual hexacore Xeon processors and an SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted the data to the local file system.)
>>>>
>>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK local server mode, please let me know.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> Good to know that.
>>>>>
>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature.
>>>>>
>>>>> But for MPP, I still think it makes sense for the vendor to provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>
>>>>> Yong
>>>>>
>>>>> On Wed, 6 Apr 2016 at 16:12, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>>
>>>>> It is using the JDBC driver; I know that's the case for Teradata: http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>
>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between the data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it.
>>>>> I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>>
>>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>
>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> If they do that, they must provide a customized input format, instead of going through JDBC.
>>>>>
>>>>> Yong
>>>>>
>>>>> On Wed, 6 Apr 2016 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> SAP Sybase IQ does that and I believe SAP HANA as well.
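To make Michael's range-query point and the heap-space / 8-hour JDBC runs mentioned above concrete, a partitioned JDBC read in Spark (Scala, 2.x-style API) might look roughly like this; the URL, table, ID bounds and partition count are assumptions for illustration, and in practice the bounds would come from a quick SELECT MIN(ID), MAX(ID) against the source:

    import org.apache.spark.sql.SparkSession

    object PartitionedJdbcRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("oracle-partitioned-read")
          .getOrCreate()

        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "MYSCHEMA.BIG_TABLE")
          .option("user", "scott")
          .option("password", "tiger")
          // These four options are what spreads the read across many tasks;
          // without them the whole table comes through one connection, which is
          // where the heap-space errors and very long runs tend to come from.
          .option("partitionColumn", "ID")
          .option("lowerBound", "1")
          .option("upperBound", "1000000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")   // rows per round trip on each connection
          .load()

        // Land the data on HDFS first rather than holding it all in memory.
        df.write.mode("overwrite").parquet("hdfs:///staging/big_table")

        spark.stop()
      }
    }

Each partition still competes for the same Oracle resources, so whether this beats BCP-style exports or Sqoop depends on how evenly ID is distributed, as noted above.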
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop.
>>>>> I would guess (though I have no idea) that someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>>
>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database.
>>>>> And if your job fails for whatever reason (e.g. scheduling) then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" about parallelism have to be made with any solution anyway.
>>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited.
>>>>>
>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>
>>>>> I don’t think it’s necessarily a bad idea.
>>>>>
>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>>
>>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>
>>>>> I’d say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean Wampler. ;-)
>>>>>
>>>>> Just saying. ;-)
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You would have a lot of development effort to do something like Sqoop, especially the error handling.
>>>>> You could create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there.
>>>>> Alternatively, if your cluster is loaded, you may want to add more machines or improve the existing programs.
>>>>>
>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>> One of the reasons in my mind is to avoid a MapReduce application completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>
>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Why do you want to reimplement something which is already there?
>>>>>
>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Ayan
>>>>>
>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> It depends on your use case for Sqoop. What's it like?
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>> Hi All,
>>>>>
>>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing project done along similar lines?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
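For the ~40-table incremental use case described above, a config-driven sketch in Spark (Scala, 2.x-style API) might look like the following; the table list, watermark column, connection details and watermark bookkeeping are all assumptions made up for illustration, not anything prescribed in the thread:

    import org.apache.spark.sql.SparkSession

    // One entry per source table; in a real job this would come from a config file.
    case class TableJob(sourceTable: String, watermarkCol: String, hiveTable: String)

    object IncrementalIngest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("oracle-incremental-ingest")
          .enableHiveSupport()
          .getOrCreate()

        val jobs = Seq(
          TableJob("MYSCHEMA.ORDERS",    "LAST_UPDATED", "mydb.orders"),
          TableJob("MYSCHEMA.CUSTOMERS", "LAST_UPDATED", "mydb.customers")
        )

        jobs.foreach { job =>
          // The last successful watermark per table would normally be persisted
          // somewhere (a small Hive table, a file); hard-coded to keep the sketch short.
          val lastWatermark = "2016-04-01 00:00:00"

          // Pushing the predicate into the dbtable subquery keeps the incremental
          // filter on the Oracle side, so only new rows travel over JDBC.
          val incrementalQuery =
            s"(SELECT * FROM ${job.sourceTable} " +
            s"WHERE ${job.watermarkCol} > TO_TIMESTAMP('$lastWatermark', 'YYYY-MM-DD HH24:MI:SS')) t"

          val df = spark.read
            .format("jdbc")
            .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
            .option("dbtable", incrementalQuery)
            .option("user", "scott")
            .option("password", "tiger")
            .load()

          // Append only the new rows into the existing Hive table.
          df.write.mode("append").insertInto(job.hiveTable)
        }

        spark.stop()
      }
    }

Whether something like this is preferable to Sqoop's incremental mode comes down to the scheduling and cluster-load concerns discussed earlier in the thread.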