They wanted to have alternatives. I recommended the original approach of simply using Sqoop.
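For context on the alternative being weighed against Sqoop throughout the quoted thread below: a minimal sketch of a partitioned Spark JDBC read from Oracle into Hive, assuming Spark 1.x with a HiveContext and a table with a numeric primary key ID. The URL, credentials, table names, bounds and partition count are placeholders, not anything taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("OracleToHive"))
    val sqlContext = new HiveContext(sc)

    // Partitioned JDBC read: Spark issues numPartitions range queries on ID
    // instead of one huge query, which is what makes the copy parallel.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder URL
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable", "SCOTT.BIG_TABLE")                  // placeholder table
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "ID")                       // assumed numeric primary key
      .option("lowerBound", "1")
      .option("upperBound", "1000000000")                    // rough MIN/MAX of ID
      .option("numPartitions", "16")                         // 16 concurrent range queries
      .load()

    // Write into the Hive metastore; the table is created if it does not already exist.
    df.write.mode("append").saveAsTable("staging.big_table")

Whether this beats Sqoop depends, as the posters below note, on picking a partition column whose values are reasonably evenly distributed between the bounds.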
> On 14 Apr 2016, at 16:09, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 threads; that is 246 GB of data.
>
> Why is the discussion about using anything other than SQOOP still so wonderfully on?
>
> Regards,
> Gourav
>
>> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV file on the database server which needs to be moved to HDFS.
>>
>>> On 11 Apr 2016, at 17:50, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>
>>> Depending on the Oracle release…
>>>
>>> You could use WebHDFS to gain access to the cluster and see the CSV file as an external table.
>>>
>>> However, you would need an application that reads each block of the file in parallel. This works for loading into the RDBMS itself. Actually, you could use Sqoop in reverse to push data to the RDBMS, provided that the block file is splittable. This is a classic M/R problem.
>>>
>>> But I don't think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table's underlying file and read directly from it, you could do a very simple bulk load/unload process. However, you need to know the file's format.
>>>
>>> Not sure what IBM or Oracle has done to tie their RDBMSs to Big Data.
>>>
>>> As I and other posters to this thread have alluded to, this would be a block bulk load/unload tool.
>>>
>>>> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>>>>
>>>> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>>>>
>>>>> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Yes, I meant MR.
>>>>>
>>>>> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any text-based export, only the binary ones (exp, Data Pump, etc.).
>>>>>
>>>>> In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which can be parallelised either through range partitioning or simple round-robin partitioning to get data out to files in parallel. Once the data is out, you get it into the Hive table through an import etc.
>>>>>
>>>>> In general, if the source table is very large, you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod, then there is likely a GoldenGate there already. For bulk seeding of Hive tables and data migration, replication server is a good option.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>> Sqoop doesn't use MapR… unless you meant to say M/R (MapReduce).
>>>>>>
>>>>>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to issue multiple range queries. This may not be known, or your data may not be equally distributed across the ranges.
>>>>>>
>>>>>> If you're bringing over the entire table, you may find dropping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but it also stresses the database servers less.)
>>>>>>
>>>>>> Again, YMMV.
>>>>>>
>>>>>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Well, unless you have plenty of memory, you are going to have certain issues with Spark.
>>>>>>>
>>>>>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>>>>>>
>>>>>>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table. However, it will import the data into the Hive table.
>>>>>>>
>>>>>>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>>>>>>
>>>>>>> There is of course another alternative.
>>>>>>>
>>>>>>> Assume that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned. Then:
>>>>>>>
>>>>>>> You can create views based on ranges of ID, or one per partition.
>>>>>>> You can then SELECT col1, col2, ..., coln from each view and spool it to a text file on the OS (locally, say a backup directory, would be fastest).
>>>>>>> bzip2 those files and scp them to a local directory in Hadoop.
>>>>>>> You can then use Spark/Hive to load the target table from the local files in parallel.
>>>>>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Some metrics thrown around the discussion:
>>>>>>>>
>>>>>>>> SQOOP: extract 500 million rows (in a single thread): 20 mins (data size 21 GB)
>>>>>>>> SPARK: load the data into memory: 15 mins
>>>>>>>> SPARK: use JDBC (where parallelization is as difficult as with SQOOP) to load 500 million records: manually killed after 8 hours.
>>>>>>>>
>>>>>>>> (Both of the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors and SSD.
>>>>>>>> SPARK was running locally, and SQOOP ran on HADOOP2 and extracted the data to the local file system.)
>>>>>>>>
>>>>>>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Gourav Sengupta
>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>>> Good to know that.
>>>>>>>>>
>>>>>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature.
>>>>>>>>>
>>>>>>>>> But for MPP, I still think it makes sense that vendors provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>>>>>
>>>>>>>>> Yong
>>>>>>>>>
>>>>>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mohaj...@gmail.com
>>>>>>>>> To: java8...@hotmail.com
>>>>>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>>>
>>>>>>>>> It is using the JDBC driver; I know that is the case for Teradata:
>>>>>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>>>>>
>>>>>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it. I know Oracle has a similar technology; I have used it and had to supply the JDBC driver.
>>>>>>>>>
>>>>>>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>>> If they do that, they must provide a customized InputFormat, instead of going through JDBC.
>>>>>>>>>
>>>>>>>>> Yong
>>>>>>>>>
>>>>>>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mich.talebza...@gmail.com
>>>>>>>>> To: mohaj...@gmail.com
>>>>>>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>>>
>>>>>>>>> SAP Sybase IQ does that, and I believe SAP HANA as well.
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> LinkedIn
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop. I would guess (I have no idea) that someone like IBM is already doing that for Spark; maybe a bit off topic!
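Gourav's offer above about the CSV and JDBC modules in SPARK Local Server mode presumably refers to the external spark-csv package plus a JDBC driver jar. A minimal sketch of what that typically looked like on Spark 1.x in local mode; the package version, driver jar path and file path are assumptions, and sqlContext is the one provided by spark-shell.

    // Assumes spark-shell was started along the lines of:
    //   spark-shell --master "local[*]" \
    //     --packages com.databricks:spark-csv_2.10:1.4.0 \
    //     --jars /path/to/ojdbc7.jar
    // (package version and driver jar path are assumptions)

    // CSV module: read a local extract, e.g. files spooled out of Oracle or produced by SQOOP.
    val csvDf = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file:///data/extracts/big_table/*.csv") // placeholder path

    // Register it so it can be queried (or inserted into a Hive table) with SQL.
    csvDf.registerTempTable("big_table_csv")
    sqlContext.sql("SELECT COUNT(*) FROM big_table_csv").show()

    // The JDBC module is used the same way with .format("jdbc"),
    // as in the partitioned-read sketch near the top of this thread.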
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if your job fails for whatever reason (e.g. scheduling), then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" about parallelism have to be made with any solution anyway.
>>>>>>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway; otherwise people doing analytics will be limited.
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I don't think it's necessarily a bad idea.
>>>>>>>>>
>>>>>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>>>>>>
>>>>>>>>> Depending on what you want to do, your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>>>>>
>>>>>>>>> I'd say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice 'data lake' and not a 'data sewer'. (Stealing that term from Dean Wampler. ;-)
>>>>>>>>>
>>>>>>>>> Just saying. ;-)
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You would have a lot of development effort to build something like Sqoop, especially with error handling. You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there. If your cluster is loaded, then you may want to add more machines or improve the existing programs.
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> One of the reasons in my mind is to avoid MapReduce applications completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Why do you want to reimplement something which is already there?
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Ayan
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It depends on your use case using sqoop.
>>>>>>>>> What's it like?
>>>>>>>>>
>>>>>>>>> // maropu
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> Asking opinion: is it possible/advisable to use spark to replace what sqoop does? Any existing project done in similar lines?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ---
>>>>>>>>> Takeshi Yamamuro
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
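To close the loop on the use case that starts the thread (incremental pulls from ~40 Oracle tables appended to existing Hive tables, driven by job-specific configuration), here is a rough sketch against the Spark 1.x JDBC data source. It is only an illustration: the job list, key columns, watermark values and connection details are invented, and a real job would persist the high-water mark after each run.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Hypothetical per-table job configuration; every name and value here is illustrative only.
    case class TableJob(oracleTable: String, hiveTable: String, keyColumn: String, lastValue: Long)

    val jobs = Seq(
      TableJob("SCOTT.ORDERS",    "staging.orders",    "ORDER_ID", 1000000L),
      TableJob("SCOTT.CUSTOMERS", "staging.customers", "CUST_ID",   250000L)
    )

    val sc = new SparkContext(new SparkConf().setAppName("OracleIncrementalIngest"))
    val sqlContext = new HiveContext(sc)

    jobs.foreach { job =>
      // Push the incremental filter down to Oracle so only new rows cross the wire;
      // an indexed key column keeps the query cheap on the Oracle side.
      val incremental = s"(SELECT * FROM ${job.oracleTable} WHERE ${job.keyColumn} > ${job.lastValue}) t"

      val df = sqlContext.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", incremental)
        .option("user", "scott")
        .option("password", "tiger")
        .load()

      // Append into the existing Hive table; use saveAsTable instead if the job
      // configuration asks for the table to be created.
      df.write.mode("append").insertInto(job.hiveTable)
    }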