They wanted to have alternatives. I recommended the original approach of simply using Sqoop.
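For context on the alternative being weighed against Sqoop throughout the quoted thread below: a minimal sketch of a partitioned Spark JDBC read from Oracle into Hive, assuming Spark 1.x with a HiveContext and a table with a numeric primary key ID. The URL, credentials, table names, bounds and partition count are placeholders, not anything taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("OracleToHive"))
    val sqlContext = new HiveContext(sc)

    // Partitioned JDBC read: Spark issues numPartitions range queries on ID
    // instead of one huge query, which is what makes the copy parallel.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder URL
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("dbtable", "SCOTT.BIG_TABLE")                  // placeholder table
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "ID")                       // assumed numeric primary key
      .option("lowerBound", "1")
      .option("upperBound", "1000000000")                    // rough MIN/MAX of ID
      .option("numPartitions", "16")                         // 16 concurrent range queries
      .load()

    // Write into the Hive metastore; the table is created if it does not already exist.
    df.write.mode("append").saveAsTable("staging.big_table")

Whether this beats Sqoop depends, as the posters below note, on picking a partition column whose values are reasonably evenly distributed between the bounds.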
> On 14 Apr 2016, at 16:09, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 threads; that is 246 GB of data.
>
> Why is the discussion about using anything other than SQOOP still so wonderfully on?
>
> Regards,
> Gourav
>
>> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV file on the database server which needs to be moved to HDFS.
>>
>>> On 11 Apr 2016, at 17:50, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>
>>> Depending on the Oracle release…
>>>
>>> You could use WebHDFS to gain access to the cluster and see the CSV file as an external table.
>>>
>>> However, you would need an application that reads each block of the file in parallel. This works for loading into the RDBMS itself. Actually, you could use Sqoop in reverse to push data to the RDBMS, provided that the block file is splittable. This is a classic M/R problem.
>>>
>>> But I don't think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table's underlying file and read directly from it, you could do a very simple bulk load/unload process. However, you need to know the file's format.
>>>
>>> Not sure what IBM or Oracle has done to tie their RDBMSs to Big Data.
>>>
>>> As I and other posters to this thread have alluded to, this would be a block bulk load/unload tool.
>>>
>>>> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>>>>
>>>> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>>>>
>>>>> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Yes, I meant MR.
>>>>>
>>>>> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any text-based export, only the binary ones (exp, Data Pump, etc.).
>>>>>
>>>>> In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which can be parallelised either through range partitioning or simple round-robin partitioning to get data out to files in parallel. Once the data is out, you get it into the Hive table through an import etc.
>>>>>
>>>>> In general, if the source table is very large, you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod, then there is likely a GoldenGate there already. For bulk seeding of Hive tables and data migration, replication server is a good option.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>> Sqoop doesn't use MapR… unless you meant to say M/R (MapReduce).
>>>>>>
>>>>>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to issue multiple range queries. This may not be known, or your data may not be equally distributed across the ranges.
>>>>>>
>>>>>> If you're bringing over the entire table, you may find dropping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but it also stresses the database servers less.)
>>>>>>
>>>>>> Again, YMMV.
>>>>>>
>>>>>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Well, unless you have plenty of memory, you are going to have certain issues with Spark.
>>>>>>>
>>>>>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>>>>>>
>>>>>>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table. However, it will import the data into the Hive table.
>>>>>>>
>>>>>>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>>>>>>
>>>>>>> There is of course another alternative.
>>>>>>>
>>>>>>> Assume that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned. Then:
>>>>>>>
>>>>>>> You can create views based on ranges of ID, or one per partition.
>>>>>>> You can then SELECT col1, col2, ..., coln from each view and spool it to a text file on the OS (locally, say a backup directory, would be fastest).
>>>>>>> bzip2 those files and scp them to a local directory in Hadoop.
>>>>>>> You can then use Spark/Hive to load the target table from the local files in parallel.
>>>>>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Some metrics thrown around the discussion:
>>>>>>>>
>>>>>>>> SQOOP: extract 500 million rows (in a single thread): 20 mins (data size 21 GB)
>>>>>>>> SPARK: load the data into memory: 15 mins
>>>>>>>> SPARK: use JDBC (where parallelization is as difficult as with SQOOP) to load 500 million records: manually killed after 8 hours.
>>>>>>>>
>>>>>>>> (Both of the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors and SSD.
>>>>>>>> SPARK was running locally, and SQOOP ran on HADOOP2 and extracted the data to the local file system.)
>>>>>>>>
>>>>>>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Gourav Sengupta
>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>>> Good to know that.
>>>>>>>>>
>>>>>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature.
>>>>>>>>>
>>>>>>>>> But for MPP, I still think it makes sense that vendors provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>>>>>
>>>>>>>>> Yong
>>>>>>>>>
>>>>>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mohaj...@gmail.com
>>>>>>>>> To: java8...@hotmail.com
>>>>>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>>>
>>>>>>>>> It is using the JDBC driver; I know that is the case for Teradata:
>>>>>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>>>>>
>>>>>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it. I know Oracle has a similar technology; I have used it and had to supply the JDBC driver.
>>>>>>>>>
>>>>>>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>>> If they do that, they must provide a customized InputFormat, instead of going through JDBC.
>>>>>>>>>
>>>>>>>>> Yong
>>>>>>>>>
>>>>>>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mich.talebza...@gmail.com
>>>>>>>>> To: mohaj...@gmail.com
>>>>>>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>>>
>>>>>>>>> SAP Sybase IQ does that, and I believe SAP HANA as well.
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> LinkedIn
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop. I would guess (I have no idea) that someone like IBM is already doing that for Spark; maybe a bit off topic!
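Gourav's offer above about the CSV and JDBC modules in SPARK Local Server mode presumably refers to the external spark-csv package plus a JDBC driver jar. A minimal sketch of what that typically looked like on Spark 1.x in local mode; the package version, driver jar path and file path are assumptions, and sqlContext is the one provided by spark-shell.

    // Assumes spark-shell was started along the lines of:
    //   spark-shell --master "local[*]" \
    //     --packages com.databricks:spark-csv_2.10:1.4.0 \
    //     --jars /path/to/ojdbc7.jar
    // (package version and driver jar path are assumptions)

    // CSV module: read a local extract, e.g. files spooled out of Oracle or produced by SQOOP.
    val csvDf = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file:///data/extracts/big_table/*.csv") // placeholder path

    // Register it so it can be queried (or inserted into a Hive table) with SQL.
    csvDf.registerTempTable("big_table_csv")
    sqlContext.sql("SELECT COUNT(*) FROM big_table_csv").show()

    // The JDBC module is used the same way with .format("jdbc"),
    // as in the partitioned-read sketch near the top of this thread.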
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. And if your job fails for whatever reason (e.g. scheduling), then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" about parallelism have to be made with any solution anyway.
>>>>>>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway; otherwise people doing analytics will be limited.
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I don't think it's necessarily a bad idea.
>>>>>>>>>
>>>>>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>>>>>>
>>>>>>>>> Depending on what you want to do, your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>>>>>
>>>>>>>>> I'd say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice 'data lake' and not a 'data sewer'. (Stealing that term from Dean Wampler. ;-)
>>>>>>>>>
>>>>>>>>> Just saying. ;-)
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You would have a lot of development effort to build something like Sqoop, especially with error handling. You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there. If your cluster is loaded, then you may want to add more machines or improve the existing programs.
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> One of the reasons in my mind is to avoid MapReduce applications completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Why do you want to reimplement something which is already there?
>>>>>>>>>
>>>>>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Ayan
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It depends on your use case using sqoop.
>>>>>>>>> What's it like?
>>>>>>>>>
>>>>>>>>> // maropu
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> Asking opinion: is it possible/advisable to use spark to replace what sqoop does? Any existing project done in similar lines?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ---
>>>>>>>>> Takeshi Yamamuro
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
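To close the loop on the use case that starts the thread (incremental pulls from ~40 Oracle tables appended to existing Hive tables, driven by job-specific configuration), here is a rough sketch against the Spark 1.x JDBC data source. It is only an illustration: the job list, key columns, watermark values and connection details are invented, and a real job would persist the high-water mark after each run.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Hypothetical per-table job configuration; every name and value here is illustrative only.
    case class TableJob(oracleTable: String, hiveTable: String, keyColumn: String, lastValue: Long)

    val jobs = Seq(
      TableJob("SCOTT.ORDERS",    "staging.orders",    "ORDER_ID", 1000000L),
      TableJob("SCOTT.CUSTOMERS", "staging.customers", "CUST_ID",   250000L)
    )

    val sc = new SparkContext(new SparkConf().setAppName("OracleIncrementalIngest"))
    val sqlContext = new HiveContext(sc)

    jobs.foreach { job =>
      // Push the incremental filter down to Oracle so only new rows cross the wire;
      // an indexed key column keeps the query cheap on the Oracle side.
      val incremental = s"(SELECT * FROM ${job.oracleTable} WHERE ${job.keyColumn} > ${job.lastValue}) t"

      val df = sqlContext.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", incremental)
        .option("user", "scott")
        .option("password", "tiger")
        .load()

      // Append into the existing Hive table; use saveAsTable instead if the job
      // configuration asks for the table to be created.
      df.write.mode("append").insertInto(job.hiveTable)
    }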