I am not 100% sure, but you could export to CSV in Oracle using external tables.
Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.

> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yes, I meant MR.
>
> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any specific text-based export, only binary ones (Exp, Data Pump, etc.).
>
> In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy), which can be parallelised through either range partitioning or simple round-robin partitioning to get data out to files in parallel. One can then get the data into a Hive table through an import, etc.
>
> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod then there is likely a GoldenGate there already. For bulk loading of Hive tables and data migration, Replication Server is a good option.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>> Sqoop doesn’t use MapR… unless you meant to say M/R (MapReduce).
>>
>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may not be equally distributed across the ranges.
>>
>> If you’re bringing over the entire table, you may find dumping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but also stresses the database servers less.)
>>
>> Again, YMMV.
>>
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Well, unless you have plenty of memory, you are going to have certain issues with Spark.
>>>
>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>>
>>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table; either way it will import the data into the Hive table.
>>>
>>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>>
>>> There is of course another alternative.
>>>
>>> Assume that your Oracle table has a primary key, say "ID" (it would be easier if it was a monotonically increasing number), or is already partitioned.
>>>
>>> You can create views based on ranges of ID, or one view per partition. You can then SELECT columns col1, col2, ..., coln from each view and spool them to a text file on the OS (locally, say a backup directory, would be fastest).
>>> bzip2 those files and scp them to a local directory on the Hadoop side.
>>> You can then use Spark/Hive to load the target table from those files in parallel.
>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
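For the final step of the approach above, a minimal sketch of loading the compressed extracts into an existing Hive table with Spark (Scala, 2.x-style API) could look like the following; the staging path, column names and target table are assumptions made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object LoadSpooledExtracts {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-oracle-extracts")
          .enableHiveSupport()          // so insertInto targets the Hive metastore
          .getOrCreate()

        // Everything was exported as text (TO_CHAR / CAST ... AS VARCHAR2 above),
        // so an all-string schema is the safest starting point; cast later if needed.
        val schema = StructType(Seq(
          StructField("id",   StringType),
          StructField("col1", StringType),
          StructField("col2", StringType)
        ))

        // Spark reads .bz2 files transparently; the path is made up and could be
        // hdfs:// or file:// as long as it is visible to all executors.
        val df = spark.read
          .option("sep", ",")
          .schema(schema)
          .csv("hdfs:///staging/oracle_extracts/*.bz2")

        // Append into an existing Hive table with matching column order.
        df.write.mode("append").insertInto("mydb.target_table")

        spark.stop()
      }
    }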
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Some metrics thrown around in the discussion:
>>>>
>>>> SQOOP: extract 500 million rows (in a single thread): 20 mins (data size 21 GB)
>>>> SPARK: load the data into memory: 15 mins
>>>> SPARK: use JDBC (where, as with SQOOP, parallelization is difficult) to load 500 million records: manually killed after 8 hours.
>>>>
>>>> (Both of the above tests were done on a system of the same capacity, with 32 GB RAM, dual hexacore Xeon processors and an SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted the data to the local file system.)
>>>>
>>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK local server mode, please let me know.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> Good to know that.
>>>>>
>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature.
>>>>>
>>>>> But for MPP, I still think it makes sense for the vendor to provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>
>>>>> Yong
>>>>>
>>>>> On Wed, 6 Apr 2016 at 16:12, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>>
>>>>> It is using the JDBC driver; I know that's the case for Teradata: http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>
>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between the data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it.
>>>>> I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>>
>>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>
>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> If they do that, they must provide a customized input format, instead of going through JDBC.
>>>>>
>>>>> Yong
>>>>>
>>>>> On Wed, 6 Apr 2016 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> SAP Sybase IQ does that and I believe SAP HANA as well.
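To make Michael's range-query point and the heap-space / 8-hour JDBC runs mentioned above concrete, a partitioned JDBC read in Spark (Scala, 2.x-style API) might look roughly like this; the URL, table, ID bounds and partition count are assumptions for illustration, and in practice the bounds would come from a quick SELECT MIN(ID), MAX(ID) against the source:

    import org.apache.spark.sql.SparkSession

    object PartitionedJdbcRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("oracle-partitioned-read")
          .getOrCreate()

        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "MYSCHEMA.BIG_TABLE")
          .option("user", "scott")
          .option("password", "tiger")
          // These four options are what spreads the read across many tasks;
          // without them the whole table comes through one connection, which is
          // where the heap-space errors and very long runs tend to come from.
          .option("partitionColumn", "ID")
          .option("lowerBound", "1")
          .option("upperBound", "1000000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")   // rows per round trip on each connection
          .load()

        // Land the data on HDFS first rather than holding it all in memory.
        df.write.mode("overwrite").parquet("hdfs:///staging/big_table")

        spark.stop()
      }
    }

Each partition still competes for the same Oracle resources, so whether this beats BCP-style exports or Sqoop depends on how evenly ID is distributed, as noted above.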
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop.
>>>>> I would guess (though I have no idea) that someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>>
>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Well, I am not sure, but using a database as storage for Spark, such as relational databases or certain NoSQL databases (e.g. MongoDB), is generally a bad idea: no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database.
>>>>> And if your job fails for whatever reason (e.g. scheduling) then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" about parallelism have to be made with any solution anyway.
>>>>> Of course you can always redo things, but why? What benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited.
>>>>>
>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>
>>>>> I don’t think it’s necessarily a bad idea.
>>>>>
>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…)
>>>>>
>>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>
>>>>> I’d say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean Wampler. ;-)
>>>>>
>>>>> Just saying. ;-)
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You would have a lot of development effort to do something like Sqoop, especially the error handling.
>>>>> You could create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there.
>>>>> Alternatively, if your cluster is loaded, you may want to add more machines or improve the existing programs.
>>>>>
>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>> One of the reasons in my mind is to avoid a MapReduce application completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>
>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Why do you want to reimplement something which is already there?
>>>>>
>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Ayan
>>>>>
>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> It depends on your use case for Sqoop. What's it like?
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>> Hi All,
>>>>>
>>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing project done along similar lines?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
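For the ~40-table incremental use case described above, a config-driven sketch in Spark (Scala, 2.x-style API) might look like the following; the table list, watermark column, connection details and watermark bookkeeping are all assumptions made up for illustration, not anything prescribed in the thread:

    import org.apache.spark.sql.SparkSession

    // One entry per source table; in a real job this would come from a config file.
    case class TableJob(sourceTable: String, watermarkCol: String, hiveTable: String)

    object IncrementalIngest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("oracle-incremental-ingest")
          .enableHiveSupport()
          .getOrCreate()

        val jobs = Seq(
          TableJob("MYSCHEMA.ORDERS",    "LAST_UPDATED", "mydb.orders"),
          TableJob("MYSCHEMA.CUSTOMERS", "LAST_UPDATED", "mydb.customers")
        )

        jobs.foreach { job =>
          // The last successful watermark per table would normally be persisted
          // somewhere (a small Hive table, a file); hard-coded to keep the sketch short.
          val lastWatermark = "2016-04-01 00:00:00"

          // Pushing the predicate into the dbtable subquery keeps the incremental
          // filter on the Oracle side, so only new rows travel over JDBC.
          val incrementalQuery =
            s"(SELECT * FROM ${job.sourceTable} " +
            s"WHERE ${job.watermarkCol} > TO_TIMESTAMP('$lastWatermark', 'YYYY-MM-DD HH24:MI:SS')) t"

          val df = spark.read
            .format("jdbc")
            .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
            .option("dbtable", incrementalQuery)
            .option("user", "scott")
            .option("password", "tiger")
            .load()

          // Append only the new rows into the existing Hive table.
          df.write.mode("append").insertInto(job.hiveTable)
        }

        spark.stop()
      }
    }

Whether something like this is preferable to Sqoop's incremental mode comes down to the scheduling and cluster-load concerns discussed earlier in the thread.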