Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV file on the database server which needs to be moved to HDFS.
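Once the external-table INSERT has written the flat file, moving it into HDFS is just a copy: hdfs dfs -put from the shell on an edge node, or the Hadoop FileSystem API from code. A minimal sketch of the latter, with both paths as assumed placeholders, run from a node that can see both the export directory and the cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CsvToHdfs {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from the core-site.xml on the classpath.
    val fs = FileSystem.get(new Configuration())

    // Hypothetical placeholder paths: the source is the file written by the
    // external-table INSERT, the destination is an HDFS staging directory.
    val src = new Path("file:///u01/exports/orders.csv")
    val dst = new Path("/staging/oracle/orders/orders.csv")

    // delSrc = false keeps the file on the database server,
    // overwrite = true replaces any earlier copy in HDFS.
    fs.copyFromLocalFile(false, true, src, dst)
    fs.close()
  }
}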
> On 11 Apr 2016, at 17:50, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Depending on the Oracle release…
>
> You could use webHDFS to gain access to the cluster and see the CSV file as an external table.
>
> However, you would need to have an application that will read each block of the file in parallel. This works for loading into the RDBMS itself. Actually you could use sqoop in reverse to push data to the RDBMS provided that the block file is splittable. This is a classic M/R problem.
>
> But I don't think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table's underlying file and can read directly from it… you can do a very simple bulk load/unload process. However you need to know the file's format.
>
> Not sure what IBM or Oracle have done to tie their RDBMSs to Big Data.
>
> As I and other posters to this thread have alluded to… this would be a block bulk load/unload tool.
>
>> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>>
>> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>>
>>> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Yes I meant MR.
>>>
>>> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any text-based export, only the binary ones (exp, Data Pump, etc.).
>>>
>>> In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which can be parallelised either through range partitioning or simple round-robin partitioning to get the data out to files in parallel. One can then get the data into a Hive table through import etc.
>>>
>>> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in prod then there is likely a GoldenGate there. For the bulk setup of Hive tables and data migration, Replication Server is a good option.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>> On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>> Sqoop doesn't use MapR… unless you meant to say M/R (MapReduce).
>>>>
>>>> The largest problem with sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may or may not be equally distributed across the ranges.
>>>>
>>>> If you're bringing over the entire table, you may find dropping it, moving it to HDFS and then doing a bulk load to be more efficient. (This is less flexible than sqoop, but also stresses the database servers less.)
>>>>
>>>> Again, YMMV
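The range-query splitting Michael describes for sqoop is also available in Spark's JDBC source, and it is usually what keeps a JDBC pull from funnelling the whole table through one connection (the out-of-memory error described just below is typical of an unpartitioned read). A minimal sketch, assuming a numeric ID column with known bounds; the URL, credentials, table and bounds are placeholders, the Oracle JDBC driver jar has to be on the classpath, and in Spark 1.x the same options go through SQLContext/HiveContext rather than SparkSession:

import org.apache.spark.sql.SparkSession

object OracleRangeRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("oracle-range-read").getOrCreate()

    // One range query per partition over ID, so 32 tasks read in parallel and
    // no single executor has to hold the whole table in memory.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder
      .option("dbtable", "SCOTT.BIG_TABLE")                   // placeholder
      .option("user", "scott")                                // placeholder
      .option("password", "tiger")                            // placeholder
      .option("partitionColumn", "ID")     // numeric, roughly evenly distributed
      .option("lowerBound", "1")           // min(ID), must be known up front
      .option("upperBound", "1000000000")  // max(ID)
      .option("numPartitions", "32")
      .load()

    // Stage as Parquet on HDFS; a Hive saveAsTable would work just as well.
    df.write.mode("overwrite").parquet("/staging/oracle/big_table")
    spark.stop()
  }
}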
>>>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Well, unless you have plenty of memory you are going to have certain issues with Spark.
>>>>>
>>>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
>>>>>
>>>>> Sqoop uses MapR and does it in serial mode, which takes time, and you can also tell it to create the Hive table; it will then import the data into that Hive table.
>>>>>
>>>>> In any case the mechanism of data import is through JDBC; Spark uses memory and a DAG, whereas Sqoop relies on MapR.
>>>>>
>>>>> There is of course another alternative.
>>>>>
>>>>> Assuming that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned:
>>>>>
>>>>> You can create views based on ranges of ID or one per partition. You can then SELECT col1, col2, ..., coln from each view and spool it to a text file on the OS (locally, say the backup directory, would be fastest).
>>>>> bzip2 those files and scp them to a local directory in Hadoop.
>>>>> You can then use Spark/Hive to load the target table from the local files in parallel.
>>>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them, e.g. TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Some metrics thrown around the discussion:
>>>>>>
>>>>>> SQOOP: extract 500 million rows (in a single thread): 20 mins (data size 21 GB)
>>>>>> SPARK: load the data into memory (15 mins)
>>>>>> SPARK: use JDBC (and, similar to SQOOP, difficult parallelization) to load 500 million records - manually killed after 8 hours.
>>>>>>
>>>>>> (Both of the above tests were done on a system of the same capacity, with 32 GB RAM, dual hex-core Xeon processors and SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted data to the local file system.)
>>>>>>
>>>>>> In case anyone needs to know what needs to be done to access both the CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav Sengupta
>>>>>>
>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>> Good to know that.
>>>>>>>
>>>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific features.
>>>>>>>
>>>>>>> But for MPP, I still think it makes sense for the vendor to provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>>>
>>>>>>> Yong
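For the last step of the extract-and-load recipe Mich describes above (loading the bzip2-compressed spool files into the target table in parallel), a minimal Spark sketch; the HDFS path, schema handling and table name are assumed placeholders, and on Spark 1.x the CSV reader comes from the separate spark-csv package rather than being built in:

import org.apache.spark.sql.SparkSession

object LoadExtractsIntoHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-oracle-extracts")
      .enableHiveSupport()
      .getOrCreate()

    // bzip2 is a splittable codec, so the compressed extracts are decompressed
    // and parsed in parallel across the cluster.
    val df = spark.read
      .option("header", "false")
      .option("inferSchema", "true")  // or supply an explicit schema for the NUMBER/CHAR columns
      .csv("hdfs:///staging/oracle/big_table/*.bz2")  // placeholder path

    // Append into the pre-created target table; column order must match it.
    df.write.mode("append").insertInto("target_db.big_table")  // placeholder table
    spark.stop()
  }
}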
>>>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>> From: mohaj...@gmail.com
>>>>>>> To: java8...@hotmail.com
>>>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>
>>>>>>> It is using the JDBC driver; I know that's the case for Teradata:
>>>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>>>
>>>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it.
>>>>>>> I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>>>>
>>>>>>> The Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>> If they do that, they must provide a customized InputFormat, instead of going through JDBC.
>>>>>>>
>>>>>>> Yong
>>>>>>>
>>>>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>> From: mich.talebza...@gmail.com
>>>>>>> To: mohaj...@gmail.com
>>>>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; linguin....@gmail.com; user@spark.apache.org
>>>>>>>
>>>>>>> SAP Sybase IQ does that, and I believe SAP HANA does as well.
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP and Hadoop.
>>>>>>> I would guess (I have no idea) that someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Well, I am not sure, but using a database as storage, such as relational databases or certain NoSQL databases (e.g. MongoDB), for Spark is generally a bad idea - no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database.
>>>>>>> And if your job fails for whatever reason (e.g. scheduling) then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" on parallelism have to be made with any solution anyway.
>>>>>>> Of course you can always redo things, but why - what benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited.
>>>>>>>
>>>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>>>>>
>>>>>>> I don't think it's necessarily a bad idea.
>>>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of those assumptions aren't valid for most of the use cases…)
>>>>>>>
>>>>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>>>
>>>>>>> I'd say that spending time re-inventing the wheel can be a good thing. It would be a good idea for many to rethink their ingestion process so that they can have a nice 'data lake' and not a 'data sewer'. (Stealing that term from Dean Wampler. ;-)
>>>>>>>
>>>>>>> Just saying. ;-)
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>
>>>>>>> I do not think you can be more resource-efficient. In the end you have to store the data on HDFS anyway. You have a lot of development effort for doing something like sqoop, especially with error handling.
>>>>>>> You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there.
>>>>>>> If your cluster is loaded then you may want to add more machines or improve the existing programs.
>>>>>>>
>>>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>
>>>>>>> One of the reasons in my mind is to avoid a MapReduce application completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Why do you want to reimplement something which is already there?
>>>>>>>
>>>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables. Also, it would be good to have an option to create the Hive table, driven by job-specific configuration.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best
>>>>>>> Ayan
>>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> It depends on your use case for sqoop. What's it like?
>>>>>>>
>>>>>>> // maropu
>>>>>>>
>>>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>> Hi All
>>>>>>>
>>>>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing project done along similar lines?
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>> --
>>>>>>> ---
>>>>>>> Takeshi Yamamuro
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
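For the use case described in this thread (around 40 Oracle tables, indexed, incremental pulls only, appended to existing Hive tables), a per-table job might look roughly like the sketch below. Everything in it is an assumption for illustration: the connection details, the ID watermark column, the table names and where the high-water mark is kept would all come from job-specific configuration, and the Oracle JDBC driver jar has to be on the classpath.

import org.apache.spark.sql.SparkSession

object IncrementalOracleIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-incremental-ingest")
      .enableHiveSupport()
      .getOrCreate()

    // High-water mark from the previous run; in practice this would be read
    // from a small control table or passed in by the job configuration.
    val lastLoadedId = 123456789L

    // Pushing the filter into the dbtable subquery lets Oracle use the index
    // on ID and ship only the new rows; the alias lets the subquery stand in
    // wherever the JDBC source expects a table name.
    val increment = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder
      .option("dbtable", s"(SELECT * FROM SCOTT.ORDERS WHERE ID > $lastLoadedId) t")
      .option("user", "scott")      // placeholder
      .option("password", "tiger")  // placeholder
      .load()

    // Append the new rows to the existing Hive table; column order must match.
    increment.write.mode("append").insertInto("warehouse.orders")
    spark.stop()
  }
}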