One alternative could be the Oracle Loader for Hadoop and other Oracle products, 
but you have to invest some money and probably buy their Hadoop appliance, 
so you have to evaluate whether that makes sense (it can get expensive with 
large clusters, etc.).

Another alternative would be to get rid of Oracle altogether and use other 
databases.

However, could you elaborate a bit on your use case, the business logic, and 
the SLA requirements? Otherwise almost any recommendation fits, because the 
requirements you presented are very generic.

About getting rid of Hadoop - it depends! You will need some resource manager 
(YARN, Mesos, Kubernetes, etc.) and most likely also a distributed file system. 
Spark supports a wide range of file systems through the Hadoop APIs and does 
not need HDFS for persistence. You can use a local file system (i.e. any file 
system mounted on a node, including distributed ones such as ZFS) or cloud 
file systems (S3, Azure Blob Storage, etc.).
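
For example, a minimal sketch of persisting a DataFrame without HDFS - all 
paths, bucket and account names below are placeholders:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: Spark persisting data without HDFS.
    val spark = SparkSession.builder().appName("no-hdfs-persistence").getOrCreate()

    // Read from any file system mounted on the nodes
    val df = spark.read.json("file:///mnt/shared/input/events.json")

    // Local / mounted file system (could itself be distributed, e.g. a ZFS or NFS mount)
    df.write.mode("overwrite").parquet("file:///mnt/shared/output/events")

    // S3 via the s3a connector (needs hadoop-aws on the classpath and credentials configured)
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/events")

    // Azure Blob Storage via the wasbs connector (needs hadoop-azure and the account key configured)
    df.write.mode("overwrite").parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/output/events")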



> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
> 
> Hi All,
> 
> Thanks for your response. Please find the flow diagram below.
> 
> Please help me simplify this architecture using Spark.
> 
> 1) Can I skip steps 1 to 4 and store the data directly in Spark?
> If I store it in Spark, where is it actually stored?
> Do I need to retain Hadoop to store the data,
> or can I store it directly in Spark and remove Hadoop as well?
> 
> I want to remove Informatica for preprocessing and directly load the file 
> data coming from the server into Hadoop/Spark.
> 
> So my question is: can I directly load the file data into Spark? Then where 
> exactly will the data be stored? Do I need to have Spark installed on top 
> of HDFS?
> 
> 2) If I retain the architecture below, can I store the output from Spark 
> directly back to Oracle, from step 5 to step 7, 
> 
> and will Spark's way of storing it back to Oracle perform better than 
> using Sqoop?
> 3) Can I use Spark Scala UDFs to process data from Hive and retain the 
> entire architecture?
> 
> Which of the above would be optimal?
> 
> 
> 
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> 
>> wrote:
>> I strongly agree with Jorn and Russell. There are different solutions for 
>> data movement depending on your needs: frequency, bi-directional drivers, 
>> workflow, handling duplicate records. This space is known as "Change Data 
>> Capture", or CDC for short. If you need more information, I would be 
>> happy to chat with you. I built some products in this space that 
>> extensively used connection pooling over ODBC/JDBC. 
>> 
>> Happy to chat if you need more information. 
>> 
>> -Sachin Naik
>> 
>> >> Hard to tell. Can you give more insights on what you are trying to 
>> >> achieve and what the data is about?
>> >> For example, depending on your use case Sqoop can make sense or not.
>> Sent from my iPhone
>> 
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> 
>>> wrote:
>>> 
>>> You can treat Oracle as a JDBC source 
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases), 
>>> skip Sqoop and Hive tables, and go straight to queries. Then you can skip 
>>> Hive on the way back out (see the same link) and write directly to Oracle. 
>>> I'll leave the performance questions to someone else. 
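
A minimal sketch of that JDBC approach in Scala - the URL, schema/table names 
and credentials below are placeholders, and the Oracle JDBC driver (ojdbc jar) 
has to be on the Spark classpath:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    // Read from Oracle, transform in Spark, write back to Oracle - no Sqoop, no Hive tables.
    val spark = SparkSession.builder().appName("oracle-jdbc").getOrCreate()

    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "secret")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Read a source table; add partitionColumn/lowerBound/upperBound/numPartitions for parallel reads
    val src = spark.read.jdbc(jdbcUrl, "MYSCHEMA.SOURCE_TABLE", props)

    // ... apply transformations / UDFs here ...
    val result = src.filter("AMOUNT > 0")

    // Write the result straight back to Oracle
    result.write.mode("append").jdbc(jdbcUrl, "MYSCHEMA.TARGET_TABLE", props)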
>>> 
>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> 
>>>> wrote:
>>>> 
>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> 
>>>> wrote:
>>>> Hi Team,
>>>> 
>>>> Right now our existing flow is:
>>>> 
>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> 
>>>> destination Hive table --> Sqoop export to Oracle
>>>> 
>>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>> 
>>>> So now I want to know: if I run native Scala UDFs instead of Hive Java 
>>>> UDFs in Spark SQL, will there be any performance difference?
>>>> 
>>>> 
>>>> Can we skip the Sqoop import and export part and 
>>>> 
>>>> instead directly load data from Oracle into Spark, code Scala UDFs for 
>>>> the transformations, and export the output data back to Oracle?
>>>> 
>>>> Right now the architecture we are using is:
>>>> 
>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> 
>>>> Hive --> Oracle 
>>>> What would be the optimal architecture to process data from Oracle using 
>>>> Spark? Can I improve this process in any way?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Sirisha 
>>>> 
> 
