Hi All, thanks for the responses. Please find the flow diagram below.
Please help me simplify this architecture using Spark.

1) Can I skip steps 1 to 4 and store the data directly in Spark? If I do store it in Spark, where does it actually get stored? Do I need to retain Hadoop for storage, or can I store the data directly in Spark and remove Hadoop altogether? I also want to remove Informatica for preprocessing and load the file data coming from the server directly into Hadoop/Spark. So my question is: can I load file data directly into Spark, and if so, where exactly will the data be stored? Do I need Spark installed on top of HDFS?

2) If I retain the architecture below, can I write the output from Spark directly back to Oracle (steps 5 to 7), and will writing back to Oracle from Spark perform better than using Sqoop?

3) Can I use a Spark Scala UDF to process the data from Hive and retain the entire architecture?

Which of the above would be optimal? (A rough sketch of the JDBC round trip with a Scala UDF is included after the quoted thread below.)

[image: Inline image 1]

On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> wrote:

> I strongly agree with Jorn and Russell. There are different solutions for
> data movement depending upon your needs: frequency, bi-directional drivers,
> workflow, handling duplicate records. This space is known as "Change Data
> Capture" (CDC for short). If you need more information, I would be happy to
> chat with you. I built some products in this space that extensively used
> connection pooling over ODBC/JDBC.
>
> Happy to chat if you need more information.
>
> -Sachin Naik
>
> >> Hard to tell. Can you give more insights on what you try to achieve
> >> and what the data is about?
> >> For example, depending on your use case sqoop can make sense or not.
>
> Sent from my iPhone
>
> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>
> You can treat Oracle as a JDBC source
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
> and skip Sqoop and Hive tables and go straight to queries. Then you can
> skip Hive on the way back out (see the same link) and write directly to
> Oracle. I'll leave the performance questions for someone else.
>
> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> wrote:
>
>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Right now our existing flow is:
>>
>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle
>>
>> Half of the Hive UDFs required are developed as Java UDFs.
>>
>> So now I want to know: if I run native Scala UDFs instead of the Hive Java
>> UDFs in Spark SQL, will there be any performance difference?
>>
>> Can we skip the Sqoop import and export part and instead directly load
>> data from Oracle into Spark, code Scala UDFs for the transformations, and
>> export the output data back to Oracle?
>>
>> Right now the architecture we are using is:
>>
>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> Hive --> Oracle
>>
>> What would be an optimal architecture to process data from Oracle using
>> Spark? Can I improve this process in any way?
>>
>> Regards,
>> Sirisha
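For reference, here is a rough sketch of the JDBC round trip Russell described, as I understand it. It assumes Spark 2.x with a SparkSession; the connection URL, credentials, table names, and the "normalize" UDF are placeholders for illustration only, and the Oracle ojdbc driver jar would need to be on the classpath (e.g. passed via --jars):

import java.util.Properties
import org.apache.spark.sql.SparkSession

object OracleJdbcRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OracleJdbcRoundTrip")
      .getOrCreate()

    // Placeholder connection details -- substitute the real host, service
    // name, credentials, and table names.
    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val connProps = new Properties()
    connProps.setProperty("user", "app_user")
    connProps.setProperty("password", "app_password")
    connProps.setProperty("driver", "oracle.jdbc.OracleDriver")

    // 1) Read straight from Oracle over JDBC (this replaces the Sqoop import).
    val src = spark.read.jdbc(jdbcUrl, "SOURCE_TABLE", connProps)

    // 2) Register a native Scala UDF instead of a Hive Java UDF.
    spark.udf.register("normalize", (s: String) =>
      if (s == null) null else s.trim.toUpperCase)

    src.createOrReplaceTempView("src")
    val result = spark.sql("SELECT id, normalize(name) AS name FROM src")

    // 3) Write the result straight back to Oracle (this replaces the Sqoop export).
    result.write
      .mode("append") // or "overwrite", depending on the target table
      .jdbc(jdbcUrl, "TARGET_TABLE", connProps)

    spark.stop()
  }
}

For larger tables, spark.read.jdbc also takes partitionColumn, lowerBound, upperBound and numPartitions so the read runs in parallel, which is the main knob to tune when comparing against Sqoop.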