Thanks for the detailed description, buddy. But this will actually be done through NiFi (end to end), so we need to add the delta logic inside NiFi to automate the whole process.
That's why I need a good (ideally the best) solution to this problem, since this is a classic issue we could face at any company we work with.

On 20-Jul-2016 1:38 AM, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> Well, this is a classic.
>
> The initial load can be done through Sqoop (outside of Spark) or through a JDBC connection in Spark. 10 million rows is nothing.
>
> Then you have to think of updates and deletes in addition to new rows.
>
> With Sqoop you can load from the last ID in the source table, assuming that you have a unique key in your Oracle table.
>
> If you have 10 new rows, I assume you know how to load those rows from Oracle.
>
> I suggest that you add two additional columns to your HDFS/target table:
>
> ,op_type int
> ,op_time timestamp
>
> These two columns will specify the row type, op_type = 1/2/3 for INSERT/UPDATE/DELETE, and op_time = cast(from_unixtime(unix_timestamp()) AS timestamp), i.e. when the record was added.
>
> So you will end up with two additional columns in your HDFS table compared to the Oracle table, and that will be your staging table.
>
> Of course you can do real-time analytics through Oracle GoldenGate, which reads the redo log of the source table in Oracle, or better still SAP Replication Server (SRS). You will achieve real-time integration between RDBMS tables and Big Data.
>
> Once you have the staging table (immutable), the rest is pretty easy. You have the full Entity Life History for each record and you can run your queries on it.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 19 July 2016 at 20:27, Aakash Basu <raj2coo...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm trying to pull a full table from Oracle, which is huge, with some 10 million records; this will be the initial load into HDFS.
>>
>> Then I will do delta loads every day into the same folder in HDFS.
>>
>> Now, my query here is:
>>
>> DAY 0 - I did the initial load (full dump).
>>
>> DAY 1 - I'll load only that day's data, which has, say, 10 records (5 old rows with some column values altered and 5 new rows).
>>
>> Here, my question is: how will I push this file to HDFS through Spark code? If I do an append, it will create duplicates (which I don't want). If I keep separate files and, while using them in the other program, give the path of the folder which contains all the files, then in this case too the registerTempTable will have duplicates for those 5 old rows.
>>
>> What is the BEST logic to be applied here?
>>
>> I tried to resolve this by searching that file for matching records and, where they match, loading the new ones after deleting the old, but this will be time-consuming for such a huge number of records, right?
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
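
To make the staging-table idea above concrete, here is a rough Scala sketch of one way it could look from spark-shell (so sc is already defined). The JDBC URL, credentials, table name, HDFS path and the key column "id" are placeholders I made up for illustration, not anything from this thread, and it uses a HiveContext so the window function works on Spark 1.6:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val sqlContext = new HiveContext(sc)

// Day 0: initial full load from Oracle over JDBC, stamped as op_type = 1 (INSERT).
val initial = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // placeholder
  .option("dbtable", "SCOTT.SOURCE_TABLE")                 // placeholder
  .option("user", "scott").option("password", "tiger")     // placeholder
  .load()
  .withColumn("op_type", lit(1))
  .withColumn("op_time", from_unixtime(unix_timestamp()).cast("timestamp"))

initial.write.mode("overwrite").parquet("/data/staging/source_table")

// Day N: append that day's delta (op_type = 1/2/3 for INSERT/UPDATE/DELETE)
// into the same path, so the staging table stays append-only (immutable):
// deltaDf.write.mode("append").parquet("/data/staging/source_table")

// Querying: resolve the current state by keeping only the latest operation
// per key and dropping rows whose latest operation was a DELETE.
val staged = sqlContext.read.parquet("/data/staging/source_table")
val latest = staged
  .withColumn("rn", row_number().over(
    Window.partitionBy("id").orderBy(col("op_time").desc)))
  .filter("rn = 1 and op_type <> 3")
  .drop("rn")

latest.registerTempTable("source_table_current")

This way the daily 10-row delta is just a small append, duplicates never reach the queries because the window function picks the latest version of each key at read time, and the full history stays in the staging table for the Entity Life History queries Mich mentions.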