Thanks for the detailed description, buddy. But this will actually be done
through NiFi (end to end), so we need to add the delta logic inside NiFi to
automate the whole process.
That's why we need a good (ideally the best) solution to this problem, since
this is a classic issue which any company can face.
Your second point: that's going to be a bottleneck for all the programs
which fetch data from that folder and then have to add extra filters to the
DataFrame. I want to finish that off there itself.
And that merge logic is weak when one table is huge and the other is very
small (which is the case here).
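For the huge-table-versus-small-table case, the usual remedy is a broadcast-style hash join: turn the small side into an in-memory lookup and stream the big side past it, instead of shuffling both. A minimal pure-Python sketch of the idea (the table contents and key name are made up for illustration):

```python
# Broadcast-style hash join sketch: the small table becomes an in-memory
# dict, and the huge table is streamed past it one row at a time.
def broadcast_join(big_rows, small_rows, key):
    # Build the "broadcast" side: key -> row; small enough to fit in memory.
    lookup = {row[key]: row for row in small_rows}
    for row in big_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Merge the matching rows; the small side wins on column conflicts.
            yield {**row, **match}

big = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}, {"id": 3, "val": "c"}]
small = [{"id": 2, "flag": "updated"}]
joined = list(broadcast_join(big, small, "id"))
```

In Spark itself the same effect comes from broadcasting the small DataFrame so every executor gets a local copy and no shuffle of the big table is needed.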
In reality, true real-time analytics would require interrogating the
transaction (redo) log of the RDBMS to watch for changes.
An RDBMS will only keep one current record (the most recent), so if a record
was deleted since the last import into HDFS, that record will no longer exist.
If the record has been
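Without access to the redo log, deletes can at least be approximated by comparing the primary-key set of today's source extract against what is already sitting in HDFS (an anti-join on the key). A minimal sketch, assuming the key sets fit in memory:

```python
# Approximate delete detection without redo-log access: any key present in
# the HDFS copy but absent from the fresh source extract was deleted upstream.
def find_deletes(hdfs_keys, source_keys):
    return set(hdfs_keys) - set(source_keys)

deleted = find_deletes([1, 2, 3, 4], [1, 2, 4])
```

Note this only works against a full key extract from the source; a pure incremental pull cannot reveal deletes this way.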
Well, this one keeps cropping up in every project, especially when Hadoop is
implemented alongside an MPP.
As a matter of fact, there is no reliable out-of-the-box update operation
available in HDFS, Hive, or Spark.
Hence one approach is what Mitch suggested: do not update. Rather,
just keep all source records,
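The append-only pattern above resolves updates at read time: every version of a row is kept with a load timestamp, and readers take only the latest version per key (the same effect as a `ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC)` window in Spark SQL). A minimal pure-Python sketch, with made-up column names:

```python
from itertools import groupby

# Append-only storage: never update in place. Each load appends rows with a
# timestamp; reads deduplicate by taking the newest version of each key.
def latest_versions(rows):
    # Sort by key, newest first, then keep the first row of each key group.
    ordered = sorted(rows, key=lambda r: (r["id"], -r["ts"]))
    return [next(grp) for _, grp in groupby(ordered, key=lambda r: r["id"])]

history = [
    {"id": 1, "ts": 0, "val": "old"},
    {"id": 1, "ts": 5, "val": "new"},   # later load overrides id 1
    {"id": 2, "ts": 0, "val": "only"},
]
current = latest_versions(history)
```

The trade-off is exactly the bottleneck mentioned earlier in the thread: every reader pays the dedup cost unless it is compacted away periodically.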
Well, this is a classic.
The initial load can be done through Sqoop (outside of Spark) or through a
JDBC connection in Spark; 10 million rows is nothing.
Then you have to think of updates and deletes in addition to new rows.
With Sqoop you can load from the last ID in the source table, assuming that
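The "load from the last ID" idea is what Sqoop's incremental-append mode does: remember the highest ID seen in the previous run and pull only rows above it. A pure-Python sketch of the bookkeeping (the in-memory `table` stands in for the real JDBC query):

```python
# Sqoop-style incremental append sketch: fetch only rows whose ID is above
# the watermark from the previous run, then advance the watermark.
def incremental_load(source_rows, last_value):
    new_rows = [r for r in source_rows if r["id"] > last_value]
    new_last = max((r["id"] for r in new_rows), default=last_value)
    return new_rows, new_last

table = [{"id": 1}, {"id": 2}, {"id": 3}]
rows, watermark = incremental_load(table, 1)  # previous run saw up to id 1
```

The assumption (which the reply was about to state) is that IDs are monotonically increasing and rows are never updated in place; this catches inserts only, not updates or deletes.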
Well, as far as I know there is an update statement planned for Spark, but I'm
not sure which release. You could alternatively use Hive + ORC.
Another alternative would be to add the deltas in a separate file and, when
accessing the table, filter out the double entries. From time to time you
could
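That base-file-plus-delta-file pattern can be sketched in a few lines: reads union the two files and drop the double entries (delta wins), while a periodic compaction folds the delta back into the base so the read-time filter stays cheap. File layout and column names here are invented for illustration:

```python
# Base + delta pattern: the base file is immutable; changed/new rows land in
# a small delta file. Reads merge the two, preferring the delta version.
def read_view(base, delta):
    overridden = {r["id"] for r in delta}
    return [r for r in base if r["id"] not in overridden] + delta

def compact(base, delta):
    # Periodic compaction: materialise the merged view as the new base
    # and start over with an empty delta.
    return read_view(base, delta), []

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
delta = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]
view = read_view(base, delta)
```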
Hi all,
I'm trying to pull a full table from Oracle, which is huge, with some 10
million records; this will be the initial load to HDFS.
Then I will do delta loads every day into the same folder in HDFS.
Now, my query here is:
DAY 0 - I did the initial load (full dump).
DAY 1 - I'll load only that