Hi All,

I have around 100B records, with a mix of new, update, and delete records coming in. Updates and deletes are relatively infrequent. I would like some advice on the points below:
1) For data of this size, should I use RDD + reduceByKey or a DataFrame window operation to deduplicate? Which one would outperform the other? Which is more reliable and lower maintenance?

2) Also, how would you suggest we do incremental deduplication? Currently we do a full dedup pass once a week and no dedup on weekdays, to avoid heavy processing. However, I would like to explore an incremental dedup option and weigh the pros and cons. A rough sketch of what I mean by each approach is in the P.S. below.

Any input is highly appreciated!

--
Regards,
Rishi Shah
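
P.S. To make 1) concrete, here is a rough sketch of the two approaches I am comparing (PySpark; the column names id/updated_at/op and the input path are made up for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

# Hypothetical input: one row per change event, with columns
# id (key), updated_at (event timestamp), op ('I'/'U'/'D').
df = spark.read.parquet("s3://bucket/records/")  # placeholder path

# Option A: DataFrame window -- keep only the latest row per key.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
latest_df = (df.withColumn("rn", F.row_number().over(w))
               .filter("rn = 1")
               .drop("rn"))

# Option B: RDD + reduceByKey -- the same "latest row per key" as a reduce.
latest_rdd = (df.rdd
                .map(lambda r: (r["id"], r))
                .reduceByKey(lambda a, b:
                             a if a["updated_at"] >= b["updated_at"] else b)
                .values())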
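
For 2), the incremental option I want to weigh is a daily merge of that day's delta into the existing deduped snapshot, instead of the weekly full pass -- again just a sketch under the same assumed schema and placeholder paths:

# Union the current deduped snapshot with the new day's delta, keep the
# newest row per key, then drop keys whose latest operation is a delete.
snapshot = spark.read.parquet("s3://bucket/deduped/")    # placeholder
delta = spark.read.parquet("s3://bucket/delta/latest/")  # placeholder

merged = snapshot.unionByName(delta)
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
new_snapshot = (merged.withColumn("rn", F.row_number().over(w))
                      .filter("rn = 1")
                      .drop("rn")
                      .filter(F.col("op") != "D"))  # apply deletes

new_snapshot.write.mode("overwrite").parquet("s3://bucket/deduped-next/")

The daily job would then touch only one day of new data plus the snapshot, rather than reprocessing all 100B records weekly.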