Jorn,
Thanks for the response. My downstream database is Kudu.
1. Yes. As you suggested, I have been using a central caching mechanism
that caches the RDD results and compares them with the next batch to
check for the latest timestamps and ignore the older ones. But I see
handlin
What DB do you have?
You have some options, such as
1) use a key-value store (these can be accessed very efficiently) to check
whether a newer record for the same key has already been processed: if yes,
ignore the value; if no, insert it into the database
2) redesign the key to include the timestamp and find out the la
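
A minimal sketch of option 1, assuming each record is a (key, timestamp,
value) tuple and using a plain dict as a stand-in for the key-value store.
The names select_newer and latest_seen are hypothetical, and the actual
write to the database is not shown:

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (key, timestamp, value)
Record = Tuple[str, int, str]


def select_newer(batch: Iterable[Record], latest_seen: Dict[str, int]) -> List[Record]:
    """Return the records that are newer than anything already processed.

    latest_seen maps each key to the newest timestamp processed so far.
    It stands in for the key-value store and is updated in place.
    """
    to_write: Dict[str, Record] = {}
    for key, ts, value in batch:
        prev = latest_seen.get(key)
        if prev is None or ts > prev:
            # Newer than what the store holds: remember it and keep only
            # the newest record per key within this batch.
            latest_seen[key] = ts
            to_write[key] = (key, ts, value)
        # Otherwise a newer record for this key was already processed: ignore.
    return list(to_write.values())


if __name__ == "__main__":
    store = {"a": 10}
    batch = [("a", 5, "old"), ("b", 1, "x"), ("b", 3, "y"), ("a", 12, "new")]
    for record in select_newer(batch, store):
        print(record)  # these are the rows that would be inserted or upserted
```

In a streaming job this check would run once per batch before the write, so
only the surviving records reach the database.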