Hi again Chris,

Another option would be to use a Merkle tree to quickly drill down to the
differences. This is actually what Cassandra uses internally when running a
repair between different nodes.
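To make the idea concrete, here is a minimal sketch in Python. It is not how Cassandra implements repair; it only illustrates the principle. The names `build_tree`, `differing_buckets`, `keys_to_delete` and `bucket_of` are made up for this example, and bucketing keys by a SHA-256 prefix is an assumption chosen for simplicity. Both sides hash their rows into a fixed number of buckets, build a hash tree over the buckets, and then compare the trees from the root down, descending only into subtrees whose hashes differ.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str, n_leaves: int) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(_h(key.encode())[:8], "big") % n_leaves


def build_tree(rows: dict, n_leaves: int = 8) -> list:
    """Return the tree as a list of levels: levels[0] = leaves, levels[-1] = [root]."""
    if n_leaves < 1 or (n_leaves & (n_leaves - 1)) != 0:
        raise ValueError("n_leaves must be a power of two")
    buckets = [[] for _ in range(n_leaves)]
    for key, value in rows.items():
        buckets[bucket_of(key, n_leaves)].append(f"{key}={value}")
    # Sort inside each bucket so the leaf hash does not depend on insertion order.
    level = [_h("\n".join(sorted(b)).encode()) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_buckets(a: list, b: list) -> list:
    """Walk both trees from the root down, following only mismatching nodes."""
    if len(a) != len(b):
        raise ValueError("trees must have the same number of leaves")
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(a) - 1, -1, -1):
        suspects = [i for i in suspects if a[depth][i] != b[depth][i]]
        if depth > 0:
            # Expand each mismatching node into its two children.
            suspects = [c for i in suspects for c in (2 * i, 2 * i + 1)]
    return suspects


def keys_to_delete(current: dict, desired: dict, n_leaves: int = 8) -> list:
    """Keys present in `current` but absent from `desired`, checking only changed buckets."""
    changed = set(differing_buckets(build_tree(current, n_leaves),
                                    build_tree(desired, n_leaves)))
    return sorted(k for k in current
                  if bucket_of(k, n_leaves) in changed and k not in desired)
```

With `current` standing for the keys and values already in the table and `desired` for the newly generated state, `keys_to_delete(current, desired)` returns the keys that would need a delete. In a real setup the point is that only the rows in the mismatching buckets would have to be read back and compared, so the number of buckets would need to be far larger than 8 for a table with tens of millions of rows.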
Cheers,
Jens

On Wed, Sep 7, 2016 at 9:47 AM <ch...@cmartinit.co.uk> wrote:

> First off, I hope this is appropriate here. I couldn't decide whether this
> was a question for Cassandra users or Spark users, so if you think it's in
> the wrong place feel free to redirect me.
>
> I have a system that does a load of data manipulation using Spark. The
> output of this program is effectively the new state that I want my
> Cassandra table to be in, and the final step is to update Cassandra so that
> it matches this state.
>
> At present I'm inserting all rows in my generated state into Cassandra.
> This works for new rows and also for updating existing rows, but of course
> it doesn't delete any rows that were already in Cassandra but are not in my
> new state.
>
> The problem I have now is how best to delete these missing rows. Options I
> have considered are:
>
> 1. Setting a TTL on inserts which is roughly the same as my data refresh
> period. This would probably be pretty performant, but I really don't want
> to do this because it would mean that all data in my database would
> disappear if I had issues running my refresh task!
>
> 2. Every time I refresh the data, I would first fetch all primary keys from
> Cassandra and compare them to the primary keys locally to create a list of
> PKs to delete before the insert. This seems the most logically correct
> option but is going to result in reading vast amounts of data from
> Cassandra.
>
> 3. Truncating the entire table before refreshing Cassandra. This has the
> benefit of being pretty simple in code, but I'm not sure of the performance
> implications of this and what will happen if I truncate while a node is
> offline.
>
> For reference, the table is on the order of tens of millions of rows, and
> for any data refresh only a very small fraction (<0.1%) will actually need
> deleting. 99% of the time I'll just be overwriting existing keys.
>
> I'd be grateful if anyone could offer some advice on the best solution here
> or on whether there's some better way I haven't thought of.
>
> Thanks,
>
> Chris

--
Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.