I'm having trouble visualising how to maintain a secondary index on a mutable 
column. For instance, given some objects and the number that we have in 
inventory:

widget1: { count = 0, colour = blue }
widget2: { count = 5, colour = red }
widget3: { count = 8, colour = green }
widget4: { count = 8, colour = purple }
widget5: { count = 9, colour = green }
widget6: { count = 2, colour = red }
widget7: { count = 1, colour = blue }

If i want a list of widgets ordered by the number that we have, I might 
construct an ordered columnfamily that looks like:

count009_widget5: { id = widget5 }
count008_widget4: { id = widget4 }
count008_widget3: { id = widget3 }
count005_widget2: { id = widget2 }
count002_widget6: { id = widget6 }
count001_widget7: { id = widget7 }
count000_widget1: { id = widget1 }

And do slices against that CF. (The additional _widget5 bit of the key is just 
so that duplicates don't clobber each other.) This works great if the 
inventory-count is immutable with respect to the ID (like a timestamp, as in 
the Twissandra example, which does exactly this), but if it's not then whenever 
we update the count, we have to also update the secondary index CF. Let's say 
we want to increment the inventory count of widget2. This presents some 
problems:

1. We're going to have to remove the old count005_widget2 row (action A) and 
add a count006_widget2 row (action B). If we do A first, then B, then for some 
period widget2 doesn't show up in our inventory report at all. If we do B first 
then A, then it shows it twice for a bit. If this is a heavily-trafficed CF, 
then at any given time there are likely to be many such inconsistencies adding 
up and querying a slice is unlikely to be useful information at all.

2. How do we find the old column name "count005_widget2" to remove it? If it's 
a trafficked column we can't use the current count before the increment because 
that's rife for race-conditions, and if it's a big CF we can't scan the whole 
thing looking for that ID. We could devise a hack by guessing that the 
inventory is unlikely to change by more than N% at a time, and pull the slice 
count004*.. count010* and only scan that slice for the deletions, but that 
imposes a new race-condition where we have simultaneous agents incrementing the 
count, decrementing it, and incrementing it (that is, the decrement may delete 
the row that the second increment created, and now none of them are in the 
index).

Has anybody devised a toolkit or guidelines for working with this situation?

Reply via email to