I'm having trouble visualising how to maintain a secondary index on a mutable
column. For instance, given some objects and the number that we have in
inventory:
widget1: { count = 0, colour = blue }
widget2: { count = 5, colour = red }
widget3: { count = 8, colour = green }
widget4: { count = 8, colour = purple }
widget5: { count = 9, colour = green }
widget6: { count = 2, colour = red }
widget7: { count = 1, colour = blue }
If i want a list of widgets ordered by the number that we have, I might
construct an ordered columnfamily that looks like:
count009_widget5: { id = widget5 }
count008_widget4: { id = widget4 }
count008_widget3: { id = widget3 }
count005_widget2: { id = widget2 }
count002_widget6: { id = widget6 }
count001_widget7: { id = widget7 }
count000_widget1: { id = widget1 }
And do slices against that CF. (The additional _widget5 bit of the key is just
so that duplicates don't clobber each other.) This works great if the
inventory-count is immutable with respect to the ID (like a timestamp, as in
the Twissandra example, which does exactly this), but if it's not then whenever
we update the count, we have to also update the secondary index CF. Let's say
we want to increment the inventory count of widget2. This presents some
problems:
1. We're going to have to remove the old count005_widget2 row (action A) and
add a count006_widget2 row (action B). If we do A first, then B, then for some
period widget2 doesn't show up in our inventory report at all. If we do B first
then A, then it shows it twice for a bit. If this is a heavily-trafficed CF,
then at any given time there are likely to be many such inconsistencies adding
up and querying a slice is unlikely to be useful information at all.
2. How do we find the old column name "count005_widget2" to remove it? If it's
a trafficked column we can't use the current count before the increment because
that's rife for race-conditions, and if it's a big CF we can't scan the whole
thing looking for that ID. We could devise a hack by guessing that the
inventory is unlikely to change by more than N% at a time, and pull the slice
count004*.. count010* and only scan that slice for the deletions, but that
imposes a new race-condition where we have simultaneous agents incrementing the
count, decrementing it, and incrementing it (that is, the decrement may delete
the row that the second increment created, and now none of them are in the
index).
Has anybody devised a toolkit or guidelines for working with this situation?