In my application, I have a frequent need to store collections of data where I rarely access the items individually, but much more commonly in aggregate form. In principle, the mapreduce feature is well suited to this, but it must be fed a set of {bucket,key} pairs to consume, and maintaining these collections is painful.
The naive approach would be to store each collection in its own bucket, and use listkeys to fetch the keys that I can then mapreduce over. However, listkeys has red flags all over it, presumably because it's implemented as a scan+filter of all buckets and keys. The slightly less naive approach would be to store the keys for a collection in a list. Unfortunately, for a collection of even modest size (10,000 elements), the cost of retrieving and storing this index to make small changes (e.g. "I'm adding one element to the collection") rapidly becomes untenable, and that's before such knotty issues as "how do I delete items from this index?" arise. This is essentially the approach taken by the riak_link_index project. A more subtle approach would be to maintain a hierarchical index, with a "current" index that is modified whenever items are added, and promoted to "frozen" once it reaches a certain size. Frozen indices would be maintained in an index of their own, which itself would be frozen, and so on. Performance would be maintained by limiting the maximum size of each index object. This is a similar scheme to the traditional filesystem approach of direct, single indirect, and double indirect block indexing. It's also quite a lot of work to implement. Does Basho have any plans to address this? It seems likely to be a very frequently arising need.
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com