I figure the following is not currently possible in Riak, so I'd like to propose them as potential features.
We are processing millions of documents with Riak and we need to very quickly find that ones we want. One problem we've had is that some of the terms we want to search on are either very low cardinality (such as timestamps), or very high cardinality (e.g. they match 50% of documents). This is compounded by the very low IO performance of EC2. We've deal with this by creating the equivalent of compound indexes by adding fields to our documents that combine two or more fields we wanted to index and that we could query together. We've also performed some quantization of things like timestamps. This has allows us bring the cardinality terms to a reasonable level and spread our the values among terms more evenly. We are using Riak search for this, as secondary indexes does not currently support compound queries. The downside of this approach is that if we want to answer a question such as, more recent documents from some point in time, we have to loop over these terms trying to find one with some data. That can potentially involve a lot of round trips between the client and the server. Ideally we'd like to be able to write a MR job that takes some input, and if the starting input was not sufficient to produce a result, have it restart at the beginning of the job with a new set of inputs it generates programmatically. This can be somewhat emulated by having repeating groups chains of map-reduce phases, so if an early group did not find what it was looking for, the last reduce phase group can output a new set of bucket-key pairs as input for the next group of map-reduce phases. The problem with this is that is not efficient (the first group could have found what it needed, and the other phases becomes superflous), you may not know head of time how many of this groups to chain, and most problematic for us, we may not know what bucket-key pairs to pass to the next group. Thus, you need looping, but also a way to specify a new input, not just with bucket-key pairs, but as search or secondary indexes query. I suppose the last problem can be solved if we where using Erlang for our phases, as we could then use the Erlang search client to generate a new set of keys based on a new search query. Maybe the a search API could be made available as a new Javascript function that can be used by JS phases. Elias
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com