Hi, I came up with this idea for a Riak schema and wanted to know if you see any downsides to it.
So, I have these devices that generate logs (documents with stats), and each log is tagged with a timestamp. If we just store them like that, in a bucket per device (some-device-name), retrieving data is pretty hard. I know MapReduce over a whole bucket is getting faster and faster, but it's not there yet. The User needs data from Time1 to Time2. We could keep the timestamps as ints and simply execute a read for every integer in that interval, but that's wasteful for a long interval that might contain only, say, 2 logs.

So the idea is to keep something like an index per bucket. Logs would be sequenced and PUT into device-name-bucket with key = counter, and the counter gets incremented with every message. In the meantime, a parallel process (on a timer?) keeps track of time: every minute (or whatever frequency equals the highest resolution of the User's queries) it reads the current counter value and saves it under the current timestamp in index_bucket.

When the User asks "give me logs from device1 from Time1 to Time2", we go to index_bucket and retrieve Time1.value = counter1 and Time2.value = counter2. And that's it! We now have the exact interval of keys to fetch from device-name-bucket. It may be 2 messages (e.g. counter1 = 113 and counter2 = 115) or a million messages, but we won't have any misses on read. It looks pretty efficient.

Sequencing the writes may be troublesome (we'd have to maintain some kind of queue), so we could have many buckets per device, each with its own index and counter. Time1 and Time2 can also be human-readable, which helps.

Do you like it? Or do you maybe have a different approach to this kind of problem? Maybe I'm missing something? cheers
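PS: to make the scheme concrete, here is a minimal sketch in Python. I'm assuming the official Riak Python client; the bucket names, the single in-process counter, and the one-minute snapshot function are just illustrations of the idea, not anything that exists already.

    # Minimal sketch of the scheme above, assuming the Riak Python client.
    # Bucket names and the in-process counter are illustrative assumptions.
    import time
    import riak

    client = riak.RiakClient()
    logs = client.bucket('device1-logs')    # key = counter, value = log document
    index = client.bucket('device1-index')  # key = timestamp, value = counter

    counter = 0  # owned by the single sequencing writer for this bucket

    def put_log(doc):
        """Store one log document under the next counter value."""
        global counter
        counter += 1
        logs.new(str(counter), data=doc).store()

    def snapshot_index():
        """Run every minute (the highest query resolution): record which
        counter value had been reached at this timestamp."""
        minute = int(time.time()) // 60 * 60
        index.new(str(minute), data={'counter': counter}).store()

    def logs_between(t1, t2):
        """Resolve Time1/Time2 to counter1/counter2 via the index bucket,
        then fetch exactly the keys in that interval -- no misses."""
        c1 = index.get(str(t1)).data['counter']
        c2 = index.get(str(t2)).data['counter']
        # e.g. counter1 = 113, counter2 = 115 -> keys 114 and 115, i.e. 2 messages
        return [logs.get(str(k)).data for k in range(c1 + 1, c2 + 1)]

In practice t1 and t2 would have to be rounded to the index resolution and missing index entries handled, but the point stands: the exact key range on the log bucket is known before a single read is issued against it.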