Hi all,

I've been experimenting with riak to index a large amount of log data collected from a bunch of app instances across different machines. I have our app code instrumented so that it attaches secondary indexes to log entries based on some interesting metadata (for example the date, the thread id, the hostname, and the identity of the user on whose behalf we were acting, where appropriate) and then submits them to riak. So far I have this working against a 1-node riak cluster on a very small slice of production log data, which obviously doesn't prove much. Time to see about scaling it up.
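For concreteness, the write path looks roughly like the sketch below. This isn't our real client code -- it's an untested sketch against riak's HTTP interface, and the bucket name, index names, and the composite userdate index are made up for illustration (the composite one is something I'm considering to make the per-user date-window query mentioned further down easier):

import json
import uuid

import requests  # plain HTTP client, just for illustration

RIAK_URL = "http://localhost:8098"  # hypothetical node
BUCKET = "logs"                     # hypothetical bucket

def store_log_entry(message, user, host, thread_id, date_str):
    """Write one log entry with secondary indexes attached via the
    x-riak-index-* headers (2i needs the LevelDB backend)."""
    key = uuid.uuid4().hex
    headers = {
        "Content-Type": "application/json",
        # _bin indexes are compared lexically, _int numerically
        "x-riak-index-user_bin": user,
        "x-riak-index-host_bin": host,
        "x-riak-index-thread_bin": thread_id,
        "x-riak-index-date_int": date_str,  # e.g. "20130115"
        # composite index so a single range query can cover
        # "this user, this date window" (see the query sketch below)
        "x-riak-index-userdate_bin": "%s|%s" % (user, date_str),
    }
    body = json.dumps({"msg": message, "user": user, "host": host,
                       "thread": thread_id, "date": date_str})
    resp = requests.put(
        "%s/buckets/%s/keys/%s" % (RIAK_URL, BUCKET, key),
        data=body, headers=headers)
    resp.raise_for_status()
    return key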
Ultimately, I'd like the database to reflect an N-most-recent-days window of our logs, to make querying them easier than grepping gigabytes and gigabytes of logs across dozens of machines. The secondary indexes are especially appealing, because the most common task is "give me all the logs associated with this user, across all machines, for a given date window" (roughly the query sketched at the end of this mail). That seems like a problem riak is well suited for, given appropriate secondary indexes.

Having no riak sizing experience to speak of and no outside guidance, my plan was basically to start with a 3- or 5-node cluster of SoftLayer's "small" riak nodes (see http://www.softlayer.com/solutions/big-data/riak-hosting ) or comparable hardware, start shoveling data into it, and see how large a window I can retain (and query against with reasonable performance) while still writing at full blast -- assuming I can actually write at full blast to it, which remains to be seen. But then I realized there are probably a few people on this list who could give me at least a rough recommendation if I provide some details on the data load.

The average log entry is around 140 bytes of message plus maybe another 60 bytes of metadata for secondary indexes. We churn out about 400 million of these entries per day, so roughly 4,600 per second on average; at ~200 bytes per entry that's on the order of 80 GB of raw payload per day, before replication and index overhead. Is this something we should be able to handle on a smallish riak cluster with the LevelDB backend? I'm trying to puzzle out just how much this scheme will end up costing us.

Also, what would be a good approach for pruning items as they fall outside the sliding N-day window? TTL? A delete pass keyed on the date index? Will that be expensive? I've also seen some recent threads about LevelDB never actually reclaiming space when data is deleted -- is that a problem I'll run into quickly?

Thanks in advance for any guidance you can give. Even if the advice is "give it up, just use <x>, which was designed for exactly this", I'm interested in that type of response too; maybe I'm overlooking something much easier. Stuff like Splunk is worth considering, although I'm a pretty big believer in reducing dependencies on outside services. I'm happy to provide more details on our use case if what I've given here isn't enough.
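And in case it helps to see what I mean by that query: as far as I can tell, a 2i query can only hit one index at a time, so my current thinking is to range-scan a composite userdate_bin index (written as in the sketch above). Again untested, names invented:

from urllib.parse import quote

import requests

RIAK_URL = "http://localhost:8098"  # hypothetical node
BUCKET = "logs"                     # hypothetical bucket

def logs_for_user_in_window(user, start_date, end_date):
    """Range query over a hypothetical composite userdate_bin index whose
    values look like '<user>|<yyyymmdd>'. Because every value shares the
    '<user>|' prefix and the dates are zero-padded, lexical order matches
    date order. Returns the matching object keys; each entry still needs
    its own GET afterwards."""
    lo = quote("%s|%s" % (user, start_date), safe="")
    hi = quote("%s|%s" % (user, end_date), safe="")
    resp = requests.get("%s/buckets/%s/index/userdate_bin/%s/%s"
                        % (RIAK_URL, BUCKET, lo, hi))
    resp.raise_for_status()
    return resp.json()["keys"]

# e.g. keys = logs_for_user_in_window("alice", "20130101", "20130107")

(I realize the index query only returns keys, so pulling the actual entries is a second round of GETs, or a MapReduce job fed by the index query -- which I think is fine for our use.)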