Hi Nathan,

One alternative to the pure 2i-based solution for this would be time boxing. Sean referenced it a few months back on the list [1] and it's worth investigating. There are a few other resources I'm failing to remember at the moment, but I'll send them along tomorrow if they come back to me. That said, 2i will most likely work for your queries, too. I would prototype both and let performance testing be your guide.
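
To make the time-boxing idea concrete, here is a minimal sketch of one way the keys could be laid out. It is an illustration of the general technique, not necessarily the scheme described in the thread at [1]. The 5-minute box width, the "user:timestamp" key format, and the function names are all assumptions made for this example, and nothing here calls a Riak client.

```python
from datetime import datetime, timedelta, timezone

# Assumed box width; the right value would come out of performance testing.
BOX = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def box_start(ts, box=BOX):
    """Round a timezone-aware timestamp down to the start of its time box."""
    boxes_since_epoch = (ts - EPOCH) // box  # timedelta // timedelta gives an int
    return EPOCH + boxes_since_epoch * box


def box_key(user, ts, box=BOX):
    """Build a hypothetical key that groups one user's entries per time box."""
    return f"{user}:{box_start(ts, box).strftime('%Y%m%dT%H%M')}"


def keys_for_range(user, start, end, box=BOX):
    """List every box key that overlaps the half-open range [start, end)."""
    keys = []
    current = box_start(start, box)
    while current < end:
        keys.append(box_key(user, current, box))
        current += box
    return keys
```

With keys like these, "all logs for this user in a date window" becomes a list of predictable keys to fetch directly, rather than an index query. The trade-off is that each box has to hold (or point to) many log entries, so box width matters.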
On the topic of cluster sizing, it's tough to pin it down precisely before you're up and running. That said, I would start with five of the SoftLayer Smalls at the very least.

Hope that helps.

Mark
twitter.com/pharkmillups

PS - You might also want to experiment with lower N, R, and W values, as log data tends to be immutable and you can pick up some performance gains by cutting down on how many replicas you're storing and querying.

[1] (Dietrich's talk Sean links to is a great resource)
http://riak.markmail.org/search/?q=timebox#query:timebox+page:1+mid:e3a7ivrn5eyw3vtz+state:results

On Sat, Oct 19, 2013 at 11:47 AM, N. Tucker <ntucker-ml-riak-us...@august20th.com> wrote:
> Hi all, I've been experimenting with using riak to index a large
> amount of log data collected from a bunch of different app instances
> across different machines. I have our app code instrumented such that
> it attaches secondary indexes to log entries based on some interesting
> metadata (for example, the date, the thread id, the hostname, the
> identity of the user on whose behalf we were doing something, if
> appropriate) and then submits them to riak. So far I have this
> working against a 1-node riak cluster on a very small slice of
> production log data, which obviously doesn't really add much benefit.
> Time to see about scaling it up.
>
> Ultimately, I'd like my database to reflect an N-most-recent-days
> window of our logs, to make querying them easier than grepping
> gigabytes and gigabytes of logs across dozens of machines. The
> secondary indexes are especially appealing, because the most common
> task is "give me all the logs associated with this user across all
> machines for a given date window". This seems like a problem riak is
> well suited for, given the appropriate secondary indexes.
>
> Having no riak sizing experience to speak of and no outside guidance,
> my approach was basically going to be to start out with a 3 or 5 node
> cluster of SoftLayer's "small" riak nodes (see
> http://www.softlayer.com/solutions/big-data/riak-hosting ) or
> comparable hardware, then start shoveling data into it and see how
> large a window I can retain (and query against with reasonable
> performance) while still writing at full blast (assuming I can
> actually write full blast to it -- that remains to be seen).
>
> But then I realized there are probably a few people on this list that
> might be able to give me at least a rough recommendation if I can give
> some details on the data load. The average log entry is around 140
> bytes of message and maybe another 60 bytes of metadata for secondary
> indexes. We churn out about 400 million of these log entries per day,
> so in the neighborhood of 4500 per second.
>
> Is this something we should be able to handle on a smallish riak
> cluster using a LevelDB backend? I'm trying to puzzle out just how
> much this scheme will end up costing us. Also, what would be a good
> approach for pruning items as they get outside the sliding N-day
> window? TTL? Delete query by date? Will this be expensive? I've
> also seen some threads recently about LevelDB never actually shrinking
> when data is deleted. Is that a problem I'll run into quickly?
>
> Thanks in advance for any guidance you can give. Even if the advice
> is "give it up, just use <x>, which was designed for exactly this",
> I'm interested in that type of response, too. Maybe I'm overlooking
> something much easier. Stuff like splunk is worth consideration,
> although I'm a pretty big believer in reducing dependencies on outside
> services. I'm also happy to provide more details on our use case if
> what I've provided isn't enough.
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
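
As a rough check on the figures in the quoted message, the arithmetic can be sketched as below. The entry count and byte sizes come from the message itself. The 7-day window is a hypothetical value chosen for illustration, and the replica counts are just inputs (3 is Riak's default n_val). The result is raw payload only: it ignores Riak object metadata, index storage, and LevelDB overhead, so real disk usage would be higher.

```python
ENTRIES_PER_DAY = 400_000_000   # from the quoted message
BYTES_PER_ENTRY = 140 + 60      # message bytes + index metadata bytes, from the quoted message
SECONDS_PER_DAY = 86_400


def writes_per_second(entries_per_day=ENTRIES_PER_DAY):
    """Average write rate, assuming load is spread evenly over the day."""
    return entries_per_day / SECONDS_PER_DAY


def raw_gb(days, n_val, entries_per_day=ENTRIES_PER_DAY,
           bytes_per_entry=BYTES_PER_ENTRY):
    """Raw payload in decimal GB held cluster-wide for a window of `days`
    with `n_val` replicas per object. Excludes all storage overhead."""
    return entries_per_day * bytes_per_entry * n_val * days / 1e9


if __name__ == "__main__":
    print(f"average writes/sec: {writes_per_second():.0f}")
    for n_val in (3, 2):
        # 7 days is a hypothetical retention window, not a figure from the thread.
        print(f"n_val={n_val}: {raw_gb(7, n_val):.0f} GB raw for 7 days")
```

This puts the average rate slightly above the 4500 per second estimated in the message, and it shows how directly a lower N value reduces the amount of data the cluster has to hold.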