On 02/22/2012 02:10 PM, char...@contentomni.com wrote:
1. Is Riak a good fit for this solution going up to and beyond 20 million users (i.e. terabytes upon terabytes added per year)?
The better question might be: what do you actually plan to do with that much data?
2. I plan to use 2i, which means I would be using the LevelDB backend. Will this be reasonably performant for billions of keys added each year?
3. I'm using what I have here (http://wiki.basho.com/Cluster-Capacity-Planning.html) as my guide for capacity planning. I plan on using Rackspace Cloud Servers for this specific project. Can I just keep adding servers as the size of my data grows?!
Riak clusters have a functional upper limit of around a hundred nodes; inter-node traffic dominates at that level. That said, at that scale it's gonna be WAY cheaper to run your own HW. If I were speccing a cluster for 16 TB of data/year:
Data warehouse (huge dark pool, variable latency tolerable): say, six Sun Thumpers running ZFS on BSD, 48 TB maximum capacity per box, ~60 TB of usable storage in Riak at N=4, assuming ZFS parity as well. Start small and add drives progressively. 24 rack units total.
Hot cluster (small dark pool, I/O latencies critical): six nodes with one 10.24 TB FusionIO Octal apiece, ~15 TB usable immediately; add additional FIO cards to each node as you grow.
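The usable figures above are just arithmetic. A back-of-the-envelope sketch (the ~17% ZFS parity overhead is an assumed number, not a measurement):

    # Back-of-the-envelope capacity math for the two example clusters.
    # The ~17% ZFS raidz parity overhead is an assumption, not a measurement.

    def usable_tb(nodes, raw_tb_per_node, n_val, parity_overhead=0.0):
        """Usable Riak capacity: total raw disk, minus parity, divided by n_val."""
        raw = nodes * raw_tb_per_node
        after_parity = raw * (1.0 - parity_overhead)
        return after_parity / n_val

    # Data warehouse: 6 Thumpers x 48 TB, n_val=4, ZFS parity on top
    print(usable_tb(6, 48, 4, parity_overhead=0.17))   # ~59.8 TB -> "~60 TB usable"

    # Hot cluster: 6 nodes x one 10.24 TB FusionIO Octal, n_val=4
    print(usable_tb(6, 10.24, 4))                      # ~15.4 TB -> "~15 TB usable"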
The answer isn't scale out or scale up. You can scale *diagonally* and get the benefits of both.
As you grow, rotate in new nodes with bigger hard drives, more memory, more processors, more bandwidth. Drive upgrades are cheap: just shut down the box, install the new HW, and bring it back up. Riak is *good* at this; the other nodes will bring the original box up to speed once it's back. We've rotated in new drives on our six-node cluster three times now, and are about to do it again.
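For what it's worth, the per-node rotation is mechanical enough to script. Here's a rough sketch of the shape of it, assuming the stock riak/riak-admin scripts are on PATH on the box being upgraded and that an operator does the physical drive swap; it's not the procedure verbatim.

    #!/usr/bin/env python3
    # Rough sketch of the one-node-at-a-time drive rotation described above.
    # Assumes the stock `riak` / `riak-admin` scripts are on PATH on this node;
    # the physical drive swap in the middle is still a manual step.
    import subprocess
    import time

    def rotate_this_node():
        subprocess.check_call(["riak", "stop"])      # take this node offline
        input("Swap/upgrade the drives, then press Enter to bring the node back... ")
        subprocess.check_call(["riak", "start"])     # rejoin; peers hand data back via handoff

        # Poll handoff status. The exact output of `riak-admin transfers` varies
        # between Riak versions, so this just prints it and leaves the
        # "caught up yet?" judgement to the operator (Ctrl-C when done).
        while True:
            print(subprocess.check_output(["riak-admin", "transfers"]).decode())
            time.sleep(30)

    if __name__ == "__main__":
        rotate_this_node()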
When virtualized HW becomes the bottleneck (and I guarantee that will happen much sooner than you think), spread out onto physical nodes. When commodity spinning disks are too slow (also sooner than you think), rotate in SSDs. Then exotic solid-state HW. At every stage you can add more nodes with the same class of HW, but there will come an equilibrium point where bigger is cheaper than more.
4. From the guide mentioned in 3 above, it appears I will need about 400 [4 GB RAM, 160 GB HDD] servers for 20 million users (assuming an n_val of 4). This means I would need to add 20 servers annually for each million active users I add. Is it plausible to have an n_val of 4 for this many servers?! Wouldn't going higher just mean I'd have to add many more servers needlessly?!
You can choose whatever n_val you like, up to the number of servers you have. Data volume scales linearly with n_val, so it takes 33% more space for n_val 4 over n_val 3.
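To put numbers on that, here's a quick sanity check of the 400-server figure. The 0.8 MB per user per year is an assumed figure chosen to line up with the ~16 TB/year estimate above; plug in your own.

    # Sanity-checking the poster's numbers. The 0.8 MB/user/year figure is an
    # assumption picked to line up with the ~16 TB/year estimate above.
    def servers_needed(users, mb_per_user, n_val, usable_gb_per_node):
        raw_gb = users * mb_per_user / 1024.0    # raw data before replication
        stored_gb = raw_gb * n_val               # replication multiplies storage linearly
        return stored_gb / usable_gb_per_node

    print(servers_needed(20e6, 0.8, n_val=4, usable_gb_per_node=160))  # ~391 nodes
    print(servers_needed(20e6, 0.8, n_val=3, usable_gb_per_node=160))  # ~293 nodes, i.e. 4/3 = +33%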
5. Should I put all my keys in one bucket (considering I'm using 2i, does it matter)?!
It doesn't really matter. Buckets are just part of the key: Riak keys are actually [bucket, key]. Use them for namespacing.
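For concreteness, here's what namespacing with buckets (plus a 2i tag) looks like in the official Python client. The bucket, key, and index names are made up, and get_index assumes a reasonably recent client version.

    import riak

    # Bucket, key, and index names are illustrative; get_index() assumes a
    # reasonably recent version of the official riak Python client.
    client = riak.RiakClient()

    users = client.bucket("users")        # buckets are just key namespaces
    sessions = client.bucket("sessions")  # the same key can exist in both without clashing

    obj = users.new("charlie", data={"email": "charlie@example.com"})
    obj.add_index("email_bin", "charlie@example.com")   # secondary index (needs the LevelDB backend)
    obj.store()

    # 2i lookup by the indexed field
    print(users.get_index("email_bin", "charlie@example.com"))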
--Kyle