Responses inline.
---
Jeremiah Peschka
Managing Director, Brent Ozar PLF, LLC
On Wed, Feb 22, 2012 at 2:10 PM, <char...@contentomni.com> wrote:

> I'm building an online tool/app that is heavily dependent on messaging. This
> messaging is simple text, nothing complicated, and it will take place
> between my server back-end and the desktop/device. These messages would be
> very easy to store in Riak.
>
> Each message is created after a specific user event e.g. a user posts a
> request, etc. In turn, each message created could spawn another 200 to 3,000
> messages (based on some other social networking features I can't say too
> much about to keep this short). I believe, in this case, we can assume each
> message will be a Riak Object.
>
> All told, from my estimation, I'm looking at 400,000 messages/objects
> generated per user per year. With an estimated active user base of 20
> million (I hope some day), that would be 8 billion keys generated each year.
> The size of each object is about 2Kb max. So that works out about 16
> Terabytes of data generated per year.
>
> 1. Is Riak a good fit for this solution going up to and beyond 20 million
> users (i.e. terabytes upon terabytes added per year)?

I think Riak is a good fit for this solution in terms of its ability to handle the data size.

> 2. I plan to use 2i, which means I would be using the LevelDB backend. Will
> this be reasonably performant for billions of keys added each year?

LevelDB is a good backend fit, especially when the size of your keyspace exceeds the size of RAM.

> 3. I'm using what I have here
> (http://wiki.basho.com/Cluster-Capacity-Planning.html) as my guide for
> capacity planning. I plan on using Rackspace Cloud Servers for this specific
> project. Can I just keep adding servers as the size of my data grows?!

This planning guide is aimed at planning for Bitcask specifically, but most of the advice still applies. You can keep adding servers, but you need to be careful about the initial size of your ring.
The ring size defaults to 64 virtual nodes and it can't be changed once you put data in the cluster, so you'll need to do some careful planning up front. Having more virtual nodes will enable you to safely increase the size of your cluster. I believe the current guidance is that you want no fewer than 10 v-nodes per physical server in the cluster. Also, I seem to recall reading that you want to make sure the number of v-nodes is a power of two. Going by this, you'll want to start with 2048 v-nodes, which could prove somewhat problematic on a small cluster.

> 4. From the guide mentioned in 3 above, it appears I will need about 400
> [4GbRAM 160GbHDD] servers for 20 million users (assuming an n_val of 4).
> This means I would need to add 20 servers annually for each million active
> users I add. Is it plausible to have an n_val of 4 for this many servers?!
> Wouldn't going higher just mean I'd have to add many more servers
> needlessly?!

I'm not sure I understand the question. Basically, each node in the cluster is aware of where data belongs. When you query a Riak node, it'll route the request to the nodes that should have the key in question. With an n_val of 4 and, say, 200 servers, you'll still be querying a maximum of 5 servers (one for whichever node coordinates the request, and up to 4 servers sending data back).

With a large number of servers, I would be more concerned about traffic around the ring. However, some of the changes outlined in Riak 1.1's release notes make me think that it isn't that big of a concern.

As an aside, 4GB of RAM and a 160GB HDD sounds like the specs on a low-end cable box. You can avoid having 200+ servers by using servers with more RAM and more drives. It's something to plan for in the long run, but you can fit an incredible amount of storage and RAM into some server chassis. E.g. the Dell C2200 can hold 192GB of RAM and many TB of storage (12 bays in the chassis), and the server doesn't cost that much in the grand scheme of things.

> 5.
> Should I put all my keys in one bucket (considering I'm using 2i, does it
> matter)?!

Buckets are a logical namespace - use as many or as few as you want. Of course, using multiple buckets could make it easier to move some of your data to another cluster if you find that one cluster can't handle the load.

> I'd appreciate some assistance with this.

A word of warning: I/O is your enemy in shared hosting environments. Be wary of Rackspace's I/O pipeline. Most cloud providers are using low-end commodity servers with low-end commodity storage in the back end. That means you're going to share a host server with multiple tenants, you'll be sharing the same single crappy Broadcom ethernet port with everyone else on that box, and you will most likely be sharing the same Dell EqualLogic or EMC Isilon (I think Rackspace use the Isilon unless you ask for a VMAX). Point is: you'll have a terribly narrow and shared pipeline to your disk subsystem. Expect your I/O to be at 70MB/s or lower. Or... what you'd expect from a USB flash drive.

Edit: Rackspace claim to be using local storage, so you'll be fighting with everyone else on your server for access to the same four 7200 RPM drives ;) Again, expect terrible performance and you won't be disappointed.

> Thanks.
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
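
For what it's worth, the sizing arithmetic in this thread can be sketched in a few lines of Python. Treat it as a back-of-the-envelope illustration only. The function names (`min_ring_size`, `servers_needed`) are made up for this example and are not part of Riak. The "10 v-nodes per server" and "power of two" rules are just the rough guidance recalled in the reply on ring size. The storage estimate assumes decimal units (1 KB = 1,000 bytes) and ignores LevelDB and 2i overhead as well as free-space headroom.

```python
def min_ring_size(servers: int, vnodes_per_server: int = 10) -> int:
    """Smallest power of two that gives every server at least
    `vnodes_per_server` virtual nodes (rule of thumb, not a Riak API)."""
    needed = servers * vnodes_per_server
    size = 1
    while size < needed:
        size *= 2
    return size


def servers_needed(keys: int, object_bytes: int, n_val: int,
                   disk_bytes_per_server: int) -> int:
    """Servers required to hold `keys` objects replicated `n_val` times.

    Counts raw object bytes only; index and backend overhead are ignored,
    so a real cluster would need more than this.
    """
    replicated_bytes = keys * object_bytes * n_val
    # Integer ceiling division avoids floating-point rounding.
    return (replicated_bytes + disk_bytes_per_server - 1) // disk_bytes_per_server


if __name__ == "__main__":
    # Figures quoted in the thread: 8 billion keys/year, ~2 KB each,
    # n_val of 4, 160 GB of disk per server.
    print(servers_needed(8_000_000_000, 2_000, 4, 160 * 10**9))
    print(min_ring_size(200))
    print(min_ring_size(400))
```

With the figures quoted in the thread this gives 400 servers for a year of data. 200 servers need at least 2,000 v-nodes, which rounds up to a ring size of 2048, and 400 servers would round up to 4096.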