First of all, let's eliminate CouchDB. Its write-only-log (WOL) filesystem is irrelevant here, because this is generated data. If a document gets lost (for whatever reason), it can be regenerated. More importantly, CouchDB can't run dynamic queries; that's what the field indexes are for. MongoDB is happy to store & query my data. The assumption is that "Riak Search" will also be able to search it, once released. Is there a reason to think MongoDB will be better than Riak for range search (indexed floats)?

My reasons for using Riak:

* will use it for other buckets anyway (hence reuse of skills & infrastructure)
* would like to add "distributed redundancy" as soon as I can afford more servers for the #bigdata bucket
* can postpone search until the end of Q3 (as advertised) or Q4, perhaps even longer
* the dataset will eventually grow very much bigger (really big) and I'd rather not shard

I'm sure the "cannot tell riak where to place buckets" can be worked around... I have some ideas, please tell me which are possible and which are recommended:

1. I write directly to the big node (rather than through a load-balancing proxy) - would it pass this data to other nodes for a bucket of N=1?
2. I detach the big node from the cluster when writing, temporarily falling back to a 2-node cluster plus a solo node. This can happen during (the relatively safer) western nighttime (which is my daytime) and things will be monitored closely. Once the node joins back into the cluster, would it have reasons to move any of the big bucket data to the other small nodes?
3. Can I run 2 instances (nodes) of Riak on the big server? One for the cluster of 3 nodes (little data), and one (single-node cluster) for the big data bucket.
4. I'd rather save $240 / year (one less Linode 512) -- but it's no big deal to spend either. After I do the big upgrades (some day) - are there tools to migrate the big bucket from the solo-node cluster (into the other cluster)?
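Whichever option wins, I assume the big bucket would need its replication turned down to one copy, as Alexander suggested. Here's a minimal sketch of the bucket-properties document Riak's HTTP interface takes for that (the bucket name "bigdata" and the default port 8098 are my assumptions, not anything confirmed):

```python
import json

# Sketch only: the JSON props document Riak's HTTP interface expects
# for per-bucket settings. n_val=1 keeps a single copy of each object,
# avoiding N replicas landing on the same physical host.
payload = json.dumps({"props": {"n_val": 1}})
print(payload)

# Applying it would look roughly like (against a running node;
# host, port, and bucket name "bigdata" assumed):
#   curl -X PUT http://127.0.0.1:8098/riak/bigdata \
#        -H "Content-Type: application/json" \
#        -d '{"props": {"n_val": 1}}'
```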

I basically want to keep the big data isolated in any way possible. If something goes terribly wrong (e.g. corrupt bucket, or the server explodes?) - no harm done to me. I just boot a fresh instance / node from backup. Whatever state the big data was in (when last backed up) is always good enough for this kind of data.

The order I listed these in is my preferred order. Separate clusters (#3 or #4) are the last option, because (ideally) I'd like to be able to link to the big bucket (though I probably shouldn't link until there's redundancy for that data). But since I can't think of any good reason against it, maybe it's better to run separate clusters for starters and merge them in the future? Would Riak let me do #1 or #2? Or perhaps there is a better way?

Orlin


Alexander Sicular wrote:
- You cannot tell Riak where to place buckets.
- You could set the N val on a bucket to one, and you should in the case of
your 'big bucket'. Otherwise you will get N replicas on the same physical host.
- Use Linode. 512 > 256 = better.

But in reality, your use case doesn't mesh well with what Riak is all about:
distributed redundancy. I would use CouchDB for your 'big bucket' of data.
Couch uses a write-only log (WOL) filesystem with an incremental B-tree index
for map/reduce. This may work better for you.

-Alexander

On Aug 12, 2010, at 6:44 AM, Orlin Bozhinov wrote:

I can easily wait for Riak Search to do this
http://groups.google.com/group/mongodb-user/browse_thread/thread/c2563a8566591a30/b3d19f21675a899e
 - instead of MongoDB.  Does the deployment I have in mind make sense:

I'll get a medium (or large) Linode box for the big dataset bucket.  Hopefully you can 
give me an idea about how much RAM I'll need for that.  This is batch-generated data.  It 
takes time to generate, but (once added) it will not change.  Because of that I'd like to 
save some money and not replicate it.  I plan to have 2 other small Linode servers and 
run a Riak cluster of 3.  Can I tell Riak to keep the big bucket exclusively on the big 
server?  It will be used only for queries.  So if the server crashes, I can just reboot 
it, expecting the same data back up.  Because it's a single-node bucket (if that's even 
possible to have in a cluster), I probably still won't be "linking" to it from 
other buckets (so when it fails, the impact is minimal).  Or maybe I should keep it in a 
separate (single node) cluster anyway?

Cluster separation means I can run the smaller cluster elsewhere.  The Joyent + 
Riak news is very exciting!  I couldn't afford to put the big bucket dataset on it 
(another reason to have 2 clusters) and I'd have to go with the smallest 
SmartMachines for starters.  Would 256 MB RAM be good enough (just for Riak)?  
What kind of load can that handle?  I'm also tempted to just run everything on 
Linode.  It's about 3 times cheaper (as far as memory goes) and the upgrades are less 
dramatic.  Would you recommend that (for a low budget)?  I imagine there will be an 
easy (Linode -> Joyent) Riak migration path...

Orlin

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
