It sounds like you understood perfectly. Basically we are running a cluster of machines that are busy doing lots of stuff. We wanted to use Riak to keep configuration information about those machines and the stuff they were doing. So Riak would be running on machines whose primary job is something else. A critical use case for us is to figure out what needs to be done on which other machines after one of the machines goes down. So the possibility that our data becomes unavailable during a failover, because of the failover itself, kills the benefit that we wanted from a high availability system.
We've chosen to go with the simple approach of a relational database on external hardware in a high availability setup. We didn't want that dependency, but we've done enough now that we're committed to it.

On Thu, Jun 9, 2011 at 7:33 PM, Ryan Zezeski <rzeze...@basho.com> wrote:
> Ben,
>
> I hate non-obvious behavior too, and it's something we constantly try to
> fight at Basho. That said, I don't think Riak is in as bad a position as
> you think. Let's see if I can convince you :)
>
> If I'm understanding you correctly you are making two points here:
>
> 1) When performing a join/leave under load most GETs return 404 until data
> transfer has completed.
>
> 2) A node in the cluster has failed and that is causing data to become
> unavailable.
>
> Assuming these are indeed your claims I counter...
>
> 1) Yes, performing a join/leave **can** cause reads to return 404s. Just
> ask Greg Nelson and he can tell you all about it. However, I want to
> emphasize the **can** qualifier here. It depends on the # of nodes you are
> going from->to. The reason this matters is b/c this number will affect how
> the claim algorithm behaves and how much data actually shifts around.
>
> Now I can hear you saying "Yea, but that's still brittle/broken!" Yes, I
> agree 100% with the words I just put in your mouth. My point is simply that
> there are shades of grey here and depending on how many nodes you have you
> might never hit this case (note that 3-5 nodes **will** hit this case). We
> are actively working on a solution to this problem as we recognize its
> seriousness and very much want to see it fixed.
>
> 2) This should absolutely not be happening. This is Riak's bread and butter
> use case, i.e. high availability. My guess is I'm misunderstanding what you
> are saying.
>
> -Ryan
>
> On Thu, Jun 9, 2011 at 8:00 PM, Ben Tilly <bti...@gmail.com> wrote:
>>
>> I am not a developer advocate. But my top hate is that when machines
>> leave/rejoin your data can be inaccessible for some time.
>>
>> We had a great case where we wanted to use Riak, but that was a
>> complete showstopper and we won't be using it because of that. (We
>> wanted to store information which needed to be read in the event of a
>> machine failing. But the machine that could fail would be on the same
>> cluster that was running Riak, so we'd be potentially trying to do
>> reads exactly when data was unavailable.)
>>
>> On Thu, Jun 9, 2011 at 10:25 AM, Srdjan Pejic <spe...@gmail.com> wrote:
>> > What do you guys hate about Riak right now?
>> > _______________________________________________
>> > riak-users mailing list
>> > riak-users@lists.basho.com
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com