Hi Joe,

Your observation regarding question 5 is correct. The coordinating FSM would attempt to send the request to the failed vnode and receive either an error or no reply. A request may still succeed if enough of the other vnodes respond; "enough" is determined by the "r", "w", "dw", or "rw" setting of the request. Handoff would not occur in this scenario.
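The quorum check described above boils down to counting successful vnode replies against the request's "r" (or "w") value. A minimal sketch in Python for illustration only (the real logic lives in Riak's Erlang get/put FSMs; the function name and "ok"/"error" reply values here are made up for the example):

```python
def meets_quorum(replies, r):
    """Return True if at least r of the vnode replies succeeded."""
    ok_count = sum(1 for reply in replies if reply == "ok")
    return ok_count >= r

# With n=3 replicas and r=2, one failed vnode still leaves a quorum:
print(meets_quorum(["ok", "ok", "error"], 2))     # True
# Two failed vnodes do not:
print(meets_quorum(["ok", "error", "error"], 2))  # False
```

So with default settings a single vnode failure is usually invisible to the client; it only surfaces once enough replicas are unavailable that the quorum can no longer be met.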
Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]

On Wed, Mar 23, 2011 at 5:58 PM, Joseph Blomstedt <[email protected]> wrote:
>
> Sorry, I don't have a lot of time right now. I'll try to write a more
> detailed response later.
>
> >>> With a few hours of investigation today, your patch is looking
> >>> promising. Maybe you can give some more detail on what you did in your
> >>> experiments a few months ago?
>
> I'll try to write something up when I have the time. I need to find my
> notes. In general, the focus was mostly on performance tuning,
> although I did look into error/recovery a bit as well. My main goal at
> the time was trying to reduce disk seeks as much as possible. Bitcask
> is great because it is an append-only store, but if you have multiple
> bitcasks being written to on the same disk you can still end up with
> disk seeking, depending on how the underlying file system works. I was
> trying to mitigate this as much as possible for a project that used
> bitcask in a predominantly write-only mode (basically as a transaction
> log that was only written to; read only in failure conditions). BTW,
> concerning RAID, I recall seeing better performance spreading vnode
> bitcasks across several smaller RAID arrays than using a single larger
> RAID array during write-heavy bursts.
>
> >>> Oh, one thing I noticed is that while Riak starts up, if there's a bad
> >>> disk then it will shut down (the whole node), at this line:
> >>>
> >>> https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103
> >>>
> >>> That makes sense, but I'm wondering if it's possible to let the node
> >>> start, since some of its vnodes would be able to open their bitcasks just
> >>> fine. I wonder if it's as simple as removing that line?
>
> You don't want to remove that line; riak expects the vnode to either come
> online or kill the entire node.
> You would need to have a vnode failure
> trigger an ownership change if you really wanted things to behave
> properly.
>
> The better option is to not have the vnode fail if there are other
> working disks. That's an easy change that I'll throw together when I
> have time. Basically, when a vnode starts, have it pick a bitcask
> directory; if that directory fails, have it pick a different
> directory. If all configured directories fail, then call riak:stop.
> Thus, if a disk fails and a vnode restarts, it should create a new,
> empty bitcask on a working disk. Then read repair will slowly rewrite
> your data depending on data access (handoff won't occur, though, unless
> that's added in the patch).
>
> > After reading today's recap, I am a bit unsure:
> >
> >> 5) Q --- Would Riak handle an individual vnode failure the same way as
> >> an entire node failure? (from grourk via #riak)
> >>
> >> A --- Yes. The request to that vnode would fail and will be routed
> >> to the next available vnode
> >
> > Is it really handled the same way? I don't believe handoff will occur. The
> > R/W values still apply, of course, but I think there will be one less replica
> > of the keys that map to the failed vnode until the situation is resolved.
> > I have delved quite a bit into the riak code, but if I really missed
> > something I would be glad if someone could point me to the place where a
> > vnode failure is detected. As far as I can see, the heavy lifting happens in
> > riak_kv_util:try_cast/5 (
> > https://github.com/basho/riak_kv/blob/riak_kv-0.14.1/src/riak_kv_util.erl#L78),
> > which only checks whether the whole node is up.
>
> I don't think handoff occurs either. Maybe folks at Basho can look
> into this further, or someone can test it. I'll test it
> tonight/tomorrow if I have the time. It looks like the cast will
> occur, but never return. So your overall write may fail depending on
> your W-val. Is there something we're both missing here?
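Joe's proposed fallback amounts to: try each configured data directory in order, and only stop the whole node when none can be opened. A rough sketch in Python (the actual change would live in riak_kv_bitcask_backend's Erlang start path; `open_store` and the directory names below are stand-ins invented for this example):

```python
def pick_data_dir(dirs, open_store):
    """Try each configured data directory in turn and return the first
    store that opens successfully.  Only if every directory fails does
    the error propagate -- the point at which the whole node would be
    stopped (the riak:stop case Joe describes)."""
    last_err = None
    for d in dirs:
        try:
            return open_store(d)
        except OSError as err:
            last_err = err  # bad disk: fall through to the next directory
    raise RuntimeError(f"no usable data directory: {last_err}")

# Example with a stand-in open function: the first "disk" is bad,
# so the vnode falls back to the second directory.
def fake_open(path):
    if path == "/data1":
        raise OSError("I/O error")
    return ("bitcask_ref", path)

print(pick_data_dir(["/data1", "/data2"], fake_open))
```

As the thread notes, the new bitcask on the fallback disk starts out empty; read repair then repopulates it lazily as keys are accessed, since handoff is not triggered in this scenario.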
>
> -Joe

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
