Hi Joe,

Your observation regarding question 5 is correct. The coordinating FSM
would attempt to send the request to the failed vnode and receive
either an error or no reply. A request may still succeed if enough of
the other vnodes respond; "enough" would be determined by the "r",
"w", "dw", or "rw" setting of the request. Handoff would not occur in
this scenario.
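
For example, with the default n_val of 3, a get issued with r=2 will
still succeed when one of the three vnodes is down, because the two
surviving vnodes are enough to satisfy the quorum. A quick sketch
against the internal client (assuming the 0.14-era riak:local_client/0
API):

    {ok, C} = riak:local_client(),
    %% r = 2: succeeds as long as two of the three replicas reply
    {ok, Obj} = C:get(<<"bucket">>, <<"key">>, 2).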

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]


On Wed, Mar 23, 2011 at 5:58 PM, Joseph Blomstedt
<[email protected]> wrote:
>
> Sorry, I don't have a lot of time right now. I'll try to write a more
> detailed response later.
>
> >>> With a few hours of investigation today, your patch is looking
> >>> promising. Maybe you can give some more detail on what you did in your
> >>> experiments a few months ago?
>
> I'll try to write something up when I have the time. I need to find my
> notes. In general, the focus was mostly on performance tuning,
> although I did look into error/recovery a bit as well. My main goal at
> the time was trying to reduce disk seeks as much as possible. Bitcask
> is awesome since it is an append-only store, but if you have multiple
> bitcasks being written to on the same disk, you can still end up with
> disk seeking, depending on how the underlying file system works. I was
> trying to mitigate this as much as possible for a project that used
> bitcask in a predominantly write-only mode (basically as a transaction
> log that was only written to, and read only in failure conditions).
> BTW, concerning RAID, I recall seeing better performance from
> spreading vnode bitcasks across several smaller RAID arrays than from
> using a single larger RAID array during write-heavy bursts.
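>
> To make that concrete, here is the sort of layout I mean; note that
> the plural data_roots key is hypothetical (the stock backend only
> knows data_root), just to illustrate one mount point per smaller
> RAID array:
>
>     %% app.config sketch (hypothetical data_roots key): vnode
>     %% bitcasks get spread across the configured mount points
>     {bitcask, [
>         {data_roots, ["/raid0/bitcask",
>                       "/raid1/bitcask",
>                       "/raid2/bitcask"]}
>     ]}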
>
> >>> Oh, one thing I noticed is that while Riak starts up, if there's a bad
> >>> disk then it will shutdown (the whole node), at this line:
> >>>
> >>>
> >>> https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103
> >>>
> >>>
> >>> That makes sense, but I'm wondering if it's possible to let the node
> >>> start since some of its vnodes would be able to open their bitcasks just
> >>> fine. I wonder if it's as simple as removing that line?
> >>>
>
> You don't want to remove that line; Riak expects the vnode to come
> online, and kills the entire node otherwise. You would need to have a
> vnode failure trigger an ownership change if you really wanted things
> to behave properly.
>
> The better approach is to not have the vnode fail if there are other
> working disks. That's an easy change that I'll throw together when I
> have time. Basically, when a vnode starts, have it pick a bitcask
> directory; if that directory fails, have it pick a different
> directory. If all configured directories fail, then call riak:stop.
> Thus, if a disk fails and a vnode restarts, it should create a new,
> empty bitcask on a working disk. Read repair will then slowly rewrite
> your data as it is accessed (handoff won't occur, though, unless
> that's added in the patch).
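>
> Rough sketch of the directory-failover idea (function and option
> names here are illustrative, not the actual patch):
>
>     %% Try each configured data directory in turn and open this
>     %% partition's bitcask in the first one that works.
>     open_first_good(Partition, []) ->
>         %% every configured directory failed: give up, as today
>         riak:stop("all bitcask data directories failed"),
>         {error, no_usable_dirs};
>     open_first_good(Partition, [Dir | Rest]) ->
>         Path = filename:join(Dir, integer_to_list(Partition)),
>         case bitcask:open(Path, [read_write]) of
>             {error, _Reason} ->
>                 %% bad disk, fall through to the next directory
>                 open_first_good(Partition, Rest);
>             Ref ->
>                 {ok, Ref}
>         end.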
>
> > After reading today's recap, I am a bit unsure:
> >
> >> 5) Q --- Would Riak handle an individual vnode failure the same way as
> >> an entire node failure? (from grourk via #riak)
> >>
> >>    A --- Yes. The request to that vnode would fail and will be routed
> >> to the next available vnode
> >
> > Is it really handled the same way? I don't believe handoff will occur. The
> > R/W values still apply of course, but I think there will be one less replica
> > of the keys that map to the failed vnode until the situation is resolved.
> > I have delved quite a bit into the riak code, but if I really missed
> > something I would be glad if someone could point me to the place where a
> > vnode failure is detected. As far as I can see, the heavy lifting happens in
> > riak_kv_util:try_cast/5 (
> > https://github.com/basho/riak_kv/blob/riak_kv-0.14.1/src/riak_kv_util.erl#L78),
> > which only checks if the whole node is up.
>
> I don't think handoff occurs either. Maybe folks at Basho can look
> into this further, or someone can test it. I'll test it
> tonight/tomorrow if I have the time. It looks like the cast will
> occur, but no reply will ever come back. So, your overall write may
> fail depending on your W-val. Is there something we're both missing
> here?
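>
> For reference, the gist of that check, paraphrased from memory rather
> than quoted from the source:
>
>     %% A target is skipped only when its *node* is down; a dead vnode
>     %% on a live node is still cast to, and the reply never arrives.
>     cast_if_node_up({Index, Node} = Target, Msg, UpNodes) ->
>         case lists:member(Node, UpNodes) of
>             true ->
>                 gen_server:cast({riak_kv_vnode_master, Node},
>                                 {vnode_cast, Index, Msg}),
>                 {sent, Target};
>             false ->
>                 %% node down: the FSM picks a fallback vnode instead
>                 {pang, Target}
>         end.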
>
> -Joe
>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
