Neville, I'm not sure what you mean. The network gear is all functional; otherwise I wouldn't be able to interact with the machines at all (they're at our colo). But as far as I understand, if I hard reboot a box (or, in a real-world scenario, the PDU fails), the switch will happily continue forwarding packets into nothingness, causing HTTP requests to hang until they time out. From what Dan said, I would expect Riak to handle that sort of situation intelligently. I guess my remaining questions are:

* How does Riak detect that a node is down, and what could cause that to take a full minute?
* When N=3, what about a single node failure could cause a read with R=1 to time out?
* Is there a way to configure the strictness of when nodes are assumed dead? I'm thinking like a "timeout" config option or something.

Peace,
--Jay

On Tue, Nov 23, 2010 at 2:55 PM, Neville Burnell <neville.burn...@gmail.com>wrote:

> Just a thought ... have you verified your switch, cables, nics, etc
>
>
> On 24 November 2010 09:33, Jay Adkisson <j4yf...@gmail.com> wrote:
>
>> (many profuse apologies to Dan - hit "reply" instead of "reply all")
>>
>> Alrighty, I've done a little more digging. When I throttle the writes
>> heavily (2/sec) and set R and W to 1 all around, the cluster works just fine
>> after I restart the node for about 15-20 seconds. Then the read request
>> hangs for about a minute, until node D disappears from connected_nodes in
>> riak-admin status, at which point it returns the desired value (although
>> sometimes I get a 503):
>>
>> --2010-11-23 13:*01:28*-- http://<node A>:8098/riak/<bucket>/<key>?r=1
>> Resolving <node A>... <ip addr>
>> Connecting to <node A>|<ip addr>|:8098... connected.
>> HTTP request sent, awaiting response... *<hang...> *200 OK
>> Length: 3684 (3.6K) [image/jpeg]
>> Saving to: `<key>?r=1'
>>
>> 100%[======================================>] 3,684 --.-K/s in 0s
>>
>> 2010-11-23 13:*02:21* (49.5 MB/s) - `<key>?r=1' saved [3684/3684]
>>
>> --2010-11-23 13:02:23-- http://<node A>:8098/riak/<bucket>/<key>?r=1
>> Resolving <node A>... <ip addr>
>> Connecting to <node A>|<ip addr>|:8098... connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 3684 (3.6K) [image/jpeg]
>> Saving to: `<key>?r=1'
>>
>> 100%[======================================>] 3,684 --.-K/s in 0s
>>
>> 2010-11-23 13:02:23 (220 MB/s) - `<key>?r=1' saved [3684/3684]
>>
>> Afterwards, node D comes back up and re-joins the cluster seamlessly.
>>
>> Any insights?
>>
>> --Jay
>>
>> On Mon, Nov 22, 2010 at 5:59 PM, Jay Adkisson <j4yf...@gmail.com> wrote:
>>
>>> Hey Dan,
>>>
>>> Thanks for the response! I tried it again while watching `riak-admin
>>> status` - basically, it takes about 30 seconds of node C being down before
>>> riak realizes it's gone. During that time, if I'm writing to the cluster at
>>> all (I throttled it to 2 writes per second for testing), both writes and
>>> reads hang indefinitely, and sometimes time out.
>>>
>>> I'm using Ripple to do the writes, and wget to test reads, all on node A
>>> for now, since I know it'll be up. I'm using the default R and W options
>>> for now.
>>>
>>> Thanks for the help and clarification around ringready.
>>>
>>> --Jay
>>>
>>>
>>> On Mon, Nov 22, 2010 at 5:15 PM, Dan Reverri <d...@basho.com> wrote:
>>>
>>>> Your HTTP calls should not be timing out. Are you sending requests
>>>> directly to the Riak node or are you using a load balancer? How much load
>>>> are you placing on node A? Is it a write only load or are there reads as
>>>> well? Can you confirm "all" requests time out or is it a large subset of
>>>> the requests? How large are the objects being written? Are you setting R
>>>> and W in the request? Are you using a particular client (Ruby, Python,
>>>> etc.)? Can you provide the output of "riak-admin status" from node A?
>>>>
>>>> Regarding the ringready command; that is behaving as I would expect
>>>> considering a node is down.
>>>>
>>>> Thanks,
>>>> Dan
>>>>
>>>> Daniel Reverri
>>>> Developer Advocate
>>>> Basho Technologies, Inc.
>>>> d...@basho.com
>>>>
>>>>
>>>> On Mon, Nov 22, 2010 at 4:55 PM, Jay Adkisson <j4yf...@gmail.com>wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Here's what I'm seeing: I have four nodes A, B, C, and D. I'm loading
>>>>> lots of data into node A, which is being distributed evenly across the
>>>>> nodes. If I physically reboot node D, all my HTTP calls time out, and
>>>>> `riak-admin ringready` complains that not all nodes are up. Is this
>>>>> intended behavior? Is there a configuration option I can set so it fails
>>>>> more gracefully?
>>>>>
>>>>> --Jay
>>>>>
>>>>> _______________________________________________
>>>>> riak-users mailing list
>>>>> riak-users@lists.basho.com
>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com