Re: Whole cluster times out if one node is gone

2010-11-29 Thread Dan Reverri
If I continuously read from the node that I am rebooting, the request made to that node hangs until the client times out; subsequent requests receive a "Failed to connect" error. I am using curl for my tests. Thanks, Dan Daniel Reverri Developer Advocate Basho Technologies, Inc. d...@basho.com

Re: Whole cluster times out if one node is gone

2010-11-29 Thread Alexander Sicular
You may have mentioned which client you are using (the thread is deep already) but I would think that this is a client implementation problem. As in some sort of connection pooling thing. Try calling curl from a sleep loop in a shell script and see what happens. -Alexander On Mon, Nov 29, 2010 at

Re: Whole cluster times out if one node is gone

2010-11-29 Thread Jay Adkisson
Hm, that's curious. Are you rebooting the physical machine? When you reboot one of the nodes, what happens to HTTP calls to that node? Do they immediately error, or do they hang indefinitely? In the meantime, I'll add some logging so I can see whether I'm timing out on the writes as well, and

Re: Whole cluster times out if one node is gone

2010-11-29 Thread Dan Reverri
Hi Jay, I'm not able to reproduce the behavior you are seeing. Here is what I am doing to try to reproduce the issue: 1. Set up a 4-node cluster 2. Continuously write a new object to Riak every 0.5 seconds 3. Continuously read a known object (GET riak/test/1) from Riak every 0.5 seconds 4. Reboot one
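
[Editor's sketch, not part of the original message.] The read half of the test above can be approximated with a small polling loop. This is a minimal sketch under stated assumptions: the helper names `probe` and `poll` are hypothetical, the host and port in the usage line are placeholders for one node's HTTP interface, and only the path `riak/test/1` and the 0.5 second interval come from the thread. It labels each request so that a hang until the client timeout can be told apart from an immediate connection failure.

```python
import socket
import time
import urllib.error
import urllib.request

# Talk to the node directly; ignore any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=2.0):
    """Issue one GET and classify the outcome as a short label."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return "ok %d" % resp.status
    except urllib.error.HTTPError as err:
        # The node answered, but with an error status (e.g. 404).
        return "http %d" % err.code
    except urllib.error.URLError as err:
        # Connect-phase failures arrive wrapped in URLError.
        if isinstance(err.reason, socket.timeout):
            return "timeout"
        return "connect_failed"
    except socket.timeout:
        # The connection was made but the response did not arrive in time.
        return "timeout"
    except OSError:
        # Connection reset or similar low-level failure.
        return "connect_failed"


def poll(url, interval=0.5, count=10, timeout=2.0):
    """Probe `url` `count` times, returning (label, elapsed_seconds) pairs."""
    results = []
    for _ in range(count):
        started = time.monotonic()
        outcome = probe(url, timeout)
        results.append((outcome, time.monotonic() - started))
        time.sleep(interval)
    return results


if __name__ == "__main__":
    # Placeholder address: substitute the node under test.
    for outcome, elapsed in poll("http://127.0.0.1:8098/riak/test/1"):
        print("%-15s %.2fs" % (outcome, elapsed))
```

Running this against the node being rebooted and, separately, against a node that stays up would show whether the timeouts are confined to the rebooted node or affect the whole cluster.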

Re: Whole cluster times out if one node is gone

2010-11-29 Thread Jay Adkisson
Hey Dan/Sean, Thanks for the response. sasl-error.log on node A is completely empty, and I see this pattern in erlang.log: = ALIVE Tue Nov 23 12:46:57 PST 2010 = Tue Nov 23 12:57:36 PST 2010 =ERROR REPORT 23-Nov-2010::12:57:36 === ** Node 'riak@' not responding ** ** Removing (time

Re: Whole cluster times out if one node is gone

2010-11-29 Thread David Smith
On Tue, Nov 23, 2010 at 3:33 PM, Jay Adkisson wrote: > (many profuse apologies to Dan - hit "reply" instead of "reply all") > Alrighty, I've done a little more digging.  When I throttle the writes > heavily (2/sec) and set R and W to 1 all around, the cluster works just fine > after I restart the

Re: Whole cluster times out if one node is gone

2010-11-27 Thread Sean Cribbs
1) Riak detects node outage the same way any Erlang system does - when a message fails to deliver, or the heartbeat maintained by epmd fails. The default timeout in epmd is 1 minute, which is probably why you're seeing it take 1 minute to be detected. 2) If it takes too long (the vnode is overl

Re: Whole cluster times out if one node is gone

2010-11-27 Thread Jay Adkisson
Neville, I'm not sure how you mean. The network gear is all functional, otherwise I wouldn't be able to interact with the machines at all (they're at our colo). But as far as I understand, if I hard reboot a box (or, in a real-world scenario, the pdu fails), the switch will happily continue forwa

Re: Whole cluster times out if one node is gone

2010-11-23 Thread Neville Burnell
Just a thought ... have you verified your switch, cables, nics, etc On 24 November 2010 09:33, Jay Adkisson wrote: > (many profuse apologies to Dan - hit "reply" instead of "reply all") > > Alrighty, I've done a little more digging. When I throttle the writes > heavily (2/sec) and set R and W t

Re: Whole cluster times out if one node is gone

2010-11-23 Thread Jay Adkisson
(many profuse apologies to Dan - hit "reply" instead of "reply all") Alrighty, I've done a little more digging. When I throttle the writes heavily (2/sec) and set R and W to 1 all around, the cluster works just fine after I restart the node for about 15-20 seconds. Then the read request hangs fo

Re: Whole cluster times out if one node is gone

2010-11-22 Thread Jay Adkisson
Hey Dan, Thanks for the response! I tried it again while watching `riak-admin status` - basically, it takes about 30 seconds of node C being down before riak realizes it's gone. During that time, if I'm writing to the cluster at all (I throttled it to 2 writes per second for testing), both write

Re: Whole cluster times out if one node is gone

2010-11-22 Thread Dan Reverri
Your HTTP calls should not be timing out. Are you sending requests directly to the Riak node or are you using a load balancer? How much load are you placing on node A? Is it a write only load or are there reads as well? Can you confirm "all" requests time out or is it a large subset of the reque

Whole cluster times out if one node is gone

2010-11-22 Thread Jay Adkisson
Hey all, Here's what I'm seeing: I have four nodes A, B, C, and D. I'm loading lots of data into node A, which is being distributed evenly across the nodes. If I physically reboot node D, all my HTTP calls time out, and `riak-admin ringready` complains that not all nodes are up. Is this intende