You may have mentioned which client you are using (the thread is deep already), but I would think this is a client implementation problem, some sort of connection pooling issue. Try calling curl from a sleep loop in a shell script and see what happens.
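Something along these lines would do; the node address, port, bucket, and key below are placeholders for whatever you're actually hitting:

    #!/bin/sh
    # Poll a known key once per second, logging the HTTP status and total time.
    # http://nodeA:8098 is a placeholder for one of your nodes' HTTP endpoints.
    while true; do
        curl -s -o /dev/null \
             -w "%{http_code} %{time_total}s\n" \
             "http://nodeA:8098/riak/images/known-key?r=1"
        sleep 1
    done

If that loop keeps returning 200s with reasonable timings while the node reboots, the hang is probably in the client's connection handling rather than in Riak itself.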
-Alexander

On Mon, Nov 29, 2010 at 13:27, Jay Adkisson <j4yf...@gmail.com> wrote:
> Hm, that's curious. Are you rebooting the physical machine? When you
> reboot one of the nodes, what happens to HTTP calls to that node? Do they
> immediately error, or do they hang indefinitely?
> In the meantime, I'll add some logging so I can see whether I'm timing out
> on the writes as well, and I'll see what happens with different keys.
> Thanks,
> --Jay
>
> On Mon, Nov 29, 2010 at 10:02 AM, Dan Reverri <d...@basho.com> wrote:
>>
>> Hi Jay,
>> I'm not able to reproduce the behavior you are seeing. Here is what I am
>> doing to try to reproduce the issue:
>> 1. Set up a 4-node cluster
>> 2. Continuously write a new object to Riak every 0.5 seconds
>> 3. Continuously read a known object (GET riak/test/1) from Riak every
>> 0.5 seconds
>> 4. Reboot one of the nodes
>> The reads and writes continue working normally when rebooting the node.
>> Do you see timeouts while writing objects to Riak?
>> Can you try reading other objects from Riak during the reboot (i.e.
>> different keys)?
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> d...@basho.com
>>
>> On Mon, Nov 29, 2010 at 9:39 AM, Jay Adkisson <j4yf...@gmail.com> wrote:
>>>
>>> Hey Dan/Sean,
>>>
>>> Thanks for the response. sasl-error.log on node A is completely empty,
>>> and I see this pattern in erlang.log:
>>>
>>> ===== ALIVE Tue Nov 23 12:46:57 PST 2010
>>> ===== Tue Nov 23 12:57:36 PST 2010
>>> =ERROR REPORT==== 23-Nov-2010::12:57:36 ===
>>> ** Node 'riak@<node D>' not responding **
>>> ** Removing (timedout) connection **
>>> =INFO REPORT==== 23-Nov-2010::12:58:41 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 251195593916248939066258330623111144003363405824 to 'riak@<node D>'
>>> =INFO REPORT==== 23-Nov-2010::12:58:41 ===
>>> Handoff of partition riak_kv_vnode
>>> 251195593916248939066258330623111144003363405824 to 'riak@<node D>'
>>> completed: sent 1 objects in 0.02 seconds
>>> =INFO REPORT==== 23-Nov-2010::12:59:18 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 707914855582156101004909840846949587645842325504 to 'riak@<node D>'
>>> =INFO REPORT==== 23-Nov-2010::12:59:18 ===
>>> Handoff of partition riak_kv_vnode
>>> 707914855582156101004909840846949587645842325504 to 'riak@<node D>'
>>> completed: sent 5 objects in 0.03 seconds
>>> =INFO REPORT==== 23-Nov-2010::12:59:20 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 525227150915793236229449236757414210188850757632 to 'riak@<node D>'
>>> <handoffs, etc...>
>>>
>>> This is my testing process: I'm doing an initial load into riak of small
>>> image files between 1K and 150K, throttled to two images per second, with
>>> W=1. In a different terminal, I'm running a wget every second against
>>> node A for one particular image I already know to be in the cluster,
>>> with R=1. I'm using R,W=1 because I figured that would reduce the chance
>>> of timing out, and with my data pattern, nothing I write to the cluster
>>> will ever change, so I really don't need to wait for a quorum.
>>>
>>> In response to Sean,
>>>>
>>>> 1) Riak detects node outage the same way any Erlang system does - when a
>>>> message fails to deliver, or the heartbeat maintained by epmd fails. The
>>>> default timeout in epmd is 1 minute, which is probably why you're seeing
>>>> it take 1 minute to be detected.
>>>
>>> Thanks, this is enlightening.
>>>
>>>>
>>>> 2) If it takes too long (the vnode is overloaded, perhaps, or is just
>>>> starting up as a hint partition) to retrieve from any node, the request
>>>> can time out.
>>>
>>> That makes sense, but I still wonder why this happens even when the
>>> quorum is already met by the machines that are responding normally?
>>>
>>>>
>>>> 3) You could probably configure epmd to timeout sooner, but then you
>>>> become more vulnerable to temporary partitions. YMMV
>>>
>>> I may try that - it might be a good fit with my data pattern.
>>> Thanks again,
>>> --Jay
>>>
>>> On Mon, Nov 29, 2010 at 4:44 AM, David Smith <diz...@basho.com> wrote:
>>>>
>>>> On Tue, Nov 23, 2010 at 3:33 PM, Jay Adkisson <j4yf...@gmail.com> wrote:
>>>> > (many profuse apologies to Dan - hit "reply" instead of "reply all")
>>>> > Alrighty, I've done a little more digging. When I throttle the writes
>>>> > heavily (2/sec) and set R and W to 1 all around, the cluster works
>>>> > just fine after I restart the node, for about 15-20 seconds. Then the
>>>> > read request hangs for about a minute, until node D disappears from
>>>> > connected_nodes in riak-admin status, at which point it returns the
>>>> > desired value (although sometimes I get a 503):
>>>>
>>>> Are you seeing any error messages in log/erlang.log.* or
>>>> log/sasl-error.log?
>>>>
>>>> Can you expound on your use case a little -- are you doing a large
>>>> insert, or just a random read/write mix? Did you pre-populate the
>>>> dataset? Why are you using r=1 instead of relying on quorum for reads?
>>>>
>>>> How are you running riak-admin status to measure the 15-20 seconds?
>>>>
>>>> Thanks.
>>>>
>>>> D.
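For what it's worth, Dan's reproduction steps above are easy to script the same way. A minimal sketch, assuming Riak's HTTP interface on port 8098 and using placeholder host and bucket names:

    #!/bin/sh
    # Writer: POST a new object every 0.5 seconds (Riak assigns the key).
    while true; do
        curl -s -o /dev/null -X POST \
             -H "Content-Type: text/plain" -d "test data" \
             "http://nodeA:8098/riak/test?w=1"
        sleep 0.5
    done &

    # Reader: GET a known object every 0.5 seconds and log the status code.
    while true; do
        curl -s -o /dev/null -w "%{http_code}\n" \
             "http://nodeA:8098/riak/test/1?r=1"
        sleep 0.5
    done

Rebooting one of the other nodes while both loops run should show whether the stall is reproducible outside your client.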
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com