Peter,

My replies are inline.

Mathias Meyer
Developer Advocate, Basho Technologies


On Friday, May 13, 2011 at 20:05, Peter Fales wrote:

> Sean,
> 
> Thanks to you and Ben for clarifying how that works. Since that was
> so helpful, I'll ask a follow-up question, and also a question on
> a mostly unrelated topic...
> 
> 1) When I've removed a couple of nodes and the remaining nodes pick up 
> the slack, is there any way for me to look under the hood and see that?
> I'm using wget to fetch the '.../stats' URL from one of the remaining
> live nodes, and under ring_ownership it still lists the original 4
> nodes, each one owning 1/4 of the total partitions. That's part of the
> reason why I didn't think the data ownership had been moved.
> 
Ring ownership is only affected by nodes explicitly entering and leaving the
cluster. Unless you explicitly tell the cluster to remove a node, or explicitly
tell that node to leave the cluster, ownership will remain the same even in the
case of a failure of one or more nodes. Data ownership, on the other hand, is
moved around implicitly in case of failure: by looking at the preference list,
the coordinating node simply picks the next node(s) to pick up the slack for
the failed one(s).
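
You can see that for yourself by pulling ring_ownership out of the stats, much
like you already do with wget. A rough sketch in Python (the host and port are
placeholders for one of your live nodes):

    import json
    import urllib2

    # Fetch the stats document from a live node; adjust host/port to your setup.
    stats = json.load(urllib2.urlopen("http://127.0.0.1:8098/stats"))

    # ring_ownership only changes when nodes explicitly join or leave the
    # cluster, so it will keep listing all four nodes even while some are down.
    print stats["ring_ownership"]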

The only way to find out whether a handoff is currently happening between any
two nodes is to look at the logs; they'll indicate the beginning and end of a
transfer. The cluster state, and therefore the stats, don't take
re-partitioning or handoff into account yet.
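
If you don't want to tail the logs by hand, you can scan a node's console log
for handoff-related lines. A rough sketch in Python (the log path is an
assumption, adjust it to wherever your nodes write their logs):

    # Print any handoff-related lines from a node's console log.
    # The path is an assumption; point it at your node's log directory.
    with open("/var/log/riak/console.log") as log:
        for line in log:
            if "handoff" in line.lower():
                print line.rstrip()
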
> 2) My test involves sending a large number of read/write requests to the 
> cluster from multiple client connections and timing how long each request
> takes. I find that the vast majority of the requests are processed 
> quickly (a few milliseconds to 10s of milliseconds). However, every once
> in a while, the server seems to "hang" for a while. When that happens
> the response can take several hundred milliseconds or even several 
> seconds. Is this something that is known and/or expected? There 
> doesn't seem to be any pattern to how often it happens -- typically 
> I'll see it a "few" times during a 10-minute test run. Sometimes
> it will go for several minutes without a problem. I haven't ruled
> out a problem with my test client, but it's a fairly simple-minded C++
> program using the protocol buffers interface, so I don't think there
> is too much that can go wrong on that end.
> 
The easiest way to find out whether something is stalling is to look at the
stats and the percentiles for the put and get FSMs, which are responsible for
handling reads and writes. Look for the JSON keys node_get_fsm_time_* and
node_put_fsm_time_*. If anything jumps out there during and shortly after your
benchmark run, something on the Riak or EC2 end is probably waiting for
something else.
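
A rough sketch of how you could dump just those keys around your benchmark run
(again assuming the stats endpoint on one of your live nodes; the host and port
are placeholders):

    import json
    import urllib2

    stats = json.load(urllib2.urlopen("http://127.0.0.1:8098/stats"))

    # Print the get/put FSM timing stats; watch how the higher percentiles
    # behave during and shortly after the benchmark run.
    for key in sorted(stats):
        if key.startswith("node_get_fsm_time") or key.startswith("node_put_fsm_time"):
            print "%s: %s" % (key, stats[key])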

Are you using EBS in any way for storing Riak's data? If so, what kind of setup 
do you have, single volume or RAID? 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
