Hi Wolf,

It looks like a couple of things are going on. I suspect you're hitting an issue with leveldb that we've identified and corrected, where compactions can sometimes build up and block writes. This is enough to cause your handoff timeout issue. If you are affected, you'll have entries in the LOG files in the leveldb dirs that say 'Waiting...', or you'll see large buildups of Compacting NNN@0 where NNN > 8. The fix will be included with the next point release of Riak.
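To check for the LOG entries described above, something like the following Python sketch could be used. It is not a Basho tool. The two patterns come only from the wording in this message ('Waiting...' and Compacting NNN@0 with NNN > 8), so the exact LOG line format is an assumption, and the directory to scan is whatever you pass on the command line.

```python
import os
import re
import sys

# Assumption: compaction lines contain "Compacting NNN@0", as described in the
# message. The real LOG format may differ between leveldb versions.
COMPACTING_RE = re.compile(r"Compacting (\d+)@0(?!\d)")


def scan_log_lines(lines, threshold=8):
    """Return (line_number, reason) pairs for lines that suggest a backlog."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Case-insensitive check for the 'Waiting...' marker.
        if "waiting..." in line.lower():
            hits.append((number, "waiting"))
        match = COMPACTING_RE.search(line)
        if match and int(match.group(1)) > threshold:
            hits.append((number, "compacting %s@0" % match.group(1)))
    return hits


def scan_tree(root):
    """Walk root and report suspicious lines in every file named LOG."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "LOG" in filenames:
            path = os.path.join(dirpath, "LOG")
            with open(path, errors="replace") as handle:
                for number, reason in scan_log_lines(handle):
                    print("%s:%d: %s" % (path, number, reason))


if __name__ == "__main__":
    # Usage: python scan_leveldb_logs.py <path to your leveldb data dir>
    if len(sys.argv) > 1:
        scan_tree(sys.argv[1])
```

No output means neither pattern was found, which does not by itself rule out the problem if the LOG format differs from what is assumed here.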
The busy_dist_port messages are due to buffers filling up in distributed Erlang. This may be caused by the leveldb issue if multiple vnodes are backed up on puts and you don't have very many cores. It may also be due to network load. You can increase the size of the buffers by adding +zdbbl 16384 to your vm.args file. That will give 16 MB buffers rather than the default 1 MB.

The node is actually exiting due to a race condition around restarting vnodes with leveldb. We changed the locking scheme for leveldb to be per file handle in 1.2.0. Previously it was just per OS process. When the handoff died, I think it killed the Erlang process running that vnode. The race exists around relaunching a process to handle the dead vnode.

Delving into details for a moment, this is what I suspect the problem is. The driver we wrote for leveldb uses Erlang NIF resources, which are reference counted. When the process exits, the reference count goes to zero and eventually the Erlang virtual machine calls our cleanup code that closes the database. However, the process that relaunches the vnode is completely decoupled from that cleanup. If the leveldb database has not been closed yet, the relaunched vnode gets the lock error you see in the logs.

Whenever Riak can't start a vnode, it fails safe and shuts itself down. In a cluster, a down node is much better than a sick node.

We're planning to change Riak to be more tolerant of crashes for transient problems like this; however, doing it right will take some thought.

Best, Jon

On Wed, Sep 12, 2012 at 12:57 AM, Wolf Iem <wolf...@gmail.com> wrote:

> Hi all,
>
> We have 4 nodes Riak installation. They are running on Ubuntu 12.04 LTS
> Precise installed servers.
> We have installed 1.1.4 at August 1st 2012 and upgraded 1.2.0 when its
> available.
>
> Server names are:
>
> f1 - 10.10.0.12 - This is the first installed server. We have joined other
> ones to this server. This also serves Riak control.
> s2 - 10.10.0.22 -
> s3 - 10.10.0.23 -
> s4 - 10.10.0.24 - This server also serves Riak control.
>
> This servers randomly shutdown. Yesterday s4 was shutdown.
>
>

--
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmered...@basho.com
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com