Hi Wolf,

It looks like a couple of things are going on.  I suspect you're hitting a
leveldb issue that we've identified and corrected, where compactions can
sometimes build up and block writes.  That is enough to cause your handoff
timeout issue.  If you are affected, you'll have entries in the LOG files
in the leveldb dirs that say 'Waiting...', or you'll see large buildups of
'Compacting NNN@0' where NNN > 8.  The fix will be included in the next
point release of Riak.
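
If it helps, here is a rough Python sketch for checking the LOG files for
those two symptoms.  The exact layout of the compaction lines (something
like "Compacting 12@0 + 3@1 files") and the function names are my
assumptions, so adjust the pattern to whatever your LOG files actually
contain.

```python
import re
from pathlib import Path

# Assumption: compaction lines contain text like "Compacting 12@0 + 3@1 files".
COMPACTING_L0 = re.compile(r"Compacting (\d+)@0")
THRESHOLD = 8  # per the description above, NNN > 8 suggests a backlog


def find_symptoms(lines):
    """Return (line_number, text) pairs for lines showing either symptom."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = COMPACTING_L0.search(line)
        backlog = match is not None and int(match.group(1)) > THRESHOLD
        if "Waiting..." in line or backlog:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_leveldb_logs(data_root):
    """Scan every file named LOG under data_root (hypothetical helper)."""
    results = {}
    for log_path in sorted(Path(data_root).rglob("LOG")):
        with open(log_path, errors="replace") as handle:
            hits = find_symptoms(handle)
        if hits:
            results[str(log_path)] = hits
    return results
```

Call scan_leveldb_logs() with the leveldb data directory of the node you
want to check; it returns a dict keyed by LOG file path, and an empty dict
means neither symptom was found.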

The busy_dist_port messages are due to buffers filling up in distributed
Erlang.  This may be caused by the leveldb issue if multiple vnodes are
backed up on puts and you don't have very many cores.  It may also be due
to network load.  You can increase the size of the buffers by adding +zdbbl
16384 to your vm.args file.  The value is in kilobytes, so that will give
16 MB buffers rather than the default 1 MB.
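
As a sketch, the addition to vm.args would look something like this (the
comment line is only my annotation, and the node needs a restart to pick up
the change):

```
## Distribution buffer busy limit, in kilobytes (16384 KB = 16 MB)
+zdbbl 16384
```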

The node is actually exiting due to a race condition around restarting
vnodes with leveldb.  In 1.2.0 we changed the leveldb locking scheme to be
per file handle.  Previously it was just per OS process.  When the handoff
died, I think it killed the Erlang process running that vnode.  The race
exists around relaunching a process to handle the dead vnode.

Delving into details for a moment, this is what I suspect the problem is.
The driver we wrote for leveldb uses Erlang NIF resources, which are
reference counted.  When the process exits, the reference count drops to
zero and eventually the Erlang virtual machine calls our cleanup code that
closes the database.  However, the process that relaunches the vnode is
completely decoupled from that cleanup.  If the leveldb database has not
been closed yet, the new vnode gets the lock error you see in the logs.
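
To make the ordering concrete, here is a toy Python model of the race as I
understand it.  It is not the Riak or leveldb code; ToyDb, LockError and
simulate are made-up names, and the "deferred cleanup" is just a close()
call that happens later than the restart attempt.

```python
class LockError(Exception):
    """Stands in for the leveldb lock error seen in the logs."""


class ToyDb:
    """Toy database handle: opening takes a per-path lock, close releases it."""

    held = set()  # paths whose lock is currently held

    def __init__(self, path):
        if path in ToyDb.held:
            raise LockError(f"lock already held for {path}")
        ToyDb.held.add(path)
        self.path = path

    def close(self):
        ToyDb.held.discard(self.path)


def simulate():
    """Replay the suspected ordering and record what each restart sees."""
    events = []
    old_handle = ToyDb("vnode-0")  # the vnode process owns this handle

    # The vnode process dies here, but the VM has not run the cleanup yet,
    # so the relaunched vnode races against the still-open database.
    try:
        ToyDb("vnode-0")
        events.append("restart ok")
    except LockError:
        events.append("restart failed: lock")

    old_handle.close()  # the deferred cleanup finally closes the database

    new_handle = ToyDb("vnode-0")  # a restart after cleanup succeeds
    events.append("restart ok")
    new_handle.close()
    return events
```

The first restart attempt fails only because it runs before the cleanup,
which is the window I believe you are hitting.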

Whenever Riak can't start a vnode, it fails safe and shuts itself down.  In
a cluster, a down node is much better than a sick node.  We're planning to
change Riak to be more tolerant of crashes from transient problems like
this; however, doing it right will take some thought.

Best,

Jon

On Wed, Sep 12, 2012 at 12:57 AM, Wolf Iem <wolf...@gmail.com> wrote:

> Hi all,
>
> We have 4 nodes Riak installation. They are running on Ubuntu 12.04 LTS
> Precise installed servers.
> We have installed 1.1.4 at August 1st 2012 and upgraded 1.2.0 when its
> available.
>
> Server names are:
>
> f1 - 10.10.0.12 - This is the first installed server. We have joined other
> ones to this server. This also serves Riak control.
> s2 - 10.10.0.22 -
> s3 - 10.10.0.23 -
> s4 - 10.10.0.24 - This server also serves Riak control.
>
> This servers randomly shutdown. Yesterday s4 was shutdown.
>
>
>


-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmered...@basho.com