Hi Jeff,

I believe you are encountering BZ 1097 (http://issues.basho.com/1097), where a suddenly truncated bitcask file can cause problems when attempting to merge. The truncation is typically the result of an underlying O/S or hardware failure and simply means that the last record in a bitcask file didn't get fully written. Generally, bitcask recovers from this by ignoring the last incomplete record, but there was a case in the merge code (fixed under this bug report) where this didn't happen properly.
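
To illustrate the "ignore the last incomplete record" behaviour described above, here is a minimal Python sketch. It is not bitcask's code, and the record layout (CRC32, key length, value length, key, value) is a simplified, hypothetical one chosen for the example rather than bitcask's actual on-disk format. The names encode_record and read_records are likewise made up for illustration.

```python
import struct
import zlib

# Hypothetical, simplified record layout (NOT bitcask's real on-disk format):
#   4-byte CRC32 of payload | 4-byte key length | 4-byte value length | key | value
HEADER = struct.Struct(">III")


def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one key/value pair using the illustrative layout above."""
    payload = key + value
    return HEADER.pack(zlib.crc32(payload), len(key), len(value)) + payload


def read_records(data: bytes):
    """Scan an append-only log and return (records, valid_length).

    Scanning stops at the first record that is cut off or fails its CRC,
    so a partially written trailing record is simply ignored.
    """
    records = []
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            break  # the header itself was truncated
        crc, klen, vlen = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + klen + vlen
        if end > len(data):
            break  # the key/value payload was truncated
        payload = data[start:end]
        if zlib.crc32(payload) != crc:
            break  # corrupt record; treat everything from here on as invalid
        records.append((payload[:klen], payload[klen:]))
        offset = end
    return records, offset


# Simulate a crash that cut off the last byte of the second record.
log = encode_record(b"a", b"1") + encode_record(b"b", b"2")
records, valid_length = read_records(log[:-1])
```

In this sketch only the first record survives, and valid_length marks the point up to which the file can be trusted. A merge that does not apply the same check would trip over the partial record, which is roughly the kind of failure the bug report covers.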
So, you have a few options:

1. You can restore your last known good bitcask directory on this node. This is the easiest fix and the other Riak nodes will read-repair any out-of-date values as the data is accessed.

2. You can grab the latest bitcask source, build it and drop that in place on the bjorked node (replacing the existing bitcask code). This is a bit more legwork (since compilation is involved), but should allow the node to recover without further intervention.

Hope that helps,

D.

On Fri, Aug 5, 2011 at 2:12 AM, Jeff Pollard <jeff.poll...@gmail.com> wrote:
> Hey All,
>
> We had one of our riak node servers crash, and when booted back up it's now
> in this very inconsistent state where it responds to requests for a while
> (minute or two), then all requests time out for a little while, then go back
> to not responding to requests. It's been ~90 minutes since the crash and
> reboot of the server, and we're still in this bad state.
>
> We use the bitcask data store, and looking through the logs I see a lot of
> merge failures in the sasl-error.log file. See this gist for the tail -n
> 2000 of the sasl-error.log. The interesting bit is mostly at the bottom:
> https://gist.github.com/1127104
>
> I'm not really sure how to proceed and would love some help on the matter.
> For the time being we have this node pulled out of our load balancer and
> the rest of the nodes see this node as down, so we're still functional in
> production, but I'd obviously like to fix this up ASAP.
>
> One final thing to note is that we have backups of the entire Riak data
> directory from before the crash, which we could restore from if that helps.
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

--
Dave Smith
Director, Engineering
Basho Technologies, Inc.
diz...@basho.com