G'day!
We've been running with 1.3.1 for most of this week. Generally it's been
going well. We especially feel happier knowing that Active Anti-Entropy
is keeping an eye on things. As we mostly use MapReduce queries, we
rarely triggered any read repairs, so it's good that we'll be getting
repairs from now on. Nice work!
However, a few things have popped up that I'd be interested in getting
some advice about.
===
Firstly, as mentioned in an earlier message[1] (that seems to have
fallen on deaf ears :-) ) we had a couple of 1.2.1 nodes crash when I
upgraded one of the other nodes to 1.3.1. The current theory is that I
made the mistake of installing the new Riak package on all the nodes
before starting the upgrade. When I restarted the first node it started
doing its handoff checks. The two 1.2.1 nodes that had vnode replicas of
the new 1.3.1 node tried to start their riak_core_handoff_receiver
functions. The only thing I can think of is that the 1.2.1 nodes didn't
actually have those functions in memory, so they went to disk to load
them. Because I'd upgraded the Riak software on those nodes but hadn't
restarted them yet, they couldn't find the module files they were
expecting, so they failed. That's the theory, anyway. So, tip of the day:
don't upgrade your software until you're ready to restart it!
===
Secondly, we've noticed a significant change in our FSM times since
upgrading[2]. The red-ish lines are 95th percentile "puts" from our four
nodes. The blue-ish lines are "gets". We were averaging a stable sub-2ms
for puts before the upgrade and now we're closer to 4ms with a lot of
jitter. The gets are unchanged. Is this related to active anti-entropy?
The AAE trees have been indexed but we're still seeing that puts are slower.
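
For anyone who wants to compare numbers, here is a rough sketch of how the
95th percentile FSM times could be pulled from a node's HTTP /stats
endpoint. It assumes the default HTTP port (8098) and the stat names
node_put_fsm_time_95 and node_get_fsm_time_95, which Riak reports in
microseconds. The helper names and the sample values are made up for
illustration; they are not taken from our cluster.

```python
import json
from urllib.request import urlopen

# Stat names as exposed by Riak's /stats endpoint (values in microseconds).
STAT_KEYS = {"put": "node_put_fsm_time_95", "get": "node_get_fsm_time_95"}


def fsm_p95_ms(stats):
    """Convert the 95th percentile FSM times in a parsed /stats dict to ms."""
    return {name: stats[key] / 1000.0 for name, key in STAT_KEYS.items()}


def fetch_stats(host, port=8098, timeout=5):
    """Fetch and parse /stats from one node (port 8098 is an assumption)."""
    with urlopen(f"http://{host}:{port}/stats", timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical sample, standing in for a real /stats response.
sample = {"node_put_fsm_time_95": 4100, "node_get_fsm_time_95": 1500}
print(fsm_p95_ms(sample))
```

Against a live node, fsm_p95_ms(fetch_stats("some-host")) would give the
same two figures that are plotted in the graph.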
===
Finally, we've started seeing the following error occasionally pop up on
various nodes:
[error] <0.212.0> Supervisor riak_pipe_fitting_sup had child undefined
started with riak_pipe_fitting:start_link() at <0.4459.767> exit with
reason noproc in context shutdown_error
According to riak_pipe issue #49 on GitHub[3] the problem has been
around since 1.1.2, but we've only been seeing it since upgrading to
1.3.1. It doesn't seem to be load related, we don't get any associated
errors in our application, and it happens less than once per day. Is
this anything we should be worrying about?
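
To put a number on "less than once per day", something like the following
could count the error per day in a node's log. It is only a sketch: it
assumes each log entry is a single line starting with a
"YYYY-MM-DD HH:MM:SS.mmm" timestamp, and the sample lines are invented
for illustration rather than copied from our logs.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> [error] <pid> ... riak_pipe_fitting_sup ... noproc"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) \S+ \[error\] .*riak_pipe_fitting_sup.*noproc"
)


def count_per_day(lines):
    """Return {date: count} for riak_pipe_fitting_sup noproc errors."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


# Hypothetical log lines, not real output.
sample_lines = [
    "2013-06-05 10:11:12.345 [error] <0.212.0> Supervisor riak_pipe_fitting_sup "
    "had child undefined started with riak_pipe_fitting:start_link() at "
    "<0.4459.767> exit with reason noproc in context shutdown_error",
    "2013-06-05 10:12:00.000 [info] <0.100.0> some unrelated message",
]
print(count_per_day(sample_lines))
```

Running count_per_day over an open log file (it accepts any iterable of
lines) on each node would show whether the rate changes over time.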
Thanks!
Shane.
[1] http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012237.html
[2] http://i.imgur.com/ucZRTBR.png
[3] https://github.com/basho/riak_pipe/issues/49