v1.0.3 search merge index preventing partition handoff?

Fisher, Ryan Wed, 25 Jan 2012 12:18:51 -0800

Hello all,

We are hitting an issue with a riak 1.0.3 cluster when adding new nodes to
the ring.  Specifically the handoff appears stuck and isn't making any
progress.


I have read a number of the threads on here and realize handoff will take a
while, and have also tried attaching to the console and doing a force_update
along w/ force_handoffs.  However, over 12 hours later the nodes still haven't
made any progress.  After digging through the log files, it appears that the
search merge_index could be my problem.  Possibly the compaction isn't
occurring properly?

We are running a riak 1.0.3 cluster for a research project, where we are
utilizing the python client for reads, writes, and queries of the cluster.
Using a small data set of 20k keys things were humming along nicely.

We then started to ramp up the number of objects and ended up getting to
around 1M objects.  At this same time I added an additional node (w/ plans
to expand to 8 nodes total).

However it appears that the partition handoff is stuck after performing the
'join' command on the 5th node I was adding.

So currently it is a 4 + 1 node cluster with 4 GB of memory per node, and I
am running the bitcask backend with 'search' enabled on some of the buckets.
Specifically I am using the 'out of the box' JSON encoding schema by simply
setting the mime-type to "application/json" when I do the store from the
python client.
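
For anyone trying to reproduce this, here is a minimal sketch of what such a
store amounts to at the HTTP level. It uses only the Python standard library
rather than the riak python client, and the host, port, bucket, key, and
document are placeholder values, not the ones from this cluster. The request
is built but never sent.

```python
import json
import urllib.request


def build_store_request(host, port, bucket, key, doc):
    """Build (but do not send) an HTTP PUT that stores `doc` as JSON.

    The Content-Type header is what selects the default JSON handling
    when search is enabled on the bucket.
    """
    url = "http://%s:%d/riak/%s/%s" % (host, port, bucket, key)
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder values for illustration only.
req = build_store_request(
    "127.0.0.1", 8098, "samples", "sample-1", {"name": "example", "value": 42}
)
```

Passing the result to urllib.request.urlopen() would actually perform the
write.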

I'm wondering if enabling search and using the default JSON schema was too
much data to index?  Outside of increasing the linux file limit on the
nodes, enabling 'search' (in the config file and w/ the pre-commit hook),
and upping the ring_creation_size to 256 (before I started or added any
nodes) there shouldn't be much else out of the ordinary going on.  This was
an original 1.0 riak cluster which I have been performing rolling upgrades
on as the bug fix versions come out.  However, currently all 4 + 1 nodes are
1.0.3.
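
As a rough sense of scale for the handoff, the arithmetic below sketches how
many partitions each node would own before and after the join. It assumes an
even split of the 256-partition ring, which the real claim algorithm only
approximates, and the function name is made up for illustration.

```python
def partitions_per_node(ring_size, nodes):
    """Split ring_size partitions as evenly as possible across nodes."""
    base, extra = divmod(ring_size, nodes)
    # `extra` nodes get one additional partition each.
    return [base + 1] * extra + [base] * (nodes - extra)


before = partitions_per_node(256, 4)  # four original nodes
after = partitions_per_node(256, 5)   # after joining the fifth node

# Partitions the new node would need to receive under this assumption.
to_new_node = min(after)
```

Under that assumption the fifth node ends up with about 51 partitions, all of
which have to arrive via handoff.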

Here are the (I hope) relevant error logs:

Riak error log:
http://pastebin.com/99cdPdCk

Riak crash log:
http://pastebin.com/07FRZkf2

Riak erlang log:
http://pastebin.com/DvdasWyR

Does anyone have any ideas on how to 'unstick' the partition handoff?  Or
maybe the bigger question is: is indexing all of the incoming data (outside
of the disk space requirements) a bad idea?  Perhaps I need to write a custom
schema that limits what gets indexed?
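
To get a feel for how much the default JSON handling would index, a small
sketch like the following lists the field names a document would produce. It
assumes that nested JSON keys are flattened into field names joined with
underscores, which is my understanding of the default extractor and should be
checked against the docs. The function name and sample document are made up.

```python
def indexed_field_names(doc, prefix=""):
    """Return the set of flattened field names found in a JSON-like value."""
    names = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            name = "%s_%s" % (prefix, key) if prefix else key
            names |= indexed_field_names(value, name)
    elif isinstance(doc, list):
        # List items share the field name of the enclosing key.
        for item in doc:
            names |= indexed_field_names(item, prefix)
    elif prefix:
        # Scalar leaf: the accumulated path is one indexed field.
        names.add(prefix)
    return names


sample = {"id": 1, "meta": {"source": "a", "tags": ["x", "y"]}}
fields = indexed_field_names(sample)
```

Running this over a handful of real objects shows how many distinct fields a
custom schema would need to cover, or could leave out.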

I should mention that the search is a 'nice-to-have', but the data is
structured in a way that we know the keys we need at lookup time (for the
most part) and I can probably use m/r to query the rest.  With that, I'm
wondering: if it comes down to it, can search be easily 'undone' on the
cluster?  Maybe as simply as disabling the pre-commit hook, turning it off
in the app.config and then deleting the riak/merge_index directories on each
node?
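
For the first of those steps, a sketch of dropping the search hook from a
bucket's precommit list might look like this. It assumes the hook is
registered as module riak_search_kv_hook with function precommit, which is my
understanding for 1.0.x, and the helper name and sample properties are
hypothetical. It only builds the new properties document and does not talk to
a node.

```python
# Assumed shape of the search pre-commit hook in the bucket properties.
SEARCH_HOOK = {"mod": "riak_search_kv_hook", "fun": "precommit"}


def without_search_hook(props):
    """Return a bucket-properties update with the search hook removed.

    Any other precommit hooks on the bucket are preserved in order.
    """
    hooks = props.get("precommit", [])
    kept = [hook for hook in hooks if hook != SEARCH_HOOK]
    return {"props": {"precommit": kept}}


# Hypothetical current properties for one search-enabled bucket.
current = {
    "n_val": 3,
    "precommit": [SEARCH_HOOK, {"mod": "my_validator", "fun": "check"}],
}
update = without_search_hook(current)
```

The resulting document would be sent as JSON in a PUT to the bucket's
properties URL. It does not touch app.config or the merge_index directories.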


Thanks,
ryan

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com