Hi Josh,

1024 is too large a ring size for 10 nodes. If it's possible to rebuild your cluster using a ring size of 128 or 256, that would be ideal (http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions). Ring resizing is possible as well (http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/).
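
To put rough numbers on the ring size point, here is a small Python sketch. It is an illustration only, not a Riak tool, and the function names are hypothetical. It assumes partitions are spread evenly across nodes and that the ring divides the 160-bit SHA-1 hash space into `ring_size` equal ranges, which is consistent with the partition IDs in the log lines quoted below.

```python
# Back-of-the-envelope ring arithmetic; an illustration, not part of Riak.
# Assumption: partitions are spread evenly across the nodes.
# Assumption: the ring splits the 160-bit SHA-1 space into equal ranges.

KEYSPACE = 2 ** 160


def vnodes_per_node(ring_size: int, node_count: int) -> float:
    """Average number of partitions (vnodes) hosted by each node."""
    return ring_size / node_count


def partition_number(partition_id: int, ring_size: int) -> int:
    """Map a partition ID as printed in the logs to its position on the ring."""
    width = KEYSPACE // ring_size
    if partition_id % width != 0:
        raise ValueError("ID is not on a partition boundary for this ring size")
    return partition_id // width


if __name__ == "__main__":
    nodes = 10  # cluster size from the original message
    for ring_size in (128, 256, 1024):
        average = vnodes_per_node(ring_size, nodes)
        print(f"ring_size={ring_size}: about {average:.1f} vnodes per node")

    # Partition ID copied from the handoff timeout error quoted below.
    handoff_id = 185542200051774784537577176028434367729757061120
    print("handoff partition number:", partition_number(handoff_id, 1024))
```

With 10 nodes, a ring size of 1024 works out to roughly 102 vnodes per node, against about 13 or 26 for a ring size of 128 or 256. Each vnode carries its own backend and anti-entropy state, so the per-node overhead grows with that count.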

Have all of our recommended performance tunings been applied to every node in this cluster (http://docs.basho.com/riak/latest/ops/tuning/linux/)? These can have a dramatic effect on cluster performance.

--
Luke Bakken
Engineer
lbak...@basho.com

On Tue, Jan 5, 2016 at 10:52 AM, Josh Yudaken <j...@smyte.com> wrote:
> Hi,
>
> We're attempting to use Riak as our primary key-value and search
> database for an analytics-type solution to blocking spam/fraud.
>
> As we expect to eventually be handling a huge amount of data, I
> started with a ring size of 1024. We currently have 10 nodes on Google
> Cloud n1-standard-16 instances [ 16 cores, 60gb RAM, 720gb local ssd.
> ]. Disks are at about 60% usage [ roughly 175gb leveldb, 16gb yz, 45gb
> anti_entropy, 6gb yz_anti_entropy ], and request-wise we're at about
> 20k/min get, 4k/min set. Load average is usually around 6.
>
> I'm assuming most of the issues we're seeing are Yokozuna related, but
> we're seeing a ton of TCP timeouts during handoffs, very slow get/set
> queries, and a slew of other errors.
>
> Right now I'm trying to debug an issue where one of the 10 nodes
> pegged all the CPU cores, mostly with the `beam` process.
>
> # riak-admin top
> Output server crashed: connection_lost
>
> With few other options (as it was causing slow queries across the
> cluster) I stopped the server and saw hundreds of the following
> (interesting) messages in the log:
>
> 2016-01-05 18:28:28.573 [info]
> <0.4958.0>@yz_index_hashtree:close_trees:557 Deliberately marking YZ
> hashtree {1458647141945490998441568260777384029383167049728,3} for
> full rebuild on next restart
>
> As well as a ton of (I think related?):
> 2016-01-05 18:28:31.153 [error] <0.5982.0>@yz_kv:index_internal:237
> failed to index object
> {{<<"features">>,<<"features">>},<<"0NKqMtj3O6_">>} with error
> {noproc,{gen_server,call,[yz_entropy_mgr,{get_tree,1120389438774178506630754486017853682060456099840},infinity]}}
> because
> [{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},{yz_kv,get_and_set_tree,1,[{file,"src/yz_kv.erl"},{line,452}]},{yz_kv,update_hashtree,4,[{file,"src/yz_kv.erl"},{line,340}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,295}]},{yz_kv,index_internal,5,[{file,"src/yz_kv.erl"},{line,224}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1619}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1607}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1398}]}]
>
> For reference the TCP timeout error looks like:
>
> 2016-01-01 01:09:50.522 [error]
> <0.8430.6>@riak_core_handoff_sender:start_fold:272 hinted transfer of
> riak_kv_vnode from 'riak@riak25-2.c.authbox-api.internal'
> 185542200051774784537577176028434367729757061120 to
> 'riak@riak27-2.c.authbox-api.internal'
> 185542200051774784537577176028434367729757061120 failed because of TCP
> recv timeout
>
> Any suggestions about where to look?
>
> Regards,
> Josh

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com