Hi,

I don't think that it is a resource issue now.
After removing the data, the other nodes had low load and are handling the workload just fine. And the Java process - when it crashed - was really dead: on shutting down Riak it stayed around and needed a kill -9 to go away.

I don't think the disks are a problem, but rather suspect that a crash may have caused Solr to stumble over bad data and then crash.
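For reference, the per-node cleanup was roughly the following. This is only a sketch - the data path assumes a default package install where platform_data_dir is /var/lib/riak, so adjust it to match your nodes:

    # on the affected node only
    riak stop

    # move the Yokozuna/Solr data aside rather than deleting it outright,
    # in case it is worth inspecting later
    mv /var/lib/riak/yz /var/lib/riak/yz.broken

    # (some setups also clear the search AAE trees in yz_anti_entropy under
    # the same data directory so they get rebuilt rather than trusted)

    riak start
    # AAE should then repopulate the Solr index from the KV data over time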
Chaim Solomon


On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <jw...@basho.com> wrote:

> Chaim,
>
> Some comments inline:
>
> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <ch...@itcentralstation.com> wrote:
>
>> Hi,
>>
>> I've been running into an issue with the yz search acting up.
>>
>> I've been getting a lot of these:
>>
>> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index
>> object {<<"bucketname">>,<<"123">>} with error {"Failed to index
>> docs",{error,req_timedout}} because
>> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
>>  {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
>>  {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
>>  {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
>>  {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
>>  {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
>>  {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
>>  {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>>
>> and the Java process uses a lot of CPU and eventually runs out of memory,
>> or something like that, and gets stuck. Killing the process gets the
>> cluster back up and running.
>>
>> I am guessing that it may be data corruption of the yz data on one node.
>>
>> Clearing away the yz data on that node and restarting Riak makes the
>> system work again - and I guess AAE will rebuild the index.
>
> This sounds very similar to the issue last week. I would certainly like to
> rule out any sort of data corruption (are you thinking your disks are
> corrupting the data, or are you assuming Solr is?).
>
> However, it is also possible, like the last issue, that the node/cluster
> simply does not have enough memory. When you delete the data, Solr no
> longer has anything to cache in memory and thus uses significantly less.
> As discussed, the recommended minimum [...]
>
>> But I'm wondering why a crashing Java process on one node practically
>> takes down the search on the cluster. Shouldn't Riak be more resilient
>> than that?
>
> The hard part here is that, at least initially, the Java process doesn't
> crash; it just starts to time out. In distributed systems a slow node is
> often worse than a down node. Riak, prior to 1.4, had something called a
> "health check" that would mark a node down in this situation.
> Unfortunately, in some workloads - and I believe, given your cluster's
> limited resources, it would happen here - this often results in excessive
> work being offloaded to another node, which also does not have sufficient
> resources, and around we go until the entire cluster falls over. A capacity
> problem, typically, can only be solved by adding more capacity.
>
>> Is there an explicit reindex command for the full-text search subsystem?
>>
>> Could Riak keep an eye on the Java process and restart it if it crashes
>> or runs away?
>
> Riak does manage the JVM process (starting/stopping/restarting). I agree
> that if we could also cover runaway processes, like in your case, that
> would be even better. I would have to think a bit more about how this
> would work (to prevent the same problems mentioned above with the
> old-style health check).
>
> Jordan
>
>> Chaim Solomon
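Until something along those lines exists in Riak itself, an external watchdog can cover the runaway-JVM case in a crude way. This is a minimal sketch only: the port assumes the default search.solr.port of 8093, the /internal_solr path assumes the stock Yokozuna Solr setup, and the pattern given to pgrep is a guess that should be verified against the actual java command line on a node before relying on it.

    #!/bin/sh
    # Crude watchdog sketch: if the local Solr stops answering within a
    # timeout, kill the hung JVM and let Riak's supervisor start a fresh one.
    SOLR_PORT=8093   # default search.solr.port; change if overridden

    # Without -f, curl only fails on connection errors or the timeout, so a
    # hung JVM trips this while a busy one that still answers does not.
    if ! curl -s --max-time 15 "http://localhost:${SOLR_PORT}/internal_solr/" > /dev/null; then
        # The pgrep pattern is an assumption - check that it matches only the
        # Yokozuna-managed Solr JVM on your nodes.
        PID=$(pgrep -f yokozuna | head -n 1)
        if [ -n "$PID" ]; then
            logger "solr-watchdog: Solr unresponsive, sending kill -9 to pid $PID"
            kill -9 "$PID"
        fi
    fi

Run from cron every minute or two on each node. Note that it will also fire while Solr is still starting up, so the timeout and cron interval need to allow for that, and it is no substitute for fixing an underlying memory or capacity problem.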
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com