The nodes have 8 GB each - so well more than the recommended value. The Solr heap was at the default of 1 GB - I have now changed it to 2 GB.
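For reference, the line I touched in riak.conf looks roughly like this (the surrounding JVM flags may differ between installs - only the heap values matter here, and bumping -Xms to match -Xmx is just my choice):

  ## before (default)
  search.solr.jvm_options = -d64 -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops

  ## after
  search.solr.jvm_options = -d64 -Xms2g -Xmx2g -XX:+UseStringCache -XX:+UseCompressedOops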
Chaim Solomon

On Mon, Aug 11, 2014 at 6:14 PM, Eric Redmond <eredm...@basho.com> wrote:

> If Solr is stumbling over bad data, your node's solr.log should be filled
> up. If Yokozuna is stumbling over bad data that it's trying to send Solr in
> a loop, the console.log should be full. If yokozuna is going ahead and
> indexing bad values (such as unparsable json), it will go ahead and index a
> blank object with _yz_err (just search for existence). If you have a case
> of sibling explosion, you'll have many duplicates of the same object with
> different _yz_vtag fields (again search for existence).
>
> You said it's not a resource issue, but just to rule that out, how much
> RAM does each node have? Also, how much is made available to Solr? You can
> adjust the max heap size given to Solr in riak.conf, by changing
> search.solr.jvm_options max heap size values from -Xmx1g to -Xmx2g or more.
>
> Eric
>
> On Aug 11, 2014, at 8:03 AM, Chaim Solomon <ch...@itcentralstation.com> wrote:
>
> Hi,
>
> I don't think that it is a resource issue now.
>
> After removing the data, the other nodes had low load and are handling the
> workload just fine.
> And the Java process - when it crashed - was really dead, on shutting down
> Riak it stayed around and needed a -9 to go away.
>
> I don't think the disks are a problem but rather suspect that a crash may
> have caused Solr to stumble over bad data and then crash.
>
> Chaim Solomon
>
> On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <jw...@basho.com> wrote:
>
>> Chaim,
>>
>> Some comments inline:
>>
>> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <ch...@itcentralstation.com> wrote:
>>
>>> Hi,
>>>
>>> I've been running into an issue with the yz search acting up.
>>>
>>> I've been getting a lot of these:
>>>
>>> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index object {<<"bucketname">>,<<"123">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>>>
>>> and the Java process uses a lot of CPU and eventually runs out of memory
>>> or something like that and gets stuck. Killing the process gets the
>>> cluster back up and running.
>>>
>>> I am guessing that it may be data corruption on the yz data on one node.
>>>
>>> Clearing away the yz data on that node and restarting riak makes the
>>> system work again - and I guess AAE will rebuild the index.
>>
>> This sounds very similar to the issue last week. I would certainly like
>> to rule out any sort of data corruption (are you thinking your disks are
>> corrupting the data or are you assuming Solr is?).
>>
>> However, it is also possible, like the last issue, that the node/cluster
>> simply does not have enough memory. When you delete the data Solr no longer
>> has anything to cache in-memory thus using significantly less. As
>> discussed, the recommended minimum
>>
>>> But I'm wondering why a crashing Java on one node practically takes down
>>> the search on the cluster.
>>> Shouldn't Riak be more resilient than that?
>>
>> The hard part here is, at least initially, the Java process doesn't
>> crash, it just starts to time out. In distributed systems a slow node is
>> often worse than a down node. Riak, prior to 1.4, had something called
>> "health check" that would mark a node down in this situation. Unfortunately
>> in some workloads, and I believe given your cluster's limited resources it
>> would happen here, this often results in excessive work being offloaded to
>> another node, which also does not have sufficient resources, and around we
>> go until the entire cluster falls over. A capacity problem, typically, can
>> only be solved by adding more capacity.
>>
>>> Is there an explicit reindex command for the full-text search subsystem?
>>>
>>> Could Riak keep an eye on the Java process and restart it if it crashes
>>> or runs away?
>>
>> Riak does manage the JVM process (starting/stopping/restarting). I agree
>> that if we could also cover a run-away process, like in your case, that
>> would be even better. I would have to think a bit more about how this would
>> work (to prevent the same problems mentioned above with the old-style
>> health check).
>>
>> Jordan
>>
>>> Chaim Solomon
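A note below the quoted thread, for completeness: the existence checks Eric suggests can be run as plain Solr queries through Riak's HTTP search endpoint. A minimal sketch, assuming the Riak 2.0 query URL; the host, port 8098 and the index name "myindex" are placeholders for your own setup:

  # objects that failed extraction and were indexed with a _yz_err field
  curl 'http://localhost:8098/search/query/myindex?wt=json&q=_yz_err:*&rows=10'

  # objects indexed with a _yz_vtag field (present when siblings get indexed)
  curl 'http://localhost:8098/search/query/myindex?wt=json&q=_yz_vtag:*&rows=10'

A non-zero numFound on the first query points at unparsable/bad values; many hits sharing the same _yz_rk but different _yz_vtag on the second points at sibling explosion.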