Hi,

I don't think that it is a resource issue now.

After removing the data, the other nodes have low load and are handling the
workload just fine.
And the Java process, when it crashed, was really dead: on shutting down
Riak it stayed around and needed a kill -9 to go away.

I don't think the disks are a problem; rather, I suspect that an earlier
crash may have left bad data that Solr then stumbles over and crashes on.
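
For what it's worth, here is a minimal sketch (Python; it assumes the default
Yokozuna Solr port 8093, the /internal_solr path, and an index named
"myindex", all of which may differ on your cluster) of how one might probe
the local Solr endpoint directly, to tell "Solr is answering" apart from
"Solr is up but timing out":

    import urllib.request

    # Assumed defaults: Solr on port 8093, /internal_solr path, index "myindex".
    URL = "http://127.0.0.1:8093/internal_solr/myindex/select?q=*:*&rows=0"

    try:
        # A short timeout makes a hung-but-alive Solr show up as a failure.
        with urllib.request.urlopen(URL, timeout=2) as resp:
            print("Solr answered with HTTP", resp.status)
    except Exception as exc:
        print("No answer within 2 seconds (slow, hung, or down):", exc)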

Chaim Solomon



On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <jw...@basho.com> wrote:

> Chaim,
>
> Some comments inline:
>
> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <ch...@itcentralstation.com
> > wrote:
>
>> Hi,
>>
>> I've been running into an issue with the yz search acting up.
>>
>>  I've been getting a lot of these:
>>
>> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index
>> object {<<"bucketname">>,<<"123">>} with error {"Failed to index
>> docs",{error,req_timedout}} because
>> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
>> {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
>> {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
>> {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
>> {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
>> {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
>> {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
>> {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>>
>> and the Java process uses a lot of CPU and eventually runs out of memory
>> or something like that and gets stuck. Killing the process gets the cluster
>> back up and running.
>>
>> I am guessing that it may be data corruption on the yz data on one node.
>>
>> Clearing away the yz data on that node and restarting riak makes the
>> system work again - and I guess AAE will rebuild the index.
>>
>>
> This sounds very similar to the issue last week. I would certainly like to
> rule out any sort of data corruption (are you thinking your disks are
> corrupting the data or are you assuming Solr is?).
>
> However, it is also possible, as with the last issue, that the node/cluster
> simply does not have enough memory. When you delete the data, Solr no longer
> has anything to cache in memory and thus uses significantly less. As
> discussed, the recommended minimum
>
>
>> But I'm wondering why a crashing Java process on one node practically takes
>> down search for the whole cluster. Shouldn't Riak be more resilient than that?
>>
>
> The hard part here is that, at least initially, the Java process doesn't
> crash; it just starts to time out. In distributed systems, a slow node is
> often worse than a down node. Riak, prior to 1.4, had something called a
> "health check" that would mark a node down in this situation. Unfortunately,
> in some workloads (and, given your cluster's limited resources, I believe it
> would happen here), this often results in excessive work being offloaded to
> another node, which also does not have sufficient resources, and around we
> go until the entire cluster falls over. A capacity problem, typically, can
> only be solved by adding more capacity.
>
>
>>
>> Is there an explicit reindex command for the full-text search subsystem?
>>
>> Could Riak keep an eye on the java process and restart it if it crashes
>> or runs away?
>>
>>
> Riak does manage the JVM process (starting/stopping/restarting). I agree
> that if we could also cover run-away processes, like in your case, that
> would be even better. I would have to think a bit more about how this would
> work (to prevent the same problems mentioned above with the old-style health
> check).
>
> Jordan
>
>
>
>> Chaim Solomon
>>
>>
>>
>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
