This is how things are looking after two weeks:

- there have been no Solr indexing issues for a long period now (2 weeks)
- there have been no Yokozuna errors at all for 2 weeks
- there is an index with an essentially empty schema (just the _yz_* fields); the objects stored in its bucket(s) are binary and so are not analysed by Yokozuna
- the same Yokozuna query, repeated, returns a different num_found each time; the difference between the real number of keys in a bucket and num_found is typically about 25%
- the number of keys repaired by AAE (according to the logs) is about 1-2 per few hours, while the number of keys "missing" from the index is close to 1,000,000
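For reference, this is roughly how I compare the key count with num_found (a sketch from an Erlang shell with the riak-erlang-client; the host, bucket and index names below are placeholders for my real ones, and list_keys is obviously too heavy to run regularly on a busy cluster):

%% Rough consistency check: keys actually in the bucket vs. num_found
%% reported by Yokozuna for a match-all query.
{ok, Pid} = riakc_pb_socket:start_link("10.0.1.1", 8087).

%% Count keys the slow way. {Type, Bucket} is a placeholder pair;
%% list_keys streams every key, so expect this to take a while on a
%% bucket with ~1,000,000 keys (riakc_pb_socket:list_keys/3 takes an
%% explicit timeout if the default is not enough).
{ok, Keys} = riakc_pb_socket:list_keys(Pid, {<<"fs_chunks">>, <<"some_bucket.chunks">>}).
KeyCount = length(Keys).

%% Match-all query with rows=0, so only the count comes back. The
%% result tuple mirrors #search_results{docs, max_score, num_found}
%% from riakc.hrl.
{ok, {search_results, _Docs, _MaxScore, NumFound}} =
    riakc_pb_socket:search(Pid, <<"chunks_index">>, <<"*:*">>, [{rows, 0}]).

io:format("keys in bucket: ~p, num_found: ~p~n", [KeyCount, NumFound]).

Repeating the search call a few times is what gives me the different num_found values mentioned above.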
Should I now try to delete the index and the Yokozuna AAE data and wait another 2 weeks? If yes, how should I delete the index and the AAE data? Will RpbYokozunaIndexDeleteReq be enough?

On 18 March 2016 at 18:54, Oleksiy Krivoshey <oleks...@gmail.com> wrote:

> Hi Magnus,
>
> As of today I have had no Yokozuna messages (in solr.log) for 5 days. In the Riak log I can see a small number of keys being repaired by AAE:
>
> @riak_kv_exchange_fsm:key_exchange:263 Repaired 1 keys during active anti-entropy exchange of {148433760041419827630061740822747494183805648896,3} between {159851741583067506678528028578343455274867621888,'riak@10.0.1.4'} and {171269723124715185726994316333939416365929594880,'riak@10.0.1.5'}
>
> However, the number of such repaired keys is no more than 1-2 per hour, while the number of keys missing from the search index (across different buckets of the same fs_chunks type) is close to 1,000,000.
>
> Thanks!
>
> On 16 March 2016 at 21:49, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>
>> Hi Magnus,
>>
>> I don't see any Solr indexing errors anymore, and no timeout errors either (I've throttled down the application that was querying Yokozuna).
>>
>> I've attached `search aae-status` results from all nodes.
>>
>> A few days ago I saw the following error (a few of them) from Yokozuna's Solr (not sure if it made it into the riak-debug output):
>>
>> <qtp2109588399-37136>@SolrException.java:120 IO error while trying to get the size of the Directory:java.io.FileNotFoundException: _2hky_1.del
>>   at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
>>   at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:178)
>>   at org.apache.solr.core.DirectoryFactory.sizeOf(DirectoryFactory.java:209)
>>   at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:195)
>>   at org.apache.solr.handler.admin.CoreAdminHandler.getIndexSize(CoreAdminHandler.java:1129)
>>   at org.apache.solr.handler.admin.CoreAdminHandler.getCoreStatus(CoreAdminHandler.java:1105)
>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleStatusAction(CoreAdminHandler.java:705)
>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:167)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>   at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:732)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:268)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>>
>> Should I delete the Yokozuna AAE trees?
>>
>> Thanks!
>>
>> On 16 March 2016 at 17:51, Magnus Kessler <mkess...@basho.com> wrote:
>>
>>> Hi Oleksiy,
>>>
>>> Whether or not 'brod' has anything to do with the failure to eventually reach consistency in the Solr indexes remains to be seen. I had just never come across this application in a Riak context and wanted to establish why there appear to be frequent errors in the Riak logs associated with it. As the brod application is not in use, could you temporarily remove it from the Erlang module path while you troubleshoot Solr indexing?
>>>
>>> We recommend having a catch-all field in all Solr schemas, as this prevents indexing failures. However, you are correct that arbitrary MIME types are not analysed unless they have a Yokozuna extractor associated with them. By default the yz_noop_extractor is used.
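(Coming back to my question at the top about RpbYokozunaIndexDeleteReq: the call I have in mind is the one below, a sketch with the riak-erlang-client. As far as I can tell riakc_pb_socket:delete_search_index/2 sends exactly that PB request, and my understanding is that the index has to be dissociated from the bucket type first or the delete is refused. The bucket type name is a placeholder.)

%% Sketch: remove the index association, then delete the index over PB.
{ok, Pid} = riakc_pb_socket:start_link("10.0.1.1", 8087).

%% Point the bucket type at the "_dont_index_" sentinel so nothing
%% references chunks_index any more (placeholder type name).
ok = riakc_pb_socket:set_bucket_type(Pid, <<"fs_chunks">>,
                                     [{search_index, <<"_dont_index_">>}]).

%% Issue the index delete itself (RpbYokozunaIndexDeleteReq).
ok = riakc_pb_socket:delete_search_index(Pid, <<"chunks_index">>).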
>>>
>>> We do not have an easy way to query for missing keys that should be indexed. The necessary code is obviously part of Yokozuna's AAE repair mechanism, but there is currently no tool that exposes this functionality to the end user. This is why one re-indexing strategy is based on deleting the Yokozuna AAE trees, which causes a rebuild of these trees and checks whether the appropriate entries exist in Solr.
>>>
>>> Looking at the output of 'riak-admin search aae_status' from the debug logs, I notice that on the majority of nodes the trees have been rebuilt very recently, but the most recent YZ AAE exchanges happened around 8 days before that (with the exception of one node). Re-indexing occurs during the AAE exchanges. Therefore I have to conclude that the discrepancies in the numbers are due to the majority of the exchanges not yet having taken place. We will have to understand why YZ AAE exchanges are not occurring or not terminating successfully.
>>>
>>> Can you provide more recent results from running `riak-admin search aae_status` on all nodes, please? Also please state whether you observe any more indexing failures or Solr query timeouts in your logs.
>>>
>>> Kind Regards,
>>>
>>> Magnus
>>>
>>> On 15 March 2016 at 17:03, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>
>>>> I wasn't correct in my previous email: the brod application was running before, but it was not used. It is used from a post-commit hook, and that hook wasn't enabled on any bucket yet.
>>>>
>>>> Is there any way I can enable more debug output from Yokozuna? Can I somehow see which node (in the coverage plan) is missing indexed keys?
>>>>
>>>> On 15 March 2016 at 18:46, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>>
>>>>> Hi Magnus,
>>>>>
>>>>> Thank you, I really appreciate you spending your time on this.
>>>>>
>>>>> The brod application included in a post-commit hook wasn't running when I was removing the index and/or expiring the AAE trees, and there were no Yokozuna queries running at that time either. I started these processes just yesterday, so they were running when I issued riak-debug, but not before.
>>>>>
>>>>> The schema I use for xxxxx.chunks is: https://www.dropbox.com/s/ab3br7dxhnjx1sl/Screenshot%202016-03-15%2018.38.52.png?dl=0
>>>>> I will try moving the "ignored" line above the _yz_* fields. However, objects in the xxxx.chunks bucket are binary (contentType is set to 'binary/octet-stream') and I assume Yokozuna shouldn't even try to analyse the content of such objects. Is this correct?
>>>>>
>>>>> Can a "failed to index" error for an "xxxx.files" bucket be the reason for indexing issues for another bucket, xxxxx.chunks, that uses a different index? I mean, does it affect AAE repair or something?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On 15 March 2016 at 18:30, Magnus Kessler <mkess...@basho.com> wrote:
>>>>>
>>>>>> Hi Oleksiy,
>>>>>>
>>>>>> Thank you for providing the debug files. I have downloaded them and had a first look, but can only work on them when there's time between working on issues for paying customers.
>>>>>>
>>>>>> Having said that, I'd like to know the origin of messages like
>>>>>>
>>>>>> 2016-03-14 04:11:59.283 [error] <0.14213.1803> CRASH REPORT Process <0.14213.1803> with 0 neighbours exited with reason: tcp_closed in brod_sock:handle_msg/3 line 158
>>>>>> 2016-03-14 04:11:59.284 [error] <0.20875.1802> gen_server <0.20875.1802> terminated with reason: {socket_down,tcp_closed}
>>>>>> 2016-03-14 04:11:59.284 [error] <0.20875.1802> CRASH REPORT Process <0.20875.1802> with 1 neighbours exited with reason: {socket_down,tcp_closed} in gen_server:terminate/6 line 744
>>>>>> 2016-03-14 04:11:59.285 [error] <0.29923.0> Supervisor replication_sup had child replication started with replication:start_link() at <0.19687.1802> exit with reason {socket_down,tcp_closed} in context child_terminated
>>>>>>
>>>>>> These appear to be related to the 'brod' module (https://github.com/klarna/brod), which has never been part of an official Riak release. I'd like to rule out that this module interferes with the 'yokozuna' module in any way.
>>>>>>
>>>>>> The log also contains an indexing failure:
>>>>>>
>>>>>> 2016-03-11 09:06:32.612 [error] <0.1215.0>@yz_kv:index_internal:237 failed to index object {{<<"fs_files">>,<<"o9hkcblhhunwm15ec5cicdmr6fgbz3id.files">>},<<"/products/BRAND/BRAND_108884.PNG">>} with error {"Failed to index docs",other,{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,205}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,293}]},{yz_kv,index_internal,5,[{file,"src/yz_kv.erl"},{line,224}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1619}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1607}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1398}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,558}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,346}]}]
>>>>>>
>>>>>> The root cause of this appears to be a timeout, which is a sign that the cluster was overloaded at that point in time. If this happens during re-indexing, any affected objects will be missing from the index on one or more Solr instances. This should be fixed in time by a subsequent AAE repair, but will obviously contribute to inconsistent indexes.
>>>>>>
>>>>>> Other signs of overload can be seen when searches time out, too:
>>>>>>
>>>>>> ./riak@10.0.1.2-riak-debug/logs/platform_log_dir/console.log.2:2016-03-11 09:06:32.612 [info] <0.10458.1086>@yz_pb_search:maybe_process:106 "Failed to search" "http://localhost:8093/internal_solr/files_index/select" {error,req_timedout}
>>>>>>
>>>>>> If you observe regular timeouts (or very long search-related latencies, for that matter), you should either throttle the client load on the cluster or add more nodes to spread the load. Please also keep an eye on IO waits, and provide better-provisioned disks if these are constantly high. We find disk IO to be the root cause of many performance issues.
>>>>>>
>>>>>> Finally, I'd like to suggest that the minimal schema used should have at least the catch-all dynamic field definition in addition to the '_yz*' fields. Please consider adding
>>>>>>
>>>>>> <!-- catch-all field -->
>>>>>> <dynamicField name="*" type="ignored" />
>>>>>>
>>>>>> before the '_yz*' fields.
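(An inline note on how I plan to apply this suggestion: a sketch with the Erlang client, assuming the amended schema is saved locally as minimal_schema.xml and the index is being created fresh, e.g. after deleting it as discussed above; the file, schema and index names are placeholders. As far as I understand, documents already in Solr are not re-indexed automatically after a schema change, so this only helps together with re-indexing.)

%% Sketch: upload the amended schema (now containing the catch-all
%% dynamicField) and create the index against it.
{ok, SchemaXml} = file:read_file("minimal_schema.xml").

{ok, Pid} = riakc_pb_socket:start_link("10.0.1.1", 8087).

%% Store (or overwrite) the schema under its name.
ok = riakc_pb_socket:create_search_schema(Pid, <<"minimal_schema">>, SchemaXml).

%% Create the index pointing at that schema. This assumes the index is
%% being created fresh; as far as I know an updated schema does not
%% re-apply to documents that are already indexed.
ok = riakc_pb_socket:create_search_index(Pid, <<"chunks_index">>, <<"minimal_schema">>, []).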
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> Magnus
>>>>>>
>>>>>> On 14 March 2016 at 15:47, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Magnus,
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Here are the files:
>>>>>>>
>>>>>>> https://www.dropbox.com/s/rjide53i9wnzc03/riak%4010.0.1.1-riak-debug.tar.gz?dl=0
>>>>>>> https://www.dropbox.com/s/zdpbkp8j210qxut/riak%4010.0.1.2-riak-debug.tar.gz?dl=0
>>>>>>> https://www.dropbox.com/s/e1kaaozd7s0oxz1/riak%4010.0.1.3-riak-debug.tar.gz?dl=0
>>>>>>> https://www.dropbox.com/s/3rwliv2dgpvre1o/riak%4010.0.1.4-riak-debug.tar.gz?dl=0
>>>>>>> https://www.dropbox.com/s/ptwq7gobo73d802/riak%4010.0.1.5-riak-debug.tar.gz?dl=0
>>>>>>>
>>>>>>> It's a busy cluster, so I can explain any additional settings we have.
>>>>>>>
>>>>>>> I just want to add that I also get inconsistent results with a search index that only has the _yz_* fields (unlike the search index that first raised this problem, which had Solr indexing issues).
>>>>>>>
>>>>>>> On 14 March 2016 at 17:20, Magnus Kessler <mkess...@basho.com> wrote:
>>>>>>>
>>>>>>>> Hi Oleksiy,
>>>>>>>>
>>>>>>>> Would you mind sharing the output of 'riak-debug' from all nodes? You can upload the files to a location of your choice and PM me the details. As far as we are aware, we have fixed all previously existing issues that would prevent a full YZ AAE tree rebuild from succeeding when non-indexable data was present. However, the logs may still contain hints that may help us identify the root cause of your issue.
>>>>>>>>
>>>>>>>> Many Thanks,
>>>>>>>>
>>>>>>>> Magnus
>>>>>>>>
>>>>>>>> On 14 March 2016 at 09:45, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I would like to continue, as this seems to me like a serious problem: on a bucket with 700,000 keys the difference in num_found can be up to 200,000! And that's a search index that doesn't index, analyse or store ANY of the document fields; the schema has only the required _yz_* fields and nothing else.
>>>>>>>>>
>>>>>>>>> I have tried deleting the search index (with a PBC call) and tried expiring the AAE trees. Nothing helps. I can't get consistent search results from Yokozuna.
>>>>>>>>>
>>>>>>>>> Please help.
>>>>>>>>>
>>>>>>>>> On 11 March 2016 at 18:18, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Fred,
>>>>>>>>>>
>>>>>>>>>> This is a production environment, but I can delete the index. However, this index covers ~3500 buckets and there are probably 10,000,000 keys.
>>>>>>>>>>
>>>>>>>>>> The index was created after the buckets. The schema for the index is just the basic required fields (_yz_*) and nothing else.
>>>>>>>>>>
>>>>>>>>>> Yes, I'm willing to resolve this. When you say to delete chunks_index, do you mean the simple RpbYokozunaIndexDeleteReq, or is something else required?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On 11 March 2016 at 17:08, Fred Dushin <fdus...@basho.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Oleksiy,
>>>>>>>>>>>
>>>>>>>>>>> This is definitely pointing to an issue either in the coverage plan (which determines the distributed query you are seeing) or in the data you have in Solr.
>>>>>>>>>>> I am wondering if it is possible that you have some data in Solr that is causing the rebuild of the YZ AAE tree to incorrectly represent what is actually stored in Solr.
>>>>>>>>>>>
>>>>>>>>>>> What you did was to manually expire the YZ (Riak Search) AAE trees, which caused them to rebuild from the entropy data stored in Solr. Another thing we could try (if you are willing) would be to delete the 'chunks_index' data in Solr (as well as the Yokozuna AAE data), and then let AAE repair the missing data. What Riak will essentially do is compare the KV hash trees with the YZ hash trees (which will be empty), see what is missing in Solr, and add it to Solr as a result. This would effectively result in re-indexing all of your data, but we are only talking about ~30k entries (times 3, presumably, if your n_val is 3), so that shouldn't take too much time, I wouldn't think. There is even some configuration you can use to accelerate this process, if necessary.
>>>>>>>>>>>
>>>>>>>>>>> Is that something you would be willing to try? It would result in downtime on queries. Is this production data or a test environment?
>>>>>>>>>>>
>>>>>>>>>>> -Fred
>>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Magnus Kessler
>>>>>>>> Client Services Engineer
>>>>>>>> Basho Technologies Limited
>>>>>>>>
>>>>>>>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>>>>>>
>>>>>> --
>>>>>> Magnus Kessler
>>>>>> Client Services Engineer
>>>>>> Basho Technologies Limited
>>>>>>
>>>>>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>>>
>>> --
>>> Magnus Kessler
>>> Client Services Engineer
>>> Basho Technologies Limited
>>>
>>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
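PS: for anyone following along, this is how I have been expiring the Yokozuna AAE trees so far, from `riak attach`; a sketch, assuming yz_entropy_mgr:expire_trees/0 is the right call in this release, with the second line just an optional way to run it on every cluster member at once (5 second RPC timeout).

%% From `riak attach` on one node: mark the Yokozuna (Riak Search) AAE
%% trees as expired so they are rebuilt and compared against the KV
%% trees on the next exchange.
yz_entropy_mgr:expire_trees().

%% Optional: run the same call on every member of the cluster.
riak_core_util:rpc_every_member_ann(yz_entropy_mgr, expire_trees, [], 5000).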