Hello everyone,

I am writing in the hope of getting an answer to the message below. We are still struggling with this problem and have not found a solution.
Thanks in advance,
Marco

-----Original Message-----
From: Matteo Diarena <m.diar...@volocom.it>
Sent: Monday, 5 September 2022 11:02
To: users@solr.apache.org
Subject: Re: SolrCloud node fail to connect to another node in the cluster

Sorry, my fault. Let me rewrite my email without images.

I’m experiencing a strange behaviour with a SolrCloud cluster.

Cluster description

I have a cluster with a total of 38 nodes. All nodes are installed with the following:
- OS: Debian GNU/Linux 9.13 (stretch)
- JRE: openjdk version "11.0.6" 2020-01-14
- Apache Solr: Apache Solr 8.11.2

The cluster nodes are divided as follows:

Nodes used for indexing
solrindex-01
solrindex-02

Nodes used for queries
solrquery-01
solrquery-02

Cluster nodes with collections
solrnode-01
…
solrnode-34

Configuration of the collection

In the cluster I have a collection (e.g. testcollection) distributed across the nodes in several shards (one shard per month, e.g. shard_202201, shard_202202, ...).

Problem

From time to time the solrquery-01 node is no longer able to query the entire collection; in particular, it is unable to contact some replicas of the collection hosted on other nodes of the cluster. The problem does not resolve itself: the Apache Solr service on the solrquery-01 node has to be restarted.

In particular:

If I try to query a specific replica from the solrquery-01 node, the request remains pending until it times out.

Query

http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/

Response

{
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]},
  "debug":{
    "track":{
      "rid":"solrquery-01.volo.local-232528",
      "EXECUTE_QUERY":{
        "http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/":{
          "Exception":"Timeout occured while waiting response from server at: http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/select"}}}}
}

Executing the same query from another node (e.g. solrnode-01) succeeds.

Query

http://solrnode-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/

Response

{
  "response":{"numFound":0,"start":0,"maxScore":0.0,"numFoundExact":true,"docs":[]},
  "debug":{
    "track":{
      "rid":"solrnode-01.volo.local-1849853",
      "EXECUTE_QUERY":{
        "http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/":{
          "QTime":"0",
          "ElapsedTime":"28",
          "RequestPurpose":"GET_TOP_IDS,SET_TERM_STATS",
          "NumFound":"0",
          "Response":"{responseHeader={zkConnected=true,status=0,QTime=0},response={numFound=0,numFoundExact=true,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}"}}}}
}

The query also succeeds if I run it from the solrquery-01 node against a different replica.

Query

http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/

Response

{
  "response":{"numFound":0,"start":0,"maxScore":0.0,"numFoundExact":true,"docs":[]},
  "debug":{
    "track":{
      "rid":"solrquery-01.volo.local-232531",
      "EXECUTE_QUERY":{
        "http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/":{
          "QTime":"0",
          "ElapsedTime":"88",
          "RequestPurpose":"GET_TOP_IDS,SET_TERM_STATS",
          "NumFound":"0",
          "Response":"{responseHeader={zkConnected=true,status=0,QTime=0},response={numFound=0,numFoundExact=true,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}"}}}}
}

Checking the network traffic with tcpdump shows no connection at all on the solrquery-01 machine, whereas the connection is visible on the solrnode-01 machine.

tcpdump from the solrquery-01 machine

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes

tcpdump on the solrnode-01 machine

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
10:57:10.979736 IP solrnode-01.volo.local.39888 > solrnode-24.volo.local.http-alt: Flags [P.], seq 881884455:881885148, ack 1974049136, win 364, options [nop,nop,TS val 561210041 ecr 561833498], length 693: HTTP
10:57:11.008007 IP solrnode-01.volo.local.39888 > solrnode-24.volo.local.http-alt: Flags [.], ack 132, win 364, options [nop,nop,TS val 561210048 ecr 561835614], length 0

Question

Do you have any suggestions on how to investigate this issue further? Any suggestions on possible solutions?

Thank you in advance,
Matteo

Matteo Diarena
Direttore Innovazione
Volocom s.r.l.
(www.volocom.it - volo...@pec.it)
Via Antonio Cechov, 50 - 20151 MILANO
Via Leone XIII, 95 - 00165 ROMA
Tel +39 02 89453024 / +39 02 89453023
Mobile +39 345 2129244
m.diar...@volocom.it

-----Original Message-----
From: Vincenzo D'Amore <v.dam...@gmail.com>
Sent: 05 September 2022 00:34
To: users@solr.apache.org
Subject: Re: SolrCloud node fail to connect to another node in the cluster

Hi Matteo,

FYI, the images have been removed from your email. The mailing list ate them. You'll need to give us text, not an image.

On Thu, 1 Sep 2022 at 16:35, Matteo Diarena <m.diar...@volocom.it> wrote:

> Dear all,
>
> I’m experiencing a strange behaviour with a SolrCloud cluster.
>
> *Cluster description*
>
> I have a cluster with a total of 38 nodes. All nodes are installed
> with the following features:
>
> - *OS*: Debian GNU/Linux 9.13 (stretch)
> - JRE: openjdk version "11.0.6" 2020-01-14
> - Apache Solr: Apache Solr 8.11.2
>
> The cluster nodes are divided as follows:
>
> *Nodes used for indexing*
> solrindex-01
> solrindex-02
>
> *Nodes used for queries*
> solrquery-01
> solrquery-02
>
> *Cluster nodes with collections*
> solrnode-01
> …
> solrnode-34
>
> *Configuration of the collection*
>
> In the cluster I have a collection (i.e testcollection) divided on the
> various nodes through different shards (one shard for each month, i.e.
> shard_202201, shard_202202, ...)
>
> *Problem*
>
> From time to time the solrquery-01 node is no longer able to query the
> entire collection and in particular it is unable to contact some
> replicas of the collection present on the other nodes of the cluster.
> The problem does not resolve itself but it is necessary to restart the
> Apache Solr service on the solrquery-01 node.
>
> In particular:
>
> If I try to query a specific replica from the solrquery-01 node, the
> request remains pending until it times out
>
> Query
>
> http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
>
> Response
>
> By executing the same query from another node (eg: solrnode-01) the
> query is successful.
>
> Query
>
> http://solrnode-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
>
> Response:
>
> The same happens if I try to run the query to a different replica
>
> Query
>
> http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/
>
> Response
>
> Checking the network traffic with tcpdump on the solrquery-01 machine
> does not show any connection as it does on the solrnode-01 machine
>
> *tcpdump from the solrquery-01 machine*
>
> *tcpdump on the solrnode-01 machine*
>
> *Question*
>
> Do you have any suggestions on how to investigate this issue further?
> Suggestions on possible solutions?
>
> Thank you in advance,
>
> Matteo

--
Vincenzo D'Amore
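
One way to narrow down which replicas a node cannot reach is to repeat the single-replica probe from the message above (a /select request with debug=track and shards set to one replica URL) for each replica, then collect the entries under debug.track that carry an "Exception". The following is only a minimal sketch of that idea using the Python standard library. The function names (build_probe_url, failed_shards, probe), the default port 8080 taken from the URLs in this thread, and the 60-second client timeout are assumptions for illustration, not part of Solr or of the original setup.

```python
import json
import urllib.parse
import urllib.request


def build_probe_url(query_node, collection, replica_url, q, port=8080):
    """Build a /select URL asking query_node to query only replica_url."""
    # urlencode percent-encodes the values; Solr decodes them on receipt.
    params = urllib.parse.urlencode(
        {"q": q, "debug": "track", "shards": replica_url}
    )
    return f"http://{query_node}:{port}/solr/{collection}/select?{params}"


def failed_shards(response):
    """Return {shard_url: exception_text} for shard requests that failed.

    Walks every phase under debug.track (for example EXECUTE_QUERY) and
    keeps the shard entries that contain an "Exception" key.
    """
    failures = {}
    track = response.get("debug", {}).get("track", {})
    for shards in track.values():
        if not isinstance(shards, dict):
            continue  # "rid" is a plain string, not a phase
        for shard_url, info in shards.items():
            if isinstance(info, dict) and "Exception" in info:
                failures[shard_url] = info["Exception"]
    return failures


def probe(query_node, collection, replica_url, q, timeout=60):
    """Send the probe query and return the failing shard requests.

    The timeout is a client-side assumption; it should be longer than the
    node's own shard timeout so the debug output is returned first.
    """
    url = build_probe_url(query_node, collection, replica_url, q)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return failed_shards(json.load(resp))
```

Calling probe once per replica URL from both solrquery-01 and a healthy node such as solrnode-01, with the same q, would show whether the timeouts are limited to specific target nodes. If the node answers with an HTTP error status instead of a JSON body, urlopen raises an exception, which this sketch does not handle.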