Hi, I'm trying to find the right design for a SolrCloud cluster that is robust and responsive. I've been experimenting with different configurations.
I'll share details about the two configurations I'm comparing.

Cluster details: 1 collection, 4 shards, 2 replicas each, spread across 8 nodes, so 1 replica per node. Each node has 32 GB memory and 16 cores; heap size is 24 GB. Using Solr 7.6 with G1GC, as that gave better performance than CMS. The collection is small, ~8 GB overall (I know, very small for sharding, but our queries are extremely complex). The collection is sharded using the implicit router, keyed on city id.

The two configurations I'm trying are:
1. Send the query to a load balancer with _route_=city_id (or _route_=shardnum), which sends it to one of the 8 boxes. That box gets the result from the "owner" shard and returns it.
2. Send the query directly to one of the replicas of the owner shard.

I also add "shards.preference=replica.location:local" to queries in both configurations.

So, if I have nodes N1 and N2 holding S1R1 and S1R2, N3 and N4 holding S2R1 and S2R2, and so on, then a query for a city that lives in shard1 either goes to the LB, which may send it to, say, N5, which in turn queries either N1 or N2 and returns the results; or it goes directly to one of N1 or N2.
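To make the two request shapes concrete, here is a minimal sketch of what I mean. Hostnames like solr-lb and n1, the city id, and the query itself are placeholders, and I'm using Python's requests purely for illustration:

import requests

# Parameters shared by both configurations (the query is just a placeholder).
params = {
    "q": "*:*",
    "rows": 3000,
    "shards.preference": "replica.location:local",
}

# Configuration 1: go through the load balancer and let Solr route to the
# owning shard via the _route_ parameter ("city_123" is a placeholder).
lb_resp = requests.get(
    "http://solr-lb:8080/solr/collection_xx/select",
    params={**params, "_route_": "city_123"},
    timeout=5,
)

# Configuration 2: hit a node that hosts a replica of the owner shard directly
# (e.g. N1 for shard1), skipping the hop through a non-owner node.
direct_resp = requests.get(
    "http://n1:8080/solr/collection_xx/select",
    params=params,
    timeout=5,
)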
The response is pretty big: it fetches ~3000 documents and a large number of fields (~30). I'm measuring response times at an ingress Envoy on the Solr side.

Results: the difference in response times between the LB call and the direct call is ~20-25 ms. The direct call is significantly faster, at ~70 ms avg vs ~95 ms for the LB call.

Checking the logs, I noticed that when calling via the load balancer, there are queries like this in the logs:

2021-03-15 07:47:39.611 INFO (qtp731870416-2408) [c:collection_xx s:shard1 r:core_node5 x:collection_xx_shard1_replica_n2] o.a.s.c.S.Request [collection_xx_shard1_replica_n2] webapp=/solr path=/select params={vsort=ntp&facet.field=c_ids&facet.field=has_d_offer&facet.field=has_xx_synergy&facet.field=new_ccs_ids&facet.field=new_pd&facet.field=new_pd_dsz_7619&facet.field=new_res_flag&facet.field=primary_category_ids&df=name&distrib=false&aaaq_score_param=7619_6&fl=real_id,name,chain_id,lat,lon,city_id,d_name:display_name,image,new_ccs_ids,cfo:c_for_one,new:if(new_res_flag,1,0),new_on_xx:if(new_on_d,1,0),hygiene_rated:termfreq(new_pd,+100353),pure_veg:termfreq(new_pd,+100354),gold:termfreq(new_pd,+102236),pro:termfreq(new_pd,+166788),otof:termfreq(new_pd,+114445),hyperpure:termfreq(new_pd,+100355),exclusive:if(has_xx_synergy,1,0),has_offer:if(has_d_offer,1,0),trending:termfreq(new_pd_dsz_7619,+1),has_gourmet:termfreq(new_pd,+166253),ncw_offer:termfreq(new_pd,+168125),ncw_brand:termfreq(new_pd,+168176),hygiene_rating,c_ids,avg_commission_per_order,otr_value_um,otr_value_mm,otr_value_la,otr_value_default,primary_category_ids,compliance_level,cgen_embedding,asv,rating_aggregate:rating,votes,is_suspicious,{!key%3Dscore}$raw_score+&fl=id&shards.purpose=64&start=0&fq=(serviceable_cells:(4306215339680591232))+OR+{!geofilt+filter%3Dtrue+sfield%3Dlatlon_location_rpt+pt%3D13.498693941619575,70.84631715107243+d%3D7}&fq=+((d_pondy_5_1_start:[*+TO+2130]++++++AND+d_pondy_5_1_end:[2130+TO+*])+OR+(d_pondy_5_2_start:[*+TO+2130]++++++AND+d_pondy_5_2_end:[2130+TO+*])+OR+(d_pondy_5_3_start:[*+TO+2130]++++++AND+d_pondy_5_3_end:[2130+TO+*]))&fq=+has_online_order_flag:1&fq=+opening_soon_flag:false&fq=+status_id:(1+OR+13)&fq=+temp_closed_flag:false&raw_score=sum(product(def(conversion_score_dsz_v3_final_score_7619_6,+0),+1))&shard.url=http://a.b.c.d:8080/solr/collection_xx_shard1_replica_n2/|http://a.b.c.e:8080/solr/collection_xx_shard1_replica_n17/&rows=3000&version=2&facet.query=dish_score_9d20b49f8cf0c79ce7b44b2ef69f51df_2:[0+TO+*]&facet.limit=1000&q=(*)+AND+_val_:"+++++++++++++sum(+++++++++++++++++$vsort,+++++++++++++++++product(400,+0),+++++++++++++++++-200+++++++++++++)+++++++++"&NOW=1615794459480&ids=res_19338624,res_18645801,res_19258866,...[[------> Some ~2900 ids here <---------]] ....,res_18565614,res_19250515,res_19282565,res_18818033,res_18372078,res_19527362,res_18899444&isShard=true&facet.mincount=1&boosted=0&facet=false&wt=javabin} status=0 QTime=30

Notice the "Some ~2900 ids here" part.

Questions:
1. Is this some inter-node communication happening? Is this what is leading to the difference in response times?
2. If not, what else could be leading to the difference in response times?