Hi, I'm trying to find the right design for a SolrCloud cluster that is robust and responsive. I've been experimenting with different configurations.
I'll share details about the two configurations I'm comparing.

Cluster details: 1 collection, 4 shards, 2 replicas each, spread across 8 nodes, so 1 replica per node. Each node has 32 GB memory and 16 cores; heap size is 24 GB. Using Solr 7.6 with G1GC, as that gave better performance than CMS. The collection is small, ~8 GB overall (I know, very small for sharding, but our queries are extremely complex). The collection is sharded using the implicit router, keyed on city id.

The two configurations I'm trying are:
1. Send the query to a load balancer with _route_=city_id (or _route_=shardnum), which sends it to one of the 8 boxes. That box gets the result from the "owner" shard and returns it.
2. Send the query directly to one of the replicas of the owner shard.

I also add "shards.preference=replica.location:local" to queries in both configurations.

So, if I have nodes N1 and N2 holding S1R1 and S1R2, N3 and N4 holding S2R1 and S2R2, and so on, then a query for a city that lives in shard1 either goes to the LB, which may send it to, say, N5, which in turn queries either N1 or N2 and returns the results; or it goes directly to one of N1 or N2.
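To make the two request shapes concrete, here is a minimal sketch of what I mean. Hostnames like solr-lb and n1, the city id, and the query itself are placeholders, and I'm using Python's requests purely for illustration:

import requests

# Parameters shared by both configurations (the query is just a placeholder).
params = {
    "q": "*:*",
    "rows": 3000,
    "shards.preference": "replica.location:local",
}

# Configuration 1: go through the load balancer and let Solr route to the
# owning shard via the _route_ parameter ("city_123" is a placeholder).
lb_resp = requests.get(
    "http://solr-lb:8080/solr/collection_xx/select",
    params={**params, "_route_": "city_123"},
    timeout=5,
)

# Configuration 2: hit a node that hosts a replica of the owner shard directly
# (e.g. N1 for shard1), skipping the hop through a non-owner node.
direct_resp = requests.get(
    "http://n1:8080/solr/collection_xx/select",
    params=params,
    timeout=5,
)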
The response is pretty big: it fetches ~3000 documents and a large number of fields (~30). I'm measuring response times at an ingress Envoy on the Solr side.

Results: the difference in response times between the LB call and the direct call is ~20-25 ms. The direct call is significantly faster, at ~70 ms avg vs ~95 ms for the LB call.

Checking the logs, I noticed that when calling via the load balancer, there are queries like this in the logs:

2021-03-15 07:47:39.611 INFO (qtp731870416-2408) [c:collection_xx s:shard1 r:core_node5 x:collection_xx_shard1_replica_n2] o.a.s.c.S.Request [collection_xx_shard1_replica_n2] webapp=/solr path=/select params={vsort=ntp&facet.field=c_ids&facet.field=has_d_offer&facet.field=has_xx_synergy&facet.field=new_ccs_ids&facet.field=new_pd&facet.field=new_pd_dsz_7619&facet.field=new_res_flag&facet.field=primary_category_ids&df=name&distrib=false&aaaq_score_param=7619_6&fl=real_id,name,chain_id,lat,lon,city_id,d_name:display_name,image,new_ccs_ids,cfo:c_for_one,new:if(new_res_flag,1,0),new_on_xx:if(new_on_d,1,0),hygiene_rated:termfreq(new_pd,+100353),pure_veg:termfreq(new_pd,+100354),gold:termfreq(new_pd,+102236),pro:termfreq(new_pd,+166788),otof:termfreq(new_pd,+114445),hyperpure:termfreq(new_pd,+100355),exclusive:if(has_xx_synergy,1,0),has_offer:if(has_d_offer,1,0),trending:termfreq(new_pd_dsz_7619,+1),has_gourmet:termfreq(new_pd,+166253),ncw_offer:termfreq(new_pd,+168125),ncw_brand:termfreq(new_pd,+168176),hygiene_rating,c_ids,avg_commission_per_order,otr_value_um,otr_value_mm,otr_value_la,otr_value_default,primary_category_ids,compliance_level,cgen_embedding,asv,rating_aggregate:rating,votes,is_suspicious,{!key%3Dscore}$raw_score+&fl=id&shards.purpose=64&start=0&fq=(serviceable_cells:(4306215339680591232))+OR+{!geofilt+filter%3Dtrue+sfield%3Dlatlon_location_rpt+pt%3D13.498693941619575,70.84631715107243+d%3D7}&fq=+((d_pondy_5_1_start:[*+TO+2130]++++++AND+d_pondy_5_1_end:[2130+TO+*])+OR+(d_pondy_5_2_start:[*+TO+2130]++++++AND+d_pondy_5_2_end:[2130+TO+*])+OR+(d_pondy_5_3_start:[*+TO+2130]++++++AND+d_pondy_5_3_end:[2130+TO+*]))&fq=+has_online_order_flag:1&fq=+opening_soon_flag:false&fq=+status_id:(1+OR+13)&fq=+temp_closed_flag:false&raw_score=sum(product(def(conversion_score_dsz_v3_final_score_7619_6,+0),+1))&shard.url=http://a.b.c.d:8080/solr/collection_xx_shard1_replica_n2/|http://a.b.c.e:8080/solr/collection_xx_shard1_replica_n17/&rows=3000&version=2&facet.query=dish_score_9d20b49f8cf0c79ce7b44b2ef69f51df_2:[0+TO+*]&facet.limit=1000&q=(*)+AND+_val_:"+++++++++++++sum(+++++++++++++++++$vsort,+++++++++++++++++product(400,+0),+++++++++++++++++-200+++++++++++++)+++++++++"&NOW=1615794459480&ids=res_19338624,res_18645801,res_19258866,...[[------> Some ~2900 ids here <---------]] ....,res_18565614,res_19250515,res_19282565,res_18818033,res_18372078,res_19527362,res_18899444&isShard=true&facet.mincount=1&boosted=0&facet=false&wt=javabin} status=0 QTime=30

Notice the "Some ~2900 ids here" part.

Questions:
1. Is this some inter-node communication happening? Is this what is leading to the difference in response times?
2. If not, what else could be leading to the difference in response times?