aloosnetmatch opened a new issue, #682: URL: https://github.com/apache/solr-operator/issues/682
I installed the Solr Operator 0.8.0 with Solr image 9.4.1 on AKS, following the guideline in this video: [Rethinking Autoscaling for Apache Solr using Kubernetes - Berlin Buzzwords 2023](https://youtu.be/HfHa4Q4YaTU?si=cCiadyOmjlo86sVF). The setup uses persistent disks.

I created 2 indexes and put some data in them:
- index `test`: 3 shards and 2 replicas
- index `test2`: 6 shards and 2 replicas

I configured an HPA and stressed the cluster a bit to make sure it would scale up from 5 to 11 nodes. Scaling up went fine: shards for the 2 indexes got moved to the new nodes. During scaling down, however, some shards get a lot of "down" replicas. The HPA said it would scale down to 5 pods, but 6 kept running. The logs of course reveal:

```
2024-01-31 10:32:57.332 ERROR (recoveryExecutor-10-thread-16-processing-test2_shard4_replica_n113 netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain-7376 move-replicas-solr-cluster-netm-solrcloud-6162936163817514 core_node114 create netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test2 shard4) [c:test2 s:shard4 r:core_node114 x:test2_shard4_replica_n113 t:netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain-7376] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-5.ing.local.domain:80/solr on recovery, try again
2024-01-31 10:32:57.472 ERROR (recoveryExecutor-10-thread-11-processing-netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test_shard3_replica_n75 test shard3 core_node76) [c:test s:shard3 r:core_node76 x:test_shard3_replica_n75 t:] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-6.ing.local.domain:80/solr on recovery, try again
2024-01-31 10:32:57.472 ERROR (recoveryExecutor-10-thread-13-processing-netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test_shard3_replica_n87 test shard3 core_node88) [c:test s:shard3 r:core_node88 x:test_shard3_replica_n87 t:] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-6.ing.local.domain:80/solr on recovery, try again
```

In the overseer there are still items in the work queue.
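For reference, this is roughly how I checked the desired pod count, the replica states, and the overseer queue after the scale-down (pod names are from this setup; the port-forward to Solr's default port 8983 and the exact commands are just a sketch):

```bash
# HPA target vs. actual pod count (the HPA wanted 5, 6 kept running)
kubectl get hpa
kubectl get pods | grep solrcloud-

# Forward Solr's default port from one of the remaining pods
kubectl port-forward solr-cluster-netm-solrcloud-1 8983:8983 &

# Count replicas reported as "down" in the cluster state
curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS" \
  | grep -o '"state":"down"' | wc -l

# Overseer status, including queue statistics
curl -s "http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS"
```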
On the disk for the given shards, I can see the folders of the shards:

```
solr@solr-cluster-netm-solrcloud-1:/var/solr/data$ ls -l
total 108
drwxrws--- 2 root solr 16384 Jan 30 10:37 lost+found
-rw-r-xr-- 1 root solr  1203 Jan 30 13:56 solr.xml
drwxrwsr-x 3 solr solr  4096 Jan 30 12:59 test2_shard1_replica_n12
drwxrwsr-x 3 solr solr  4096 Jan 30 12:59 test2_shard3_replica_n2
drwxr-sr-x 3 solr solr  4096 Jan 31 01:30 test2_shard4_replica_n101
drwxr-sr-x 3 solr solr  4096 Jan 31 03:31 test2_shard4_replica_n113
drwxr-sr-x 3 solr solr  4096 Jan 31 05:32 test2_shard4_replica_n125
drwxr-sr-x 3 solr solr  4096 Jan 31 07:33 test2_shard4_replica_n137
drwxr-sr-x 3 solr solr  4096 Jan 31 09:34 test2_shard4_replica_n149
drwxr-sr-x 3 solr solr  4096 Jan 30 15:24 test2_shard4_replica_n41
drwxr-sr-x 3 solr solr  4096 Jan 30 17:25 test2_shard4_replica_n53
drwxr-sr-x 3 solr solr  4096 Jan 30 19:26 test2_shard4_replica_n65
drwxr-sr-x 3 solr solr  4096 Jan 30 21:28 test2_shard4_replica_n77
drwxr-sr-x 3 solr solr  4096 Jan 30 23:29 test2_shard4_replica_n89
drwxr-sr-x 3 solr solr  4096 Jan 31 04:31 test_shard3_replica_n111
drwxr-sr-x 3 solr solr  4096 Jan 31 06:32 test_shard3_replica_n123
drwxr-sr-x 3 solr solr  4096 Jan 31 08:33 test_shard3_replica_n135
drwxr-sr-x 3 solr solr  4096 Jan 30 16:25 test_shard3_replica_n39
drwxr-sr-x 3 solr solr  4096 Jan 30 18:26 test_shard3_replica_n51
drwxrwsr-x 3 solr solr  4096 Jan 30 11:18 test_shard3_replica_n6
drwxr-sr-x 3 solr solr  4096 Jan 30 20:27 test_shard3_replica_n63
drwxr-sr-x 3 solr solr  4096 Jan 30 22:28 test_shard3_replica_n75
drwxr-sr-x 3 solr solr  4096 Jan 31 00:29 test_shard3_replica_n87
drwxr-sr-x 3 solr solr  4096 Jan 31 02:30 test_shard3_replica_n99
solr@solr-cluster-netm-solrcloud-1:/var/solr/data$
```

They all seemed empty, though, so I suspect something is going wrong in the scale-down/scale-up migration of the shards. Every pod gets restarted during the scale-down.

What could cause the number of down replicas to be so huge?

PS: I did the same test on a Kind cluster, with the same results.
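For completeness, this is roughly how I compared the core directories left on the persistent disk with the cores the node itself has registered (pod name from this setup; the default Solr port 8983 and curl being available inside the container are assumptions):

```bash
# Core directories left behind on the persistent volume of pod 1
kubectl exec solr-cluster-netm-solrcloud-1 -- ls /var/solr/data

# Cores this node actually reports, to compare against the directories above
# (if curl is not present in the image, port-forward and query from outside instead)
kubectl exec solr-cluster-netm-solrcloud-1 -- \
  curl -s "http://localhost:8983/solr/admin/cores?action=STATUS&indexInfo=false"
```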