Hi,

I run a small 3-node test cluster with the Solr Operator and Solr 9.6.1. I have 
configured the affinity placement plugin as follows:

{
  "plugin": {
    ".placement-plugin": {
      "name": ".placement-plugin",
      "class": "org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory",
      "config": {"minimalFreeDiskGB":2,"prioritizedFreeDiskGB":100}
    }
  }
}

There is plenty of free disk and all three PODs are healthy.

Now I can create one or a few collections with 3 NRT replicas successfully. The 
affinity plugin makes sure that each replica is placed on a different POD (as 
opposed to the default, which is round-robin). Also, if one of the PODs is down, 
the plugin throws an error so the client can retry creating the collection once 
all three PODs are online.
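
For reference, the client-side retry I have in mind looks roughly like the sketch below. It is only an illustration: create_collection, live_node_count and NotEnoughNodesError are hypothetical names standing in for whatever the client actually uses, not Solr or SolrJ APIs.

```python
import time


class NotEnoughNodesError(Exception):
    """Hypothetical error the client raises when replica placement fails."""


def create_with_retry(create_collection, live_node_count,
                      required_nodes=3, attempts=5, delay_s=10.0):
    """Call create_collection() once enough nodes are live, retrying on failure.

    create_collection and live_node_count are caller-supplied callables
    (assumptions for this sketch, not part of any Solr client library).
    """
    for attempt in range(1, attempts + 1):
        # Only try to create the collection when every required node is online.
        if live_node_count() >= required_nodes:
            try:
                return create_collection()
            except NotEnoughNodesError:
                pass  # placement still failed; fall through and retry
        if attempt < attempts:
            time.sleep(delay_s)
    raise NotEnoughNodesError(
        f"gave up after {attempts} attempts waiting for {required_nodes} nodes"
    )
```

With the behaviour described next, this kind of retry never succeeds, because placement keeps failing even though all three nodes are live.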

However, after some time, creating another collection fails with the message 
"Not enough eligible nodes to place 3 replica(s) of type NRT for shard shard1 of 
collection foo", even though the cluster is healthy, with three nodes online and 
all three nodes listed in "live_nodes". The full stack trace is here: 
https://gist.github.com/janhoy/a50e48d93be6b849cbf0a6722a89ba21
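
To make the live_nodes check concrete, here is a small sketch of how it can be done against a parsed CLUSTERSTATUS response from the Collections API. The node names in the sample are made up, and I am assuming the usual response shape with a "cluster" object containing a "live_nodes" list. It only rules out nodes being down; it says nothing about why the placement plugin considers a node ineligible.

```python
def live_node_shortfall(cluster_status, replicas_needed):
    """Return how many live nodes are missing for one-replica-per-node placement.

    cluster_status is the parsed JSON of a CLUSTERSTATUS call (assumed shape:
    {"cluster": {"live_nodes": [...]}}). A result of 0 means enough nodes are live.
    """
    live_nodes = cluster_status.get("cluster", {}).get("live_nodes", [])
    return max(0, replicas_needed - len(live_nodes))


# Hypothetical sample response with three live nodes (node names are made up).
sample_status = {
    "cluster": {
        "live_nodes": [
            "solr-0.example:8983_solr",
            "solr-1.example:8983_solr",
            "solr-2.example:8983_solr",
        ]
    }
}

shortfall = live_node_shortfall(sample_status, replicas_needed=3)
print("missing live nodes:", shortfall)
```

In my case this kind of check reports no shortfall, yet placement still fails.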

Looks like the OrderedNodePlacementPlugin somehow believes that two nodes are 
down or otherwise not eligible. 

I have to restart/delete one or two PODs for it to work again. I first thought 
it would be enough to restart the overseer node, but the last time I tried, the 
error message only became worse: "Only able to place 0 replicas". One or two 
more restarts may make it work again, before it becomes locked once more.

Debug logging does not reveal much more.

I see a few similar test failures on the builds mailing list:

- BATS test "Affinity placement plugin using sysprop" fails three times in 2023
- PlacementPluginIntegrationTest fails three times in 2023 and once on June 1st

Anyone have any insight?

Jan
