Based on the problem description, it seems you might have a workaround: submit collection admin commands to the Overseer node instead, at least during the upgrade. Also... I wonder if upgrading to 9.0 instead of 9.6 might help. Not that I know of anything specific that's incompatible, but I could imagine changes across the 9.x line that 8.x can't accept on the Overseer queue, where sender and receiver must work together indirectly via ZK; it's a kind of protocol, in a sense. Again, this is theoretical.
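If the Overseer queue really does behave like a protocol between versions, one cautious option is to hold back collection admin commands while the cluster is running mixed major versions. Below is a minimal sketch of such a guard, not anything Solr ships: the class `MixedVersionGuard` and its methods are hypothetical names, and how you collect each node's version string is left out and assumed to happen elsewhere.

```java
import java.util.List;

// Hypothetical helper (not part of Solr or SolrJ): decides whether it looks
// safe to submit collection admin commands, given the version string of each
// live node. Gathering those strings is assumed to be done by the caller.
public class MixedVersionGuard {

    // Extracts the major version, e.g. "9.6.1" -> 9.
    public static int majorOf(String version) {
        int dot = version.indexOf('.');
        String major = dot < 0 ? version : version.substring(0, dot);
        return Integer.parseInt(major.trim());
    }

    // True when the nodes do not all share one major version,
    // i.e. a rolling major upgrade is still in progress.
    public static boolean isMixedMajor(List<String> nodeVersions) {
        return nodeVersions.stream()
                .map(MixedVersionGuard::majorOf)
                .distinct()
                .count() > 1;
    }

    // Conservative policy: only submit when at least one node is known
    // and every node reports the same major version.
    public static boolean safeToSubmitCollectionAdmin(List<String> nodeVersions) {
        return !nodeVersions.isEmpty() && !isMixedMajor(nodeVersions);
    }

    public static void main(String[] args) {
        List<String> midUpgrade = List.of("8.11.3", "8.11.3", "9.6.1");
        List<String> upgraded = List.of("9.6.1", "9.6.1", "9.6.1");
        System.out.println("mid-upgrade safe? " + safeToSubmitCollectionAdmin(midUpgrade));
        System.out.println("upgraded safe? " + safeToSubmitCollectionAdmin(upgraded));
    }
}
```

A caller would check `safeToSubmitCollectionAdmin(...)` before sending something like a create-collection request, and queue or retry the request on its own side while the check fails.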
Sadly, Solr upgrade compatibility is not something the project has an automated test for, nor even a manual script to follow. In the age of Docker, this shouldn't be hard. It's a gap for sure. FWIW we "care" about it... we think about it and kind of insist on it in terms of standards / acceptance criteria but without a test... it's a "best effort".

On Wed, Oct 16, 2024 at 2:38 PM Patrick Lok <patrick....@salesforce.com.invalid> wrote:

> Hi Jan,
>
> Thank you so much for responding. Really appreciate it.
>
> I thought that's a problem with Solr 8.5 or older. We have migrated to Solr
> 8.11.3 and removed the use of the useUnsafeOverseerResponse flag. And from
> the other error message ("consider as bad message and poll out from the
> queue") that I'm seeing, it looks like the overseer is actually able to
> deserialize the message, but it's hitting a KeeperException?
>
> Thanks,
> Patrick
>
>
> On Wed, Oct 16, 2024 at 12:41 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi
> >
> > I believe that the objects on the Overseer queue are serialized java
> > objects and so you cannot create collections while in the middle of a
> > major upgrade.
> > I'd pause such cluster events during the rolling upgrade so that the
> > Overseer queues are empty once the overseer node is upgraded.
> >
> > Jan
> >
> > > On 16 Oct 2024 at 04:31, Patrick Lok <patrick....@salesforce.com.INVALID> wrote:
> > >
> > > Here's the request we are sending over the wire to Solr 9
> > >
> > > "class":"org.apache.solr.client.solrj.request.CollectionAdminRequest$Create",
> > > "method":"GET",
> > > "params.action":"CREATE",
> > > "params.name":"ftest-collection_1.2",
> > > "params.collection.configName":"test-collection",
> > > "params.createNodeSet":"EMPTY",
> > > "params.numShards":"2",
> > > "params.router.name":"compositeId",
> > > "params.nrtReplicas":"1",
> > > "params.autoAddReplicas":"false"}
> > >
> > >
> > > On Tue, Oct 15, 2024 at 7:20 PM Patrick Lok <patrick....@salesforce.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm new to Solr and I'm tasked to upgrade our Solr 8.11.3 installation to
> > >> Solr 9.6.1.
> > >>
> > >> I'm running into some trouble with the create collection command when it's
> > >> sent to a Solr 9.6.1 node with Solr 8.11.3 running as overseers.
> > >>
> > >> The command in Java is
> > >> CollectionAdminRequest.createCollection(collectionName, configName, numShards, 0)
> > >>     .setAutoAddReplicas(false)
> > >>     .setRouterName("compositeId")
> > >>     .setCreateNodeSet("EMPTY")
> > >>     .setReplicationFactor(1);
> > >>
> > >> And the error that I see on the overseer can be either of the one below. I
> > >> guess it depends on if the collection has been created (but deleted) before
> > >> or not.
> > >>
> > >> If the collection has been created before but deleted. I'll see in the
> > >> overseer (Solr 8) log
> > >>
> > >> 01:42:43.927 ERROR (OverseerThreadFactory-25-t...:8983_solr) [ ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: test-collection_1.2 operation: create failed
> > >> org.apache.solr.common.SolrException: Could not fully create collection: test-collection_1.2
> > >>     at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:218) ~[?:?]
> > >>     at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:271) ~[?:?]
> > >>     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:524) ~[?:?]
> > >>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
> > >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> > >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> > >>     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> > >>
> > >>
> > >> But if the collection has never been created before, then I see in the
> > >> overseer log
> > >>
> > >> 01:42:14.439 INFO (OverseerThreadFactory-25-thread-..._solr) [ ] o.a.s.c.a.c.CreateCollectionCmd Create collection test1-collection_1.2
> > >> 01:42:14.442 INFO (OverseerCollectionConfigSetProcessor-...) [ ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000707821 doesn't exist. Requestor may have disconnected from ZooKeeper
> > >> 01:42:14.469 ERROR (OverseerStateUpdate-3026498...) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /clusterstate.json
> > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) ~[?:?]
> > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
> > >>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2561) ~[?:?]
> > >>     at org.apache.solr.common.cloud.SolrZkClient.lambda$setData$7(SolrZkClient.java:361) ~[?:?]
> > >>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:79) ~[?:?]
> > >>     at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:361) ~[?:?]
> > >>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:291) ~[?:?]
> > >>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:217) ~[?:?]
> > >>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:173) ~[?:?]
> > >>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:341) ~[?:?]
> > >>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:271) ~[?:?]
> > >>     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> > >> 01:42:14.490 WARN (OverseerStateUpdate-3026498...) [ ] o.a.s.c.Overseer Exception when process message = {
> > >>   "replicationFactor":1,
> > >>   "fromApi":"true",
> > >>   "collection.configName":"test1-collection",
> > >>   "router.name":"compositeId",
> > >>   "createNodeSet":"EMPTY",
> > >>   "waitForFinalState":null,
> > >>   "pullReplicas":null,
> > >>   "async":"70e3b8e7-9ee1-468d-96f6-470900c4edbb",
> > >>   "router.field":null,
> > >>   "name":"test1-collection_1.2",
> > >>   "nrtReplicas":1,
> > >>   "numShards":2,
> > >>   "tlogReplicas":null,
> > >>   "alias":null,
> > >>   "operation":"create",
> > >>   "perReplicaState":null}, consider as bad message and poll out from the queue
> > >>
> > >>
> > >> Is there a known incompatibility issue between Solr 9 (data node) and Solr
> > >> 8 (overseer node) with CollectionAdminRequest.createCollection? This is
> > >> what we have been doing for a long time and works with both data and
> > >> overseer nodes are running Solr 8. Is there a way to get around this issue?
> > >>
> > >> Thanks,
> > >> Patrick