Hi, I'm trying to use the S3Repository backup feature for the first time with Solr 9.7.0 in cloud mode on AWS, using a Docker image based on 23-jre-noble (same behavior on 22-jre-jammy). S3 backup/restore works for a small collection (4 MB), but it fails on my larger ones (36 GB, 96 GB, 400 GB).
Here are the commands that work to back up and restore the smaller `client` collection:

```
https://solr-dev.domain.com/solr/admin/collections?action=BACKUP&repository=s3&collection=client&name=2024-09-30&location=/
https://solr-dev.domain.com/solr/admin/collections?action=RESTORE&repository=s3&collection=client&name=2024-09-30&location=/
```

Collection sizes:

Works:
- client_s1r2: 4.3 MB

Fails:
- people_s1r2: 96.6 GB
- questions_s1r2: 36.1 GB

The difference is that, for the failing collections, the backup doesn't create `backup_0.properties` and `zk_backup_0/` in S3.

Good:

```
domain-solr-backups/2024-09-30/client/
    backup_0.properties 322B
    index/
    shard_backup_metadata/
    zk_backup_0/
```

Bad:

```
domain-solr-backups/2024-09-30/people/
    index/
    shard_backup_metadata/
```

Notice that in the bad case we are missing `backup_0.properties` and `zk_backup_0/` from the S3 listing. Trying to restore the `people` collection gives me:

```
"msg": "No backup.properties was found, the backup does not exist or not complete"
```

I can also somehow trigger an error where it tries to access the first element of an array and hits a bounds-check exception, but I don't have that message handy at the moment.

I also tried hitting the replication API endpoint, with different results:

```
https://solr-dev.domain.com/solr/questions/replication?command=backup&repository=s3&location=/&name=questions_backup-9-30
https://solr-dev.domain.com/solr/questions_v10/replication?command=restore&repository=s3&location=/&name=questions_backup-9-30
```

This produces a different S3 directory listing with 169 files, which appear to be segment names:

```
_179_1n.liv 758.3KB
_179.cfe 479B
_179.cfs 2.2GB
_179.si 391B
...
```

The restore command returns immediately but doesn't seem to restore anything. However, I think it eventually did restore my 36 GB `questions` collection: I checked a week later and it somehow had data. But trying to restore it again yesterday yielded no data after 24 hours.
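One thing I plan to try, in case the synchronous Collections API call is silently timing out on the big collections: the BACKUP action accepts an `async` request ID, and the result can then be polled with `action=REQUESTSTATUS`. A minimal sketch of that flow (the script only builds the URLs; the base URL, collection, and request ID are placeholders from my setup):

```python
import urllib.parse

BASE = "https://solr-dev.domain.com/solr"  # placeholder cluster URL


def collections_api_url(action, **params):
    """Build a Collections API URL like the BACKUP/RESTORE calls above."""
    query = urllib.parse.urlencode({"action": action, **params})
    return f"{BASE}/admin/collections?{query}"


# Kick off an async backup; Solr returns immediately and runs it in the
# background, so a proxy/client timeout can't kill the request mid-flight.
backup_url = collections_api_url(
    "BACKUP",
    repository="s3",
    collection="people",
    name="2024-09-30",
    location="/",
    **{"async": "people-backup-1"},  # 'async' is a Python keyword, hence the dict
)

# Poll this until the request leaves the running/submitted state.
status_url = collections_api_url("REQUESTSTATUS", requestid="people-backup-1")

print(backup_url)
print(status_url)
```

Fetching `status_url` in a loop until the reported state settles should at least tell me whether the backup actually completed or died before writing `backup_0.properties`.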
I don't think the replication handler is the right API endpoint for SolrCloud, but I'm confused about why it works at all and produces a different set of files. Is there simply a bug where the `backup.properties` file and the ZooKeeper directory are not getting created?

Overall I enjoy working with Solr. Thank you for your work!

-Doug

PS: I also hit some kind of reboot loop on one of my two SolrCloud machines, where heap usage was at 10% (96 GB of RAM available to Solr) and then it would crash with the error `OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000fffadb6a0000, 16384, 0) failed; error='Not enough space' (errno=12)`. Deleting the offending core fixed the loop. This might warrant a bug report, but I don't know how to reliably trigger it.

PPS: If Solr 9.7.0 gets such a speedup from Java 21 and higher per the release notes, why is the default Docker image based on JRE 17? "Apache Lucene upgraded to 9.11.1 introducing tremendous performance improvements when using Java 21 for vector search among other things. (SOLR-17325)" https://github.com/apache/solr-docker/blob/main/9.7/Dockerfile#L17
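Re the PS crash: a couple of quick checks I intend to run next time it happens. As I understand it (not certain this is the cause here), Lucene memory-maps every segment file, so a JVM `os::commit_memory ... errno=12` failure can come from the kernel's per-process mapping limit or from strict overcommit with no swap, long before the heap fills up:

```shell
# Diagnostics for the os::commit_memory ENOMEM (errno=12) crash.
cat /proc/sys/vm/max_map_count      # per-process mmap limit; the default 65530 can be low for large cores
cat /proc/sys/vm/overcommit_memory  # 2 = strict accounting; commits can fail with ENOMEM even with free RAM
free -m                             # no swap + strict overcommit makes commit failures much more likely
```

If `max_map_count` turns out to be at the default, raising it (e.g. via `sysctl -w vm.max_map_count=262144` on the Docker host) would be the first thing I'd test.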