Hi,

I'm trying to use the S3Repository backup feature for the first time with
Solr 9.7.0 (SolrCloud) on AWS, using a Docker image based on 23-jre-noble
(same behavior on 22-jre-jammy). S3 backup/restore works for small
collections (4MB) but fails on my larger ones (36GB, 96GB, 400GB).


Here are the commands that work for the smaller client collection to backup
and restore:

https://solr-dev.domain.com/solr/admin/collections?action=BACKUP&repository=s3&collection=client&name=2024-09-30&location=/

https://solr-dev.domain.com/solr/admin/collections?action=RESTORE&repository=s3&collection=client&name=2024-09-30&location=/
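
Since these calls run for a long time on the big collections, here's a
sketch of how one might drive the same requests asynchronously using the
Collections API's standard `async` and `REQUESTSTATUS` parameters. The
host, collection, and request-id values are just placeholders from my
setup, and the sample response below is only my understanding of the
REQUESTSTATUS payload shape, not captured output:

```python
import json
from urllib.parse import urlencode

BASE = "https://solr-dev.domain.com/solr/admin/collections"  # my dev host

def backup_url(collection, name, request_id, location="/", repository="s3"):
    """Build an async BACKUP call; with async= the API returns immediately."""
    params = {
        "action": "BACKUP",
        "repository": repository,
        "collection": collection,
        "name": name,
        "location": location,
        "async": request_id,
    }
    return f"{BASE}?{urlencode(params)}"

def status_url(request_id):
    """Build the REQUESTSTATUS call used to poll the async task."""
    return f"{BASE}?{urlencode({'action': 'REQUESTSTATUS', 'requestid': request_id})}"

def task_state(response_body):
    """Pull the task state ('running', 'completed', 'failed', ...) out of a
    REQUESTSTATUS JSON response body."""
    return json.loads(response_body)["status"]["state"]

# Abbreviated example of what a REQUESTSTATUS response looks like:
sample = ('{"responseHeader": {"status": 0},'
          ' "status": {"state": "completed",'
          ' "msg": "found [backup-client-1] in completed tasks"}}')
print(task_state(sample))  # completed
```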


Collection sizes:

Works:
client_s1r2:  4.3MB

Fails:
people_s1r2:  96.6GB
questions_s1r2:  36.1GB


The difference is that, for the failing collections, the backup doesn't
create `backup_0.properties` or `zk_backup_0/` in S3:

Good:
```
domain-solr-backups/2024-09-30/client/

backup_0.properties 322B
index/
shard_backup_metadata/
zk_backup_0/
```

Bad:
```
domain-solr-backups/2024-09-30/people/

index/
shard_backup_metadata/
```

Notice in the bad case above we are missing `backup_0.properties` and
`zk_backup_0/` from the S3 listing.
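
In case it's useful, this is the quick sanity check I'd script against an
S3 key listing to tell a complete backup from an incomplete one. The key
names are taken from the listings above; actually fetching the keys (e.g.
via boto3) is left out of the sketch:

```python
def backup_complete(keys, backup_id=0):
    """Given the object keys under a backup prefix, check for the pieces a
    restore needs: backup_<id>.properties and the zk_backup_<id>/ tree."""
    has_props = any(k.endswith(f"backup_{backup_id}.properties") for k in keys)
    has_zk = any(f"zk_backup_{backup_id}/" in k for k in keys)
    return has_props and has_zk

# Shapes of the "good" and "bad" listings from above (file names illustrative):
good = [
    "2024-09-30/client/backup_0.properties",
    "2024-09-30/client/index/_0.cfs",
    "2024-09-30/client/shard_backup_metadata/md_shard1_0.json",
    "2024-09-30/client/zk_backup_0/collection_state.json",
]
bad = [
    "2024-09-30/people/index/_0.cfs",
    "2024-09-30/people/shard_backup_metadata/md_shard1_0.json",
]
print(backup_complete(good), backup_complete(bad))  # True False
```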


Trying to restore the `people` collection gives me:

```
"msg": "No backup.properties was found, the backup does not exist or not
complete",
```

I can also somehow trigger an error where it tries to access the first
element of an array and hits an index-out-of-bounds exception, but I don't
have that message handy at the moment.


Now, I also tried hitting the replication API endpoint with different
results.

https://solr-dev.domain.com/solr/questions/replication?command=backup&repository=s3&location=/&name=questions_backup-9-30

https://solr-dev.domain.com/solr/questions_v10/replication?command=restore&repository=s3&location=/&name=questions_backup-9-30

It creates a different S3 directory listing with 169 files, whose names
look like Lucene segment files:

```
_179_1n.liv 758.3KB
_179.cfe 479B
_179.cfs 2.2GB
_179.si  391B
...
```

The restore command returns immediately but doesn't seem to restore
anything. However, I think it eventually did restore my 36GB `questions`
collection: when I checked a week later it had data somehow. But trying to
restore it again yesterday yielded no data after 24 hours. I don't think
this is the right API endpoint, but I'm confused why it works at all and
produces a different set of files.
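
Since that restore call returns immediately, if I'm reading the ref guide
right the replication handler's `restorestatus` command should report
whether a restore is still in progress. A sketch, again with my hostnames,
and parsing only a hand-written sample response rather than hitting a live
server:

```python
import json

def restore_status_url(base, core):
    """URL for the replication handler's restorestatus command."""
    return f"{base}/solr/{core}/replication?command=restorestatus&wt=json"

def restore_state(response_body):
    """Extract the status string from a restorestatus JSON response."""
    return json.loads(response_body)["restorestatus"]["status"]

# My guess at the response shape for a finished restore:
sample = ('{"responseHeader": {"status": 0},'
          ' "restorestatus": {"snapshotName": "snapshot.questions_backup-9-30",'
          ' "status": "success"}}')
print(restore_state(sample))  # success
```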

Is there just a bug where the `backup_0.properties` file and the zk
directory are not getting created?

Overall I enjoy working with Solr. Thank you for your work!

-Doug

PS: I also noticed a reboot loop on one of my two SolrCloud machines:
heap usage sat at 10% (96GB of RAM available to Solr) and then it would
crash with the error `OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x0000fffadb6a0000, 16384, 0) failed; error='Not enough
space' (errno=12)`. Deleting the offending core fixed the loop. This might
warrant a bug report, but I don't know how to reliably trigger it.

PPS: If Solr 9.7.0 gets such a speedup from Java 21 and higher, as per the
release notes, why is the default Docker image based on JRE 17?

"Apache Lucene upgraded to 9.11.1 introducing tremendous performance
improvements when using Java 21 for vector search among other things.
(SOLR-17325)"

https://github.com/apache/solr-docker/blob/main/9.7/Dockerfile#L17
