Hi everyone,

I'm running Solr 7.4.0 and have a collection on 4 nodes (2 shards, replication factor = 2). I'm experiencing an issue where random nodes crash when I submit large batches to be indexed (>500,000 documents). I've been able to keep things running if I keep an eye on it and restart nodes after they crash. Sometimes I end up with a replica that won't recover, which I fix by dropping the replica and re-adding it.

I've also had no crashes when I keep insert batches under 500,000 documents, so that's my workaround for now.
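For reference, a minimal sketch of the kind of chunked load I mean (not my real loader; it assumes the input is one JSON document per line, a collection named "mycollection" on localhost, and that jq is available):

# Minimal sketch, placeholders only: docs.jsonl holds one JSON document
# per line; each chunk is wrapped into a JSON array and posted to /update.
split -l 250000 docs.jsonl chunk_          # stay well under the ~500k mark
for f in chunk_*; do
  jq -s '.' "$f" | curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    'http://localhost:8983/solr/mycollection/update'
done
# single commit once everything is in
curl -s 'http://localhost:8983/solr/mycollection/update?commit=true'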
I'm wondering if anyone can help point me in the right direction for troubleshooting this issue, so that I can send upwards of 100 million documents at a time.

From the logs, I have the following errors:

(SolrException.java:148) - java.io.EOFException
org.apache.solr.update.ErrorReportingConcurrentUpdateSolrClient (StreamingSolrClients.java:147) - error

I did see this: https://solr.apache.org/guide/7_3/taking-solr-to-production.html#file-handles-and-processes-ulimit-settings

I'm running RHEL; does this look correctly configured?

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1544093
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

cat /proc/sys/fs/file-max
39208945

On my next attempt I was thinking of scheduling a job to log the output of cat /proc/sys/fs/file-nr every 5 minutes or so, to verify that this setting is not the issue. Any other ideas?

TIA,
Jon
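P.S. The monitoring job I have in mind is roughly this sketch (the log path, script location, and the way I find the Solr PID are placeholders for my environment):

#!/bin/bash
# Sketch of the 5-minute job: record system-wide and Solr-process open-file
# counts with a timestamp. Run as the solr user (or root) so /proc/<pid>/fd
# is readable.
LOG=/var/log/solr-file-handles.log
TS=$(date -Iseconds)
PID=$(pgrep -f start.jar | head -n 1)   # assumes the stock Jetty start.jar process
{
  echo "$TS file-nr: $(cat /proc/sys/fs/file-nr)"
  if [ -n "$PID" ]; then
    echo "$TS solr-fds: $(ls "/proc/$PID/fd" | wc -l)"
  fi
} >> "$LOG"

run from cron with something like:

*/5 * * * * /path/to/log-solr-file-handles.sh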