Multi-threaded indexing can speed things up. Use two threads per CPU to get maximum throughput. I wrote a simple Python program to do that.
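Roughly, that program looks like the sketch below. It is a minimal illustration, not the actual program: the update URL, collection name, field names, source directory, and batch size are placeholders, and it assumes the Python requests library and Solr's JSON update endpoint.

    import os
    import pathlib
    from concurrent.futures import ThreadPoolExecutor

    import requests

    # Placeholder endpoint: adjust host and collection name to your setup.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"
    BATCH_SIZE = 500  # documents per HTTP request

    def index_batch(paths):
        # One JSON update request per batch keeps HTTP overhead low.
        docs = [{"id": str(p), "text": p.read_text(errors="replace")} for p in paths]
        resp = requests.post(SOLR_UPDATE_URL, json=docs, timeout=120)
        resp.raise_for_status()

    def batches(items, size):
        for i in range(0, len(items), size):
            yield items[i:i + size]

    if __name__ == "__main__":
        files = list(pathlib.Path("/data/textfiles").rglob("*.txt"))  # placeholder root
        workers = 2 * (os.cpu_count() or 1)  # two indexing threads per CPU
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Iterating the results re-raises any exception from a worker.
            for _ in pool.map(index_batch, batches(files, BATCH_SIZE)):
                pass
        # A single commit at the end is far cheaper than committing per batch.
        requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[])

Threads work well here because the workers spend most of their time waiting on disk and HTTP, so the Python GIL is not the bottleneck.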
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 6, 2025, at 5:11 PM, Robi Petersen <robip...@gmail.com> wrote:
>
> Hi Bruno,
>
> As an aside, in general you'd want your staging (pre-prod) Solr instance to
> match your production Solr instance in every way possible (such as the
> Solr version).
>
> Another thought is to use several indexing machines, each pointing at a
> portion of those 200M text files, to speed up indexing the entire corpus.
>
> Cheers
> Robi
>
> On Sat, Apr 5, 2025 at 4:08 AM Bruno Mannina <bmann...@matheo-software.com> wrote:
>
>> Hi Colvin,
>>
>> Thanks for your answer and your link; I will see if I can solve my problem.
>>
>> I use an old Solr, I know :'(.
>> This old version has been in use for several years, and I have a huge set
>> of data (around 200M text files to index). Re-indexing my data would take
>> too much time (several weeks).
>>
>> It's a pre-production Solr (I use Solr 8.11.3 in production).
>> This pre-production instance is used to check data before loading it into
>> production.
>>
>> Cordialement, Best Regards
>> Bruno Mannina
>> www.matheo-software.com
>> www.patent-pulse.com
>> Mob. +33 0 634 421 817
>>
>> -----Original Message-----
>> From: Colvin Cowie [mailto:colvin.cowie....@gmail.com]
>> Sent: Friday, April 4, 2025 11:57
>> To: users@solr.apache.org
>> Subject: Re: Solr error...
>>
>> Hello,
>>
>> I think we might need some more context here, that is to say, why are you
>> using Solr 5.5.1? That was released in 2016 and is very much out of date
>> and unsupported (and will contain a number of critical CVEs).
>> So rather than trying to make it work, can you instead move to the latest
>> release (9.8.1)? A lot has changed in the last 9 years, so maybe consider
>> it a fresh start?
>>
>> By the sound of the error, the *file* is corrupt; that doesn't mean the
>> disk is corrupt. The reason it happened is probably not going to be
>> apparent, though if you go back through your logs you might identify the
>> cause.
>> A little googling of org.apache.lucene.index.CorruptIndexException
>> suggests that you may be able to "fix" the corrupt index (and lose the
>> corrupted documents in the process): https://stackoverflow.com/a/14934177
>> (see the CheckIndex sketch at the end of this thread).
>>
>> But either way, I would seriously recommend that you move to a supported
>> version and reindex your data from source.
>>
>> On Thu, 3 Apr 2025 at 23:58, Bruno Mannina <bmann...@matheo-software.com> wrote:
>>
>>> Hi All,
>>>
>>> On my new computer I have a Solr (5.5.1) collection with an error.
>>>
>>> My new computer is 1.5 years old (4 x 4 TB NVMe).
>>>
>>> I checked my disk and found no errors?!
>>>
>>> Do you know if I can do something to solve it?
>>>
>>> Many thanks for your help!
>>>
>>> The error message is:
>>>
>>> java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot complete commit
>>>         at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:2985)
>>>         at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2970)
>>>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2930)
>>>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:619)
>>>         at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1464)
>>>         at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1264)
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>         at java.util.concurrent.FutureTask.run(Unknown Source)
>>>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>         at java.lang.Thread.run(Unknown Source)
>>> Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=d0a2833f actual=64e63211
>>> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="C:\Users\Utilisateur\INDEX\FTCLAIMS\index\_8znd.cfs") [slice=_8znd.fdt]))
>>>         at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:334)
>>>         at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
>>>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.checkIntegrity(CompressingStoredFieldsReader.java:669)
>>>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:595)
>>>         at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:177)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:83)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4075)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>>>
>>> Cordialement, Best Regards
>>> Bruno Mannina
>>> www.matheo-software.com
>>> www.patent-pulse.com
>>> Mob. +33 0 634 421 817
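For anyone landing on this thread later: the repair route in the Stack Overflow answer Colvin linked above is Lucene's CheckIndex tool. Below is a minimal sketch of the invocation, not a tested recipe: it assumes the stock codec (so the lucene-core 5.5.1 jar alone on the classpath should suffice) and reuses the index path from the stack trace. Note that in Lucene 5.x the old -fix option is named -exorcise, and it permanently drops any segment it cannot read, so copy the index directory somewhere safe first.

    REM read-only check first, to see what is actually damaged
    java -cp lucene-core-5.5.1.jar org.apache.lucene.index.CheckIndex "C:\Users\Utilisateur\INDEX\FTCLAIMS\index"

    REM only after backing up the directory: drop the unreadable segment
    java -cp lucene-core-5.5.1.jar org.apache.lucene.index.CheckIndex "C:\Users\Utilisateur\INDEX\FTCLAIMS\index" -exorcise

Anything in the exorcised segment is gone for good, which is why reindexing from source on a supported version remains the better fix.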