Hi Bruno

OK, it sounds like you could at least optimize step 2. Couldn't you
make a little Java program which takes the raw data, builds a Solr doc
using the SolrJ lib, adds the missing values, and massages any existing
data as needed, then saves the whole Solr doc to another location and also
sends it to your pre-prod Solr instance for testing? Once it is tested and
verified, the complete Solr doc is easily picked up and sent into
production. That workflow might help you out. Just a thought...
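Roughly something like the sketch below. It's a minimal stdlib-only version
(no SolrJ on the classpath, so it compiles standalone); the field names,
enrichment rules, and host URLs are made-up placeholders, and with SolrJ
you'd build a SolrInputDocument and use a SolrClient instead of raw HTTP:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;

public class PreprocessAndIndex {

    // Step 2: build an enriched doc from the raw provider fields.
    // The field names ("id", "title_en", "lang") are hypothetical examples.
    static Map<String, String> buildDoc(Map<String, String> raw) {
        Map<String, String> doc = new LinkedHashMap<>(raw);
        doc.putIfAbsent("lang", "en");                      // fill a missing value
        doc.computeIfPresent("title_en", (k, v) -> v.trim()); // massage existing data
        return doc;
    }

    // Serialize the doc as a Solr JSON update payload: [ { ... } ]
    static String toJson(Map<String, String> doc) {
        StringBuilder sb = new StringBuilder("[{");
        boolean first = true;
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue().replace("\"", "\\\"")).append('"');
            first = false;
        }
        return sb.append("}]").toString();
    }

    // Steps 3/4: send the same payload to pre-prod, then to prod.
    static void send(String solrBaseUrl, String json) throws Exception {
        HttpRequest req = HttpRequest
                .newBuilder(URI.create(solrBaseUrl + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("id", "doc-1");
        raw.put("title_en", "  Some patent title ");
        String json = toJson(buildDoc(raw));
        System.out.println(json);
        // Save `json` to disk here, so the finished doc can be reused later:
        // send("http://preprod-host:8983/solr/mycollection", json);  // step 3
        // send("http://prod-host:8983/solr/mycollection", json);     // step 4
    }
}
```

The point being: the expensive preprocessing happens once, the resulting doc
is kept on disk, and the exact same payload goes to pre-prod and then prod.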

thx
robi

On Tue, Apr 8, 2025 at 1:26 AM Bruno Mannina <bmann...@matheo-software.com>
wrote:

> Hi Robi,
>
> In fact, it's a little bit more complex.
>
> I need to do several processing steps before sending data to production.
> Pre-prod is just there to test the data before sending it to prod for indexing.
>
> STEP 1 : I receive data from our provider ->
> STEP 2 : PreProcessing data by adding values (translation, fill missing
> data, convert, etc...) ->
> STEP 3 : test them (the format) in pre-prod ->
> STEP 4 : send them in production to index
>
> Also, re-indexing would take more time than I actually have.
> Last year I did a re-indexing in pre-prod and it took 6 months...
> Steps 2 and 3 take too much time... and I haven't kept the data after STEP
> 4; it's too big.
>
> It's not the re-indexing itself which takes time, but the preprocessing.
>
>
> Best Regards
> Bruno Mannina
> www.matheo-software.com
> www.patent-pulse.com
> Mob. +33 0 634 421 817
>
>
> -----Original Message-----
> From: Robi Petersen [mailto:robip...@gmail.com]
> Sent: Monday, April 7, 2025 02:12
> To: users@solr.apache.org
> Subject: Re: Solr error...
>
> Hi Bruno,
>
> As an aside: in general you'd want your staging (pre-prod) Solr instance
> to match your production Solr instance in every possible way (e.g. Solr
> version).
>
> Another thought is to have several indexing machines, each pointing at a
> portion of those 200M textfiles, to speed up indexing the entire corpus.
>
> Cheers
> Robi
>
> On Sat, Apr 5, 2025 at 4:08 AM Bruno Mannina <bmann...@matheo-software.com> wrote:
>
> > Hi Colvin,
> >
> > Thanks for your answer and your link; I will see if I can solve my problem.
> >
> > I use an old Solr, I know :'(.
> > This old version has been in use for several years and I have a huge set
> > of data (around 200M text files to index).
> > Re-indexing my data would take too much time for me (several weeks).
> >
> > It's a pre-production Solr (I use Solr 8.11.3 in production).
> > This pre-production is used to check data before pushing it to production.
> >
> >
> > Best Regards
> > Bruno Mannina
> > www.matheo-software.com
> > www.patent-pulse.com
> > Mob. +33 0 634 421 817
> >
> >
> > -----Original Message-----
> > From: Colvin Cowie [mailto:colvin.cowie....@gmail.com]
> > Sent: Friday, April 4, 2025 11:57
> > To: users@solr.apache.org
> > Subject: Re: Solr error...
> >
> > Hello,
> >
> > I think we might need some more context here, that is to say, why are
> > you using Solr 5.5.1? That was released in 2016 and is very much out
> > of date and unsupported (and will contain a number of critical CVEs).
> > So rather than trying to make it work, can you instead move to the
> > latest release (9.8.1)? A lot of things have changed in the last 9
> > years, so maybe consider it as a fresh start?
> >
> > By the sounds of the error, the *file* is corrupt now, that doesn't
> > mean the disk is corrupt. The reason for why that happened is probably
> > not going to be apparent, though if you go back through your logs you
> > might identify the cause.
> > A little googling of org.apache.lucene.index.CorruptIndexException
> > suggests that you may be able to "fix" the corrupt index (losing the
> > corrupted documents in the process):
> > https://stackoverflow.com/a/14934177
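> > (In case it's useful: that answer boils down to running Lucene's
> > CheckIndex tool against the index directory. A rough sketch below; the
> > jar name must match your Solr/Lucene version, the path is the one from
> > your error message, and you should back up the directory first, since
> > -exorcise permanently drops unreadable segments.)

```shell
# Check the index for corruption (read-only, safe to run):
java -cp lucene-core-5.5.1.jar org.apache.lucene.index.CheckIndex \
  "C:\Users\Utilisateur\INDEX\FTCLAIMS\index"

# Only after taking a backup: remove the corrupt segments for good.
java -cp lucene-core-5.5.1.jar org.apache.lucene.index.CheckIndex \
  "C:\Users\Utilisateur\INDEX\FTCLAIMS\index" -exorcise
```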
> >
> > But either way, I would seriously recommend that you move to a supported
> > version and reindex your data from source.
> >
> >
> >
> > On Thu, 3 Apr 2025 at 23:58, Bruno Mannina
> > <bmann...@matheo-software.com>
> > wrote:
> >
> > > Hi All,
> > >
> > >
> > >
> > > On my new computer I have a Solr (5.5.1) collection with an error.
> > >
> > > My new computer is 1.5 years old (4x 4 TB NVMe).
> > >
> > > I checked my disk and found no errors?!
> > >
> > > Do you know if I can do something to solve it?
> > >
> > > Many thanks for your help!
> > >
> > >
> > >
> > > The error message is:
> > >
> > >
> > >
> > > java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot complete commit
> > >          at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:2985)
> > >          at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2970)
> > >          at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2930)
> > >          at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:619)
> > >          at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1464)
> > >          at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1264)
> > >          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> > >          at java.util.concurrent.FutureTask.run(Unknown Source)
> > >          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> > >          at java.util.concurrent.FutureTask.run(Unknown Source)
> > >          at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
> > >          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > >          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > >          at java.lang.Thread.run(Unknown Source)
> > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=d0a2833f actual=64e63211 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="C:\Users\Utilisateur\INDEX\FTCLAIMS\index\_8znd.cfs") [slice=_8znd.fdt]))
> > >          at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:334)
> > >          at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
> > >          at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.checkIntegrity(CompressingStoredFieldsReader.java:669)
> > >          at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:595)
> > >          at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:177)
> > >          at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:83)
> > >          at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4075)
> > >          at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
> > >          at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
> > >          at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
> > >
> > >
> > >
> > >
> > >
> > > Best Regards
> > >
> > > Bruno Mannina
> > >
> > > www.matheo-software.com
> > >
> > > www.patent-pulse.com
> > >
> > > Mob. +33 0 634 421 817
> > >
> > >
> > >
> > >
> > >
> > > --
> > > This email has been checked for viruses by Avast antivirus software.
> > > www.avast.com
> >
> >
> >
>
>
>
