Hi,

I am currently benchmarking various compression algorithms using the Sep Codec, but I got index corruption exception during the merge process, and I would need your help to debug it.

I have reimplemented various algorithms like FOR, Simple9, VInt, PFor for the Sep IntBlock Codec. I am benchmarking them now on the wikipedia dataset. For some algorithms, FOR, Simple9, etc., I don't encounter problems. But using the PFor algorithms, I get a CorruptedIndex exception during the merge process (in SepPostingsWriterImpl#startDoc), because document are out of order:

Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 ) at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435) Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 ) at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)

However, this is happening only when I tried to index the wikipedia dataset using the PFor algorithm. I have tried to recreate the error using a unit test, creating random document, and performing a merge, but in this case the error does not appear.

After some debug, I have noticed that the document id at the end of a segment is the same than (or inferior to) the document id of the next segment to be merged. However, even by activating Codec.DEBUG=true, I am unable to know which are the faulty segments, and the faulty terms inside these segments. Could you indicate me a easy way to get this information, so I will be able to check these segments and their encoded blocks in order to find and understand the problem ?

Thanks in advance,
--
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to