Hi,
I am currently benchmarking various compression algorithms using the Sep
Codec, but I am getting an index corruption exception during the merge
process, and I would need your help to debug it.
I have reimplemented various algorithms (FOR, Simple9, VInt, PFor) for
the Sep IntBlock Codec, and I am now benchmarking them on the Wikipedia
dataset. With some algorithms (FOR, Simple9, etc.) I don't encounter any
problems. With the PFor algorithm, however, I get a
CorruptIndexException during the merge process (in
SepPostingsWriterImpl#startDoc), because documents are out of order:
Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153 )
    at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)
However, this only happens when I index the Wikipedia dataset using the
PFor algorithm. I have tried to reproduce the error in a unit test that
creates random documents and performs a merge, but in that case the
error does not appear.
After some debugging, I have noticed that the document id at the end of
a segment is the same as (or lower than) the document id of the next
segment to be merged. However, even with Codec.DEBUG=true, I am unable
to find out which segments are faulty, and which terms inside these
segments are affected. Could you point me to an easy way to get this
information, so that I can check these segments and their encoded
blocks in order to find and understand the problem?
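For reference, the condition reported in the exception can be sketched in plain Java as follows. This is only an illustration and not Lucene code: the class DocOrderCheck and its two methods are hypothetical names, and the sketch assumes that doc ids are stored as deltas from the previous doc id, so that a decoded delta of 0 repeats the previous id, as in "153 <= 153".

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: checks a decoded postings list.
final class DocOrderCheck {

    // Rebuilds absolute doc ids from deltas (assumption: each stored value
    // is the difference from the previous doc id, starting from 0).
    static int[] fromDeltas(int[] deltas) {
        int[] docs = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            docs[i] = doc;
        }
        return docs;
    }

    // Returns the index of the first doc id that is not strictly greater
    // than its predecessor, or -1 if the list is strictly increasing.
    static int firstOutOfOrder(int[] docIds) {
        for (int i = 1; i < docIds.length; i++) {
            if (docIds[i] <= docIds[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up deltas; the third one (0) repeats the previous doc id.
        int[] docs = fromDeltas(new int[] {150, 3, 0, 4});
        System.out.println(Arrays.toString(docs));
        System.out.println("first out-of-order index: " + firstOutOfOrder(docs));
    }
}
```

Running such a check on the values decoded from each block, for each term, would at least narrow the failure down to a specific block, provided the decoded values can be obtained outside the merge.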
Thanks in advance,
--
Renaud Delbru