Hi Jason,

No, the stack trace clearly shows that the error occurred during the
merge into a single index segment (the forceMerge parameter defines the
desired number of segments at the end).

While indexing a document, Lucene may decide to create a new segment from
the information extracted from the document. The Lucene documentation
<http://lucene.apache.org/core/3_0_3/fileformats.html> has a description
of each file extension and its usage by the program.

forceMerge is optional:

You can also skip the forceMerge and leave all segments as they are; the
retrieval of results will work in the same manner, maybe a little more
slowly, because the IndexReader will be mounted over several index
segments. In other words, the forceMerge that minimizes the number of
index segments can be avoided without harming the search results.
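For illustration, a minimal search sketch (a sketch only; it assumes the
"contents" field of the IndexFiles demo and an already-opened Directory
named "dir"): DirectoryReader simply mounts over however many segments
exist, so the hits are identical whether or not you ever merged.

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // open the index as-is, without ever calling forceMerge
    DirectoryReader reader = DirectoryReader.open(dir);
    // one leaf reader per segment; more segments, same results
    System.out.println("segments: " + reader.leaves().size());
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs hits = searcher.search(
            new TermQuery(new Term("contents", "lucene")), 10);
    reader.close();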

Regarding how to index files,

I did something different to index files found in a directory structure.
I used a FileVisitor
<http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileVisitor.html>
to accumulate which files would be targeted for indexing: first scan the
files, then, after the scan, extract their content using Tika
<http://tika.apache.org/> (one choice among several), and finally index
them.
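A minimal sketch of that scan phase (the root path here is just a
placeholder): the visitor only records candidate paths and defers all the
heavy I/O to the later extraction and indexing phases.

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.ArrayList;
    import java.util.List;

    final List<Path> candidates = new ArrayList<Path>();
    Files.walkFileTree(Paths.get("/data/docs"), new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                throws IOException {
            candidates.add(file); // just remember the file, no parsing yet
            return FileVisitResult.CONTINUE;
        }
    });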

With this you can avoid some memory issues and separate the scan process
(locating the files) from the content-extraction process (Tika or another
file-reading routine) and from the indexing process (Lucene), because all
of them are memory-consuming (for example, large PDF files or big string
segments).

The disadvantage is that the process becomes a little slower (if all
tasks run on the same JVM, you are obliged to coordinate all the threads),
but the advantage is that it permits you to divide the work into subtasks
and distribute them using a cache or a message queue like ActiveMQ
<http://activemq.apache.org/>; using a message queue for the subtasks also
lets you distribute them among different processes (JVMs) and machines. In
practice it takes a little more time, since you have to write some blocks
of code to manage all of those subtasks.
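For example, here is a single-JVM sketch of that pipeline (the queue
capacities, the "contents" field name, and the already-opened IndexWriter
"writer" are illustrative assumptions, not the exact code I wrote):
bounded queues keep each phase's memory use in check, and each stage could
publish to an ActiveMQ queue instead of a local queue to cross JVMs.

    import java.nio.file.Path;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.tika.Tika;

    final BlockingQueue<Path> extractQueue = new ArrayBlockingQueue<Path>(1024);
    final BlockingQueue<Document> indexQueue = new ArrayBlockingQueue<Document>(1024);
    final Tika tika = new Tika();

    // stage 1: the FileVisitor from the scan sketch above would put() each
    // Path it visits onto extractQueue instead of a plain list

    // stage 2: take scanned paths, extract text with Tika, hand off documents
    Thread extractor = new Thread(new Runnable() {
        public void run() {
            try {
                while (true) {
                    Path p = extractQueue.take();
                    Document doc = new Document();
                    doc.add(new TextField("contents",
                            tika.parseToString(p.toFile()), Field.Store.NO));
                    indexQueue.put(doc);
                }
            } catch (Exception e) {
                // interrupted or extraction failed; log and stop this worker
            }
        }
    });

    // stage 3: drain extracted documents into the IndexWriter
    Thread indexer = new Thread(new Runnable() {
        public void run() {
            try {
                while (true) {
                    writer.addDocument(indexQueue.take());
                }
            } catch (Exception e) {
                // interrupted; log and stop this worker
            }
        }
    });
    extractor.start();
    indexer.start();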



Regards,




On Tue, Apr 8, 2014 at 4:02 AM, Jason Wee <peich...@gmail.com> wrote:

> Hello Jose,
>
> Thank you for your response, I took a closer look. Below are my responses:
>
>
> > Seems that you want to force a max number of segments to 1,
>
>       // you're done adding documents to it):
>       //
>       writer.forceMerge(1);
>
>       writer.close();
>
> Yes, the line of code is uncommented because we want to understand how
> it works when indexing big data sets. Should this be a concern?
>
>
> > On a previous thread someone answered that the number of segments will
> > affect the Index Size, and is not related with Index Integrity (like size
> > of index may vary according with number of segments).
>
> okay, no idea what the above actually means, but I would guess that
> perhaps the code we added causes this exception?
>
>               if (file.isDirectory()) {
>                   String[] files = file.list();
>                   // an IO error could occur
>                   if (files != null) {
>                       for (int i = 0; i < files.length; i++) {
>                           indexDocs(writer, new File(file, files[i]), forceMerge);
>                           if (forceMerge && writer.hasPendingMerges()) {
>                               if (i % 1000 == 0 && i != 0) {
>                                   logger.trace("forcing merge now.");
>                                   try {
>                                       writer.forceMerge(50);
>                                       writer.commit();
>                                   } catch (OutOfMemoryError e) {
>                                       logger.error("out of memory during merging ", e);
>                                       throw new OutOfMemoryError(e.toString());
>                                   }
>                               }
>                           }
>                       }
>                   }
>               } else {
>                   FileInputStream fis;
>
>
> > Should be...
>
> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> >      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> > analyzer);
>
> yes, we were and still are referencing LUCENE_46 in our analyzer.
>
>
> /Jason
>
>
>
> On Sat, Apr 5, 2014 at 9:01 PM, Jose Carlos Canova <
> jose.carlos.can...@gmail.com> wrote:
>
> > Seems that you want to force a max number of segments to 1,
> > On a previous thread someone answered that the number of segments will
> > affect the Index Size, and is not related with Index Integrity (like size
> > of index may vary according with number of segments).
> >
> > on version 4.6 there is a small issue in the sample, which is
> >
> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> >       IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
> > analyzer);
> >
> >
> > Should be...
> >
> >
> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> >       IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> > analyzer);
> >
> >
> > With this, the line related to the codec will probably change too.
> >
> >
> >
> > On Fri, Apr 4, 2014 at 3:52 AM, Jason Wee <peich...@gmail.com> wrote:
> >
> > > Hello again,
> > >
> > > A little background on our experiment. We are storing Lucene (version
> > > 4.6.0) on top of Cassandra. We are using the demo IndexFiles.java from
> > > Lucene with a minor modification such that the directory used is a
> > > reference to the CassandraDirectory.
> > >
> > > With a large dataset (that is, indexing more than 50000 files), after
> > > indexing is done, we set forceMerge(1) and get the following exception.
> > >
> > >
> > > BufferedIndexInput readBytes [ERROR] bufferStart = '0' bufferPosition = '1024' len = '9252' after = '10276'
> > > BufferedIndexInput readBytes [ERROR] length = '8192'
> > > caught a class java.io.IOException
> > > with message: background merge hit exception: _1(4.6):c10250 _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5 [maxNumSegments=1]
> > > java.io.IOException: background merge hit exception: _1(4.6):c10250 _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5 [maxNumSegments=1]
> > >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1755)
> > >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1691)
> > >         at org.apache.lucene.store.IndexFiles.main(IndexFiles.java:159)
> > > Caused by: java.io.IOException: read past EOF: CassandraSimpleFSIndexInput(_1.nvd in path="_1.cfs" slice=5557885:5566077)
> > >         at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:186)
> > >         at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:125)
> > >         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:230)
> > >         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:186)
> > >         at org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:159)
> > >         at org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:516)
> > >         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:232)
> > >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127)
> > >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4057)
> > >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
> > >         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> > >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> > >
> > >
> > > We do not know what is wrong, as our understanding of Lucene is
> > > limited. Can someone explain what is happening, or what the possible
> > > source of the error might be?
> > >
> > > Thank you and any advice is appreciated.
> > >
> > > /Jason
> > >
> >
>
