Hi Jose,

Thank you for your very informative response.

I have commented out the lines of code that call forceMerge(50) and
commit() while indexing is in progress, and increased the RAM buffer
size with iwc.setRAMBufferSizeMB(512.0). Only after indexing is done do
we call forceMerge and commit, this time with a larger maximum segment
count, that is 50:

    if (writer != null && forceMerge) {
        writer.forceMerge(50);
        writer.commit();
    }

With these changes, the exceptions reported initially no longer occur.
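For reference, the overall shape of the setup is now roughly as below.
This is only a trimmed-down sketch: the FSDirectory, the index path and
the class name are placeholders, and our actual code plugs in the
CassandraDirectory and the IndexFiles-style document loop instead.

    import java.io.File;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexThenMerge {
        public static void main(String[] args) throws Exception {
            // Plain FSDirectory for illustration only; the real code uses
            // the CassandraDirectory here.
            Directory dir = FSDirectory.open(new File("/tmp/index"));

            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            IndexWriterConfig iwc =
                    new IndexWriterConfig(Version.LUCENE_46, analyzer);
            // Larger RAM buffer so segments are flushed less often.
            iwc.setRAMBufferSizeMB(512.0);

            IndexWriter writer = new IndexWriter(dir, iwc);
            try {
                // ... add documents here; no forceMerge()/commit() calls
                // inside the indexing loop any more ...
            } finally {
                // Merge down to at most 50 segments once, after indexing.
                writer.forceMerge(50);
                writer.commit();
                writer.close();
            }
        }
    }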
Thank you again.

Jason

On Tue, Apr 8, 2014 at 8:50 PM, Jose Carlos Canova <
jose.carlos.can...@gmail.com> wrote:

> Hi Jason,
>
> No, the stack trace clearly shows that the error occurred during the
> merge into a single index segment (the forceMerge parameter defines the
> desired number of segments at the end).
>
> While indexing a document, Lucene may decide to create a new segment
> from the information extracted from the document you are indexing. The
> Lucene documentation
> <http://lucene.apache.org/core/3_0_3/fileformats.html> has a
> description of each file extension and how the program uses it.
>
> ForceMerge is an option:
>
> You can also avoid forceMerge and leave all segments "as is". Retrieval
> of results will work just the same, perhaps a little more slowly
> because the IndexReader will be mounted over several index segments,
> but it works the same way. In other words, using forceMerge to minimize
> the number of index segments can be avoided without harming the search
> results.
>
> Regarding how to index files,
>
> I did something different to index files found in a directory
> structure. I used a FileVisitor
> <http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileVisitor.html>
> to accumulate which files should be indexed, which means first scanning
> the files, then, after the scan, extracting their content using Tika
> <http://tika.apache.org/> (one choice among others), and finally
> indexing them.
>
> With this you can avoid some memory issues and separate the scan
> process (locating the files) from the content extraction process (Tika
> or another file-reading routine) and from the indexing process
> (Lucene), because all of them are memory consuming (for example, large
> PDF files or big string segments).
>
> The disadvantage is that it is a somewhat slower process (if all tasks
> run on the same JVM, you are obliged to coordinate all the threads),
> but the advantage is that it lets you divide the work into subtasks and
> distribute them using a cache or a message queue such as ActiveMQ
> <http://activemq.apache.org/>. Subtasks on a message queue also let you
> distribute the work among different processes (JVMs) and machines. In
> practice it takes a bit of time, since you have to write some code to
> manage all of those subtasks.
>
>
>
> att.
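(Just to check that I follow the scan-then-extract-then-index idea
above: I picture the scan step roughly like the sketch below, with
extraction and indexing done later over the collected paths. This is
only my reading of the suggestion; the Tika call and the path handling
are illustrative, not our actual code.)

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.tika.Tika;

    public class ScanThenExtract {
        public static void main(String[] args) throws IOException {
            // Step 1: scan only -- collect candidate files, read no content yet.
            final List<Path> candidates = new ArrayList<Path>();
            Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    candidates.add(file);
                    return FileVisitResult.CONTINUE;
                }
            });

            // Step 2: extract content separately (this part could instead be
            // handed off to a queue such as ActiveMQ and run in another JVM).
            Tika tika = new Tika();
            for (Path file : candidates) {
                try {
                    String text = tika.parseToString(file.toFile());
                    // Step 3: pass "text" on to the Lucene indexing code.
                } catch (Exception e) {
                    // skip files Tika cannot handle
                }
            }
        }
    }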
>
> On Tue, Apr 8, 2014 at 4:02 AM, Jason Wee <peich...@gmail.com> wrote:
>
> > Hello Jose,
> >
> > Thank you for your response. I took a closer look; below are my
> > responses:
> >
> >
> > > Seems that you want to force a max number of segments to 1,
> >
> > // you're done adding documents to it):
> > //
> > writer.forceMerge(1);
> >
> > writer.close();
> >
> > Yes, that line of code is uncommented because we want to understand
> > how it works when indexing big data sets. Should this be a concern?
> >
> >
> > > On a previous thread someone answered that the number of segments
> > > will affect the index size, and is not related to index integrity
> > > (i.e. the size of the index may vary according to the number of
> > > segments).
> >
> > Okay, no idea what the above actually means, but I would guess that
> > perhaps the code we added causes this exception?
> >
> >         if (file.isDirectory()) {
> >             String[] files = file.list();
> >             // an IO error could occur
> >             if (files != null) {
> >                 for (int i = 0; i < files.length; i++) {
> >                     indexDocs(writer, new File(file, files[i]),
> >                               forceMerge);
> >                     if (forceMerge && writer.hasPendingMerges()) {
> >                         if (i % 1000 == 0 && i != 0) {
> >                             logger.trace("forcing merge now.");
> >                             try {
> >                                 writer.forceMerge(50);
> >                                 writer.commit();
> >                             } catch (OutOfMemoryError e) {
> >                                 logger.error("out of memory during merging ", e);
> >                                 throw new OutOfMemoryError(e.toString());
> >                             }
> >                         }
> >                     }
> >                 }
> >             }
> >
> >         } else {
> >             FileInputStream fis;
> >
> >
> > > Should be...
> >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> > > analyzer);
> >
> > Yes, we were and still are referencing LUCENE_46 in our analyzer.
> >
> >
> > /Jason
> >
> >
> >
> > On Sat, Apr 5, 2014 at 9:01 PM, Jose Carlos Canova <
> > jose.carlos.can...@gmail.com> wrote:
> >
> > > Seems that you want to force a max number of segments to 1.
> > > On a previous thread someone answered that the number of segments
> > > will affect the index size, and is not related to index integrity
> > > (i.e. the size of the index may vary according to the number of
> > > segments).
> > >
> > > On version 4.6 there is a small issue in the sample, which is
> > >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
> > > analyzer);
> > >
> > > It should be...
> > >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
> > > analyzer);
> > >
> > > With this, the line related to the codec will probably change too.
> > >
> > >
> > >
> > > On Fri, Apr 4, 2014 at 3:52 AM, Jason Wee <peich...@gmail.com> wrote:
> > >
> > > > Hello again,
> > > >
> > > > A little background on our experiment: we are storing Lucene
> > > > (version 4.6.0) on top of Cassandra. We are using the demo
> > > > IndexFiles.java from Lucene with a minor modification such that
> > > > the directory used is a reference to the CassandraDirectory.
> > > >
> > > > With a large dataset (that is, indexing more than 50000 files),
> > > > after indexing is done and forceMerge(1) is called, we get the
> > > > following exception.
> > > >
> > > > BufferedIndexInput readBytes [ERROR] bufferStart = '0' bufferPosition = '1024' len = '9252' after = '10276'
> > > > BufferedIndexInput readBytes [ERROR] length = '8192'
> > > > caught a class java.io.IOException
> > > > with message: background merge hit exception: _1(4.6):c10250
> > > > _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
> > > > [maxNumSegments=1]
> > > > java.io.IOException: background merge hit exception: _1(4.6):c10250
> > > > _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5
> > > > [maxNumSegments=1]
> > > >     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1755)
> > > >     at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1691)
> > > >     at org.apache.lucene.store.IndexFiles.main(IndexFiles.java:159)
> > > > Caused by: java.io.IOException: read past EOF:
> > > > CassandraSimpleFSIndexInput(_1.nvd in path="_1.cfs" slice=5557885:5566077)
> > > >     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:186)
> > > >     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:125)
> > > >     at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:230)
> > > >     at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:186)
> > > >     at org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:159)
> > > >     at org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:516)
> > > >     at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:232)
> > > >     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127)
> > > >     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4057)
> > > >     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
> > > >     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> > > >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> > > >
> > > >
> > > > We do not know what is wrong, as our understanding of Lucene is
> > > > limited. Can someone explain what is happening, or what the
> > > > possible source of the error might be?
> > > >
> > > > Thank you, and any advice is appreciated.
> > > >
> > > > /Jason
> > > >
> > >
> >
>