Thanks Glen! I will take a look at your project. Unfortunately I will only have 512 MB to 1024 MB to work with, as Lucene is only one component in a larger software system running on one machine. I agree with you on the C/C++ comment. That is what I would normally use for memory-intensive software. It turns out that the larger the file you want to index, the larger the heap space you will need. What I would like to see is a way to "throttle" the indexing process to control the memory footprint. I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.
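One way that throttling could be sketched (this is a hypothetical helper, not something Lucene provides): split the large file into bounded-size chunks and index each chunk as its own document, so only one chunk is ever held in memory at a time. The chunking itself needs nothing beyond the JDK:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: read a large source in fixed-size chunks so only one chunk
// is ever buffered. In real indexing code, each chunk would become its
// own Lucene document (addDocument + an occasional commit), which
// bounds the heap needed regardless of the file's total size.
public class ChunkedIndexer {

    // Split the reader's content into chunks of at most chunkChars characters.
    static List<String> readChunks(Reader source, int chunkChars) throws IOException {
        List<String> chunks = new ArrayList<String>();
        BufferedReader in = new BufferedReader(source);
        char[] buf = new char[chunkChars];
        int n;
        while ((n = in.read(buf, 0, chunkChars)) != -1) {
            chunks.add(new String(buf, 0, n));
            // Hypothetical indexing step per chunk:
            //   doc.add(new Field("content", chunks.get(chunks.size() - 1), ...));
            //   indexWriter.addDocument(doc);  // then commit() every N chunks
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // 10 characters split into chunks of 4 -> "abcd", "efgh", "ij"
        List<String> chunks = readChunks(new StringReader("abcdefghij"), 4);
        System.out.println(chunks.size());
        System.out.println(chunks.get(2));
    }
}
```

The trade-off is the one Paul anticipates: more flushes mean slower indexing, but peak heap stays proportional to the chunk size rather than the file size. (Splitting a file into several documents does change search semantics slightly, since phrase matches cannot span chunk boundaries.)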
Thanks,

Paul

-----Original Message-----
From: java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org [mailto:java-user-return-42272-paul_murdoch=emainc....@lucene.apache.org] On Behalf Of Glen Newton
Sent: Friday, September 11, 2009 9:53 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing large files? - No answers yet...

In this project: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html I concatenate all the text of all of the articles of a single journal into a single text file. This can create a text file that is 500MB in size. Lucene is OK indexing files this size (in parallel even), but I have a heap size of 8GB.

I would suggest increasing your heap to as large as your machine can reasonably take. The reality is that Java programs (like Lucene) take up more memory than a similar C or even C++ program. Java may approach C/C++ in speed, but not in memory. We don't use Java because of its memory footprint! ;-)

See:
Programming language shootout: speed: http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
Programming language shootout: memory: http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0

-glen

2009/9/11 Dan OConnor <docon...@acquiremedia.com>:
> Paul:
>
> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in versions 10 - 13 (especially dealing with large heaps).
>
> Next, your IndexWriter parameters would help figure out why you are using so much RAM:
> getMaxFieldLength()
> getMaxBufferedDocs()
> getMaxMergeDocs()
> getRAMBufferSizeMB()
>
> How often are you calling commit?
> Do you close your IndexWriter after every document?
> How many documents of this size are you indexing?
> Have you used Luke to look at your index?
> If this is a large index, have you optimized it recently?
> Are there any searches going on while you are indexing?
>
> Regards,
> Dan
>
> -----Original Message-----
> From: paul_murd...@emainc.com [mailto:paul_murd...@emainc.com]
> Sent: Friday, September 11, 2009 7:57 AM
> To: java-user@lucene.apache.org
> Subject: RE: Indexing large files? - No answers yet...
>
> This issue is still open. Any suggestions/help with this would be greatly appreciated.
>
> Thanks,
>
> Paul
>
> -----Original Message-----
> From: java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org [mailto:java-user-return-42080-paul_murdoch=emainc....@lucene.apache.org] On Behalf Of paul_murd...@emainc.com
> Sent: Monday, August 31, 2009 10:28 AM
> To: java-user@lucene.apache.org
> Subject: Indexing large files?
>
> Hi,
>
> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I'm consistently receiving "OutOfMemoryError: Java heap space" when trying to index large text files.
>
> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do it. Why so much?
>
> The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far, according to JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.
>
> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works, but Lucene uses 826 MB of heap space while performing this.
Still seems like way too much memory is being used to do this. I'm sure larger files would cause the error, as it seems correlative.
>
> I'm on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I'm using:
>
> // Index the content of a text file.
> private Boolean saveTXTFile(File textFile, Document textDocument) throws CIDBException {
>
>     try {
>         Boolean isFile = textFile.isFile();
>         Boolean hasTextExtension = textFile.getName().endsWith(".txt");
>
>         if (isFile && hasTextExtension) {
>             System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
>             Reader textFileReader = new FileReader(textFile);
>             if (textDocument == null)
>                 textDocument = new Document();
>             textDocument.add(new Field("content", textFileReader));
>             indexWriter.addDocument(textDocument); // BREAKS HERE!!!!
>         }
>     } catch (FileNotFoundException fnfe) {
>         System.out.println(fnfe.getMessage());
>         return false;
>     } catch (CorruptIndexException cie) {
>         throw new CIDBException("The index has become corrupt.");
>     } catch (IOException ioe) {
>         System.out.println(ioe.getMessage());
>         return false;
>     }
>     return true;
> }
>
> Thanks much,
>
> Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
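For reference, the knobs Dan asks about can be set directly on a Lucene 2.4 IndexWriter. A minimal configuration sketch, assuming an already-opened writer named `indexWriter` (the values are illustrative, not recommendations):

```java
// Cap the writer's RAM buffer so indexing flushes segments to disk
// instead of growing the heap; let buffer size, not doc count, drive flushing.
indexWriter.setRAMBufferSizeMB(16.0);
indexWriter.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

// maxFieldLength bounds how many terms of each field are indexed.
// MAX_VALUE indexes the whole file (costing memory); lower it to bound the cost.
indexWriter.setMaxFieldLength(Integer.MAX_VALUE);

// ... addDocument() calls ...

// Commit periodically rather than after every document.
indexWriter.commit();
```

Note this only bounds Lucene's own indexing buffer; per-document term structures (the FreqProxTermsWriterPerField allocations Paul observed) still scale with the size of a single document, which is why splitting huge files into smaller documents may also be needed.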