This isn't normal. A mergeFactor of 150 is way too high; I'd put that back to 10 and see if the problem persists. Also make sure you're using autoCommit=false, and try the suggestions here:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You're sure the JRE's heap size is big enough?

If the problem persists... can you turn on IndexWriter's infoStream and
post the resulting output leading up to the 100% CPU?

You might also try "kill -QUIT" when the 100% CPU problem is happening,
to catch the stack trace of all threads, and post that too...

Mike

On Mon, Jun 8, 2009 at 6:23 AM, Mateusz Berezecki<mateu...@gmail.com> wrote:
> Hi list,
>
> I'm having a trouble with achieving good performance when indexing XML
> wikipedia dump.
> The indexing process works as follows
>
> 1. setup FSDirectory
> 2. setup IndexWriter
> 3. setup custom analyzer chaining wikipediatokenizer, lowercasefilter,
> porterstemmer, stopfilter and lengthfilter
> 3. create XMLStreamReader that reads from XML file
> 4. run the parser and get <text> tag contents as well as <title>
> contents and insert them into Document
> 5. add document to the index
>
> the options for the writer are
> - compound file is turned off
> - merge factor set to 150
> - ram buffer size is set to 300 MB
>
> in addition to that the XML stream is read using bufferedfilereader
> with buffer size of 100 MB
>
> This all works good for the first couple of minutes indexing extracted
> articles very quickly but later on some problems start to show. The
> symptoms are:
> - the CPU is at 100% and the stream reading and indexing seems to be stopped
> - the application seems to be dead
> - it resumes after some time (anywhere between 1 to 40 minutes)
>
> I've double checked my code for any problems and even rewritten it a
> couple of times so this makes me think that there's some problem in
> lucene itself. The problem is persistent in both 2.4.1 and 2.9-dev
> versions.
>
> Is there any known bug related to long running batch indexing
> processes that operate on large documents? In my case the single XML
> file is 20 GB and I'm just surprised how quickly the performance of
> the indexer degrades.
>
> Do you have any suggestions?
>
> best,
> Mateusz Berezecki
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
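
A minimal sketch, not part of the original reply: the two JVM-side checks
suggested in this thread (whether the heap is big enough, and a stack trace of
all threads during the 100% CPU period) can also be approximated from inside
the indexing process. The class name JvmDiagnostics and both method names are
hypothetical; only standard JDK calls are used (Runtime.maxMemory and
Thread.getAllStackTraces). It is not a replacement for "kill -QUIT", whose
dump also reports lock and monitor information.

```java
import java.util.Map;

// Hypothetical helper class; the name and methods are illustrative only.
class JvmDiagnostics {

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Formats the current stack of every live thread as plain text. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        System.out.print(dumpAllThreads());
    }
}
```

One possible use is to call dumpAllThreads() from a separate watchdog thread
when no document has been added for some time, and to log maxHeapMb() once at
startup so it can be compared against the 300 MB RAM buffer plus the 100 MB
read buffer mentioned in the quoted message.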