On Fri, Sep 11, 2009 at 1:15 PM, <paul_murd...@emainc.com> wrote:

> I've been testing out "paging" the document this past week. I'm
> still working on getting a successful test, and I think I'm close. The
> downside was a drastic slowdown in indexing speed, and lots of
> open files, but that was expected.
You mean a slowdown in indexing speed because you now flush after every
page, not after every document, right? That's expected. But I'm not sure
why you'd see a change in the number of open files...

> I tried with small mergeFactors, maxBufferedDocs (haven't tried 1
> though), and ramBufferSizeMB. Using JConsole to monitor the heap
> usage, this method slowly creeps towards my max heap space until
> OOM. I can say that at least some of the document gets indexed
> before OOM. So I performed a heap dump at OOM and saw that
> FreqProxTermsWriterPerField had by far consumed the most memory. I
> haven't looked into that yet...

It's at least ~60 bytes per unique term, not counting the char[] to hold
the term, and FreqProxTermsWriterPerField is exactly where most of those
bytes are allocated (e.g. its PostingList class).

> Let's say I page the document into ten different smaller documents
> and they are indexed successfully (I'm not quite at this point yet).
> Is there a way to select documents by id and merge them into one
> large document after they are in the index? That was my plan to
> work around OOM and achieve the same end result as trying to index
> the large document in one shot.

You mean at search time, right? You basically want the equivalent of
SQL's "group by". You could make a custom Collector that does this (see
the rough sketches below)... Or look at how Solr is iterating on field
collapsing (https://issues.apache.org/jira/browse/SOLR-236)?

Mike
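PS: a rough, untested sketch of the "paging" approach as I understand it,
written against Lucene 2.9's IndexWriter API. The field names (parentId,
page, body) and the helper class are made up for illustration; the point
is just that every page is added as its own small document, and every
page carries the same parentId so the pages can be regrouped at search
time:

  import java.io.File;
  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  // Hypothetical helper: indexes one huge logical document as many
  // small "page" documents that all share the same parentId field.
  public class PagedDocumentIndexer {

    public static void indexPages(File indexDir, String parentId, String[] pages)
        throws IOException {
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(indexDir),
          new StandardAnalyzer(),
          IndexWriter.MaxFieldLength.UNLIMITED);
      writer.setRAMBufferSizeMB(32);  // flush by RAM usage, not by doc count

      for (int i = 0; i < pages.length; i++) {
        Document doc = new Document();
        // same parentId on every page, so the pages can be grouped later
        doc.add(new Field("parentId", parentId,
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("page", Integer.toString(i),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", pages[i],
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);  // each page is its own Lucene document
      }
      writer.close();
    }
  }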
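And a matching sketch of what a "group by" Collector could look like,
again assuming Lucene 2.9's Collector and FieldCache APIs and the
hypothetical parentId field above. It only collapses hits so each
logical document is counted once; a real implementation would also keep
e.g. the best-scoring page per parent:

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.Scorer;

  // Collapses hits that share the same parentId, so the N "pages" of
  // one logical document show up as a single result.
  public class GroupByParentIdCollector extends Collector {

    private final Set<String> seenParents = new HashSet<String>();
    private String[] parentIds;  // per-segment values from the FieldCache

    public void setScorer(Scorer scorer) {
      // scores are ignored in this sketch
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
      // one parentId per document in the current segment
      parentIds = FieldCache.DEFAULT.getStrings(reader, "parentId");
    }

    public void collect(int doc) {
      // record each logical (parent) document only once
      seenParents.add(parentIds[doc]);
    }

    public boolean acceptsDocsOutOfOrder() {
      return true;
    }

    public Set<String> getParentIds() {
      return seenParents;
    }
  }

You'd run it with something like
searcher.search(query, new GroupByParentIdCollector()) and then read the
collapsed parent ids back out of the collector.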