Re: IndexWriter.close() performance issue

2011-02-22 Thread Mark Kristensson
I'm resurrecting this old thread because this issue is now reaching a critical point for us and I'm going to have to modify the Lucene source code for it to continue to work for us. Just a quick refresher: we have one index with several hundred thousand unique field names and found that opening an

Re: IndexWriter.close() performance issue

2010-11-23 Thread Mark Kristensson
I've tried the suggestion below, but it really doesn't seem to have any impact. I guess that's not surprising since 80% of the CPU time when I ran hprof was in String.intern(), not in the StringHelper class. Clearly, if I'm going to hack things up at this level, I've got some work to do, inclu

Re: IndexWriter.close() performance issue

2010-11-20 Thread Yonik Seeley
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson wrote: > Here's the changes I made to org.apache.lucene.util.StringHelper: > >  //public static StringInterner interner = new SimpleStringInterner(1024,8); As Mike said, the real fix for trunk is to get rid of interning. But for your version, you

Re: IndexWriter.close() performance issue

2010-11-20 Thread Michael McCandless
Also, you'd have to synchronize access to the HashMap. But it is surprising intern is this much of a performance hog that you can shave ~7 seconds off IR init time. We've talked about removing the interning of field names, especially with flexible indexing (4.0) where fields and term text are now
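The point about synchronizing the HashMap can be sketched as follows. This is a minimal, standalone illustration, not Mark's patch (which is not shown in this listing) and not Lucene's interner API; the class name SynchronizedMapInterner is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, for illustration only: a HashMap-backed string pool
// whose single access point is synchronized so that concurrent callers
// cannot corrupt the map.
public class SynchronizedMapInterner {
    // Maps each distinct string value to its canonical instance.
    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns the canonical instance for s, registering s itself
    // if no equal string has been seen before.
    public synchronized String intern(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }
}
```

Note that instances returned by a private pool like this are not the ones String.intern() returns, so any code that compares such a string by reference against a string literal would stop matching.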

Re: IndexWriter.close() performance issue

2010-11-19 Thread Shai Erera
I actually think that the main reason for interning the field names in Lucene is for comparison purposes and not to guarantee uniqueness (though you get both). You will see many places in Lucene's code where the field name is compared using the != operator instead of equals. BTW, in your patch abo
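Why interning makes reference comparison safe can be shown with plain Java strings. This is a minimal sketch using only java.lang.String; the class and method names are hypothetical and nothing here is taken from Lucene's code.

```java
// Hypothetical demo class: shows that equal strings built at run time are
// distinct objects, while their interned forms are the same object.
public class InternIdentityDemo {
    // True only when both references point to the same String object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two separate objects with equal content.
        String first = new String("title");
        String second = new String("title");

        System.out.println(first.equals(second));                            // true
        System.out.println(sameReference(first, second));                    // false
        System.out.println(sameReference(first.intern(), second.intern()));  // true
    }
}
```

A reference comparison is a single pointer check, which is why code that can rely on every field name being interned may use == and != in place of equals.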

Re: IndexWriter.close() performance issue

2010-11-19 Thread Mark Kristensson
My findings from the hprof results which showed 80% of the CPU time being in String.intern() led me to do some reading about String.intern() and what I found surprised me. First, there are some very strong feelings about String.intern() and its value. First, is this guy (http://www.codeinstruc

Re: IndexWriter.close() performance issue

2010-11-18 Thread Mark Kristensson
I finally bucked up and made the change to CheckIndex to verify that I do not, in fact, have any fields with norms in this index. The result is below - the largest segment currently is #3, which has 300,000+ fields but no norms. -Mark Segments file=segments_acew numSegments=9 version=FORMAT_DIAGN

Re: IndexWriter.close() performance issue

2010-11-17 Thread Mark Kristensson
Sure, There is only one stack trace (that seems to be how the output for this tool works) for java.lang.String.intern: TRACE 300165: java.lang.String.intern(:Unknown line) org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74) org.apache.lucene.

Re: IndexWriter.close() performance issue

2010-11-17 Thread Michael McCandless
Lucene interns field names... since you have a truly enormous number of unique fields it's expected intern will be called a lot. But that said it's odd that it's this costly. Can you post the stack traces that call intern? Mike On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless wrote: > Hmm...

Re: IndexWriter.close() performance issue

2010-11-17 Thread Mark Kristensson
After a week away, I'm back and still working to get to the bottom of this issue. We run Lucene from the binaries, so making changes to the source code is not something we are really set up to do right now. I have, however, created a trivial Java app that just opens an IndexReader for our proble

Re: IndexWriter.close() performance issue

2010-11-05 Thread Michael McCandless
Hmm... So, I was going on this output from your CheckIndex: test: field norms.OK [296713 fields] But in fact I just looked and that number is bogus -- it's always equal to total number of fields, not number of fields with norms enabled. I'll open an issue to fix this, but in the mean

Re: IndexWriter.close() performance issue

2010-11-05 Thread Mark Kristensson
While most of our Lucene indexes are used for more traditional searching, this index in particular is used more like a reporting repository. Thus, we really do need to have that many fields indexed and they do need to be broken out into separate fields. There may be another way to structure the

Re: IndexWriter.close() performance issue

2010-11-04 Thread Michael McCandless
Likely what happened is you had a bunch of smaller segments, and then suddenly they got merged into that one big segment (_aiaz) in your index. The representation for norms in particular is not sparse, so this means the size of the norms file for a given segment will be number-of-unique-indexed-fi

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
Yes, we do have a large number of unique field names in that index, because they are driven by user named fields in our application (with some cleaning to remove illegal chars). This slowness problem has appeared very suddenly in the last couple of weeks and the number of unique field names has

Re: IndexWriter.close() performance issue

2010-11-03 Thread Michael McCandless
On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson wrote: > > I've run checkIndex against the index and the results are below. The net is > that it's telling me nothing is wrong with the index. Thanks. > I did not have any instrumentation around the opening of the IndexSearcher > (we don't use

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
I've run checkIndex against the index and the results are below. The net is that it's telling me nothing is wrong with the index. I did not have any instrumentation around the opening of the IndexSearcher (we don't use an IndexReader), just around the actual query execution so I had to add so

Re: IndexWriter.close() performance issue

2010-11-03 Thread Shai Erera
I'd even offer, if the index is small, perhaps you can post it somewhere for us to download and debug trace commit()… Also, though not very scientific, you can turn on debug messages by setting an infoStream and observe which prints take the longest to appear. Not very accurate but if there's one oper

Re: IndexWriter.close() performance issue

2010-11-03 Thread Michael McCandless
Can you run CheckIndex (command line tool) and post the output? How long does it take to open a reader on this same index, and perform a simple query (eg TermQuery)? Mike On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson wrote: > I've successfully reproduced the issue in our lab with a copy from

Re: IndexWriter.close() performance issue

2010-11-03 Thread Yonik Seeley
> It turns out that the prepareCommit() is the slow call here, taking several > seconds to complete. > > I've done some reading about it, but have not found anything that might be > helpful here. The fact that it is slow > every single time, even when I'm adding exactly one document to the index,

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
I've successfully reproduced the issue in our lab with a copy from production and have broken the close() call into parts, as suggested, with one addition. Previously, the call was simply ... } finally { // Close if (indexWriter != null) {

Re: IndexWriter.close() performance issue

2010-11-02 Thread Mark Kristensson
Wonderful information on what happens during indexWriter.close(), thank you very much! I've got some testing to do as a result. We are on Lucene 3.0.0 right now. One other detail that I neglected to mention is that the batch size does not seem to have any relation to the time it takes to close

Re: IndexWriter.close() performance issue

2010-11-02 Thread Shai Erera
When you close IndexWriter, it performs several operations that might have a connection to the problem you describe: * Commit all the pending updates -- if your update batch size is more or less the same (i.e., comparable # of docs and total # bytes indexed), then you should not see a performance

IndexWriter.close() performance issue

2010-11-01 Thread Mark Kristensson
Hello, One of our Lucene indexes has started misbehaving on indexWriter.close and I'm searching for ideas about what may have happened and how to fix it. Here's our scenario: - We have seven Lucene indexes that contain different sets of data from a web application are indexed for searching by