I'm resurrecting this old thread because this issue is now reaching a
critical point for us and I'm going to have to modify the Lucene source code
to keep it working for us.
Just a quick refresher: we have one index with several hundred thousand
unique field names and found that opening an
I've tried the suggestion below, but it really doesn't seem to have any impact.
I guess that's not surprising since 80% of the CPU time when I ran hprof was in
String.intern(), not in the StringHelper class.
Clearly, if I'm going to hack things up at this level, I've got some work to
do, inclu
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson
wrote:
> Here are the changes I made to org.apache.lucene.util.StringHelper:
>
> //public static StringInterner interner = new SimpleStringInterner(1024,8);
As Mike said, the real fix for trunk is to get rid of interning.
But for your version, you
Also, you'd have to synchronize access to the HashMap.
But it is surprising that intern is this much of a performance hog, such that
you can shave ~7 seconds off IR init time.
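For what it's worth, here is a minimal sketch of the kind of map-backed interner
being discussed: it avoids String.intern() entirely and synchronizes access to the
HashMap, as noted above. The class name MapStringInterner is mine, and it assumes
the Lucene 3.0 StringInterner base class with an overridable intern(String) method;
this is not the actual patch from this thread:

    package org.apache.lucene.util;

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: interner backed by an unbounded HashMap instead of
    // String.intern(). Access is synchronized because readers may intern
    // field names from multiple threads.
    public class MapStringInterner extends StringInterner {
      private final Map<String, String> pool = new HashMap<String, String>();

      @Override
      public synchronized String intern(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
          pool.put(s, s);
          canonical = s;
        }
        return canonical;
      }
    }

    // Then, in StringHelper, instead of the default:
    //   public static StringInterner interner = new MapStringInterner();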
We've talked about removing the interning of field names, especially
with flexible indexing (4.0) where fields and term text are now
I actually think that the main reason for interning the field names in
Lucene is for comparison purposes and not to guarantee uniqueness (though
you get both). You will see many places in Lucene's code where the field
name is compared using the != operator instead of equals().
BTW, in your patch abo
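To illustrate why that works (a toy example, not Lucene code): interning
guarantees one canonical String instance per distinct value, so reference
comparison is safe and cheap.

    String a = new String("title");       // a fresh String object
    String b = "title";                   // the compile-time constant
    System.out.println(a == b);           // false: different objects, equal text
    System.out.println(a.intern() == b);  // true: intern() returns the canonical copy

    // Lucene can therefore use identity checks on interned field names,
    // schematically:
    //   if (fieldName != otherFieldName) { ... }
    // which is only valid because both names went through the interner.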
My findings from the hprof results, which showed 80% of the CPU time being in
String.intern(), led me to do some reading about String.intern(), and what I
found surprised me.
First, there are some very strong feelings about String.intern() and its value.
For one, there is this guy
(http://www.codeinstruc
I finally bucked up and made the change to CheckIndex to verify that I do not,
in fact, have any fields with norms in this index. The result is below - the
largest segment currently is #3, which has 300,000+ fields but no norms.
-Mark
Segments file=segments_acew numSegments=9 version=FORMAT_DIAGN
Sure,
There is only one stack trace (that seems to be how the output for this tool
works) for java.lang.String.intern:
TRACE 300165:
java.lang.String.intern(:Unknown line)
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74)
org.apache.lucene.
Lucene interns field names... since you have a truly enormous number
of unique fields, it's expected that intern will be called a lot.
But, that said, it's odd that it's this costly.
Can you post the stack traces that call intern?
Mike
On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
wrote:
> Hmm...
After a week away, I'm back and still working to get to the bottom of this
issue. We run Lucene from the binaries, so making changes to the source code is
not something we are really set up to do right now.
I have, however, created a trivial Java app that just opens an IndexReader for
our proble
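Something along these lines, for anyone who wants to reproduce it -- a sketch
rather than our actual test app; the index path comes from the command line and
the field/term in the query are placeholders:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class OpenTimer {
      public static void main(String[] args) throws Exception {
        // Time just the reader open, which is where the intern() cost shows up.
        long t0 = System.currentTimeMillis();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        System.out.println("open: " + (System.currentTimeMillis() - t0) + " ms");

        // Time a single simple TermQuery, per Mike's question below.
        long t1 = System.currentTimeMillis();
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("someField", "someValue")), 10);
        System.out.println("query: " + (System.currentTimeMillis() - t1)
            + " ms, hits=" + hits.totalHits);

        searcher.close();
        reader.close();
      }
    }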
Hmm...
So, I was going on this output from your CheckIndex:
test: field norms.OK [296713 fields]
But in fact I just looked and that number is bogus -- it's always
equal to the total number of fields, not the number of fields with norms
enabled. I'll open an issue to fix this, but in the mean
While most of our Lucene indexes are used for more traditional searching, this
index in particular is used more like a reporting repository. Thus, we really
do need to have that many fields indexed and they do need to be broken out into
separate fields. There may be another way to structure the
Likely what happened is you had a bunch of smaller segments, and then
suddenly they got merged into that one big segment (_aiaz) in your
index.
The representation for norms in particular is not sparse, so this
means the size of the norms file for a given segment will be
number-of-unique-indexed-fi
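To put rough numbers on that (my own back-of-the-envelope, assuming one norm byte
per document per indexed field): with ~300,000 unique indexed fields, a merged
segment of just 10,000 documents would need on the order of 300,000 x 10,000 =
3 GB of norms, whether or not most documents actually use most of those fields.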
Yes, we do have a large number of unique field names in that index, because
they are driven by user-named fields in our application (with some cleaning to
remove illegal chars).
This slowness problem has appeared very suddenly in the last couple of weeks
and the number of unique field names has
On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
wrote:
>
> I've run checkIndex against the index and the results are below. The net is
> that it's telling me nothing is wrong with the index.
Thanks.
> I did not have any instrumentation around the opening of the IndexSearcher
> (we don't use
I've run checkIndex against the index and the results are below. The net is
that it's telling me nothing is wrong with the index.
I did not have any instrumentation around the opening of the IndexSearcher (we
don't use an IndexReader), just around the actual query execution, so I had to
add so
I'd even offer, if the index is small, perhaps you can post it
somewhere for us to download and debug trace commit()…
Also, though not very scientific, you can turn on debug messages by
setting an infoStream and observing which prints take the longest to appear.
Not very accurate but if there's one oper
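For example (a sketch; indexWriter is whatever writer instance you already have):

    // Route IndexWriter's internal diagnostics to stdout (any PrintStream works);
    // flush, merge and commit steps are then printed as they happen, so the slow
    // one is visible from the gaps between messages.
    indexWriter.setInfoStream(System.out);
    ...
    indexWriter.close();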
Can you run CheckIndex (command line tool) and post the output?
How long does it take to open a reader on this same index, and perform
a simple query (eg TermQuery)?
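For reference, the command-line invocation is roughly the following; the exact
jar name and paths depend on your setup:

    java -cp lucene-core-3.0.0.jar org.apache.lucene.index.CheckIndex /path/to/index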
Mike
On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson
wrote:
> I've successfully reproduced the issue in our lab with a copy from
> It turns out that prepareCommit() is the slow call here, taking several
> seconds to complete.
>
> I've done some reading about it, but have not found anything that might be
> helpful here. The fact that it is slow
> every single time, even when I'm adding exactly one document to the index,
I've successfully reproduced the issue in our lab with a copy from production
and have broken the close() call into parts, as suggested, with one addition.
Previously, the call was simply
...
} finally {
    // Close the writer unconditionally
    if (indexWriter != null) {
        indexWriter.close();
    }
}
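The "broken into parts" version looks roughly like this (a sketch -- the timing
and printing are illustrative, not our actual instrumentation):

    } finally {
        if (indexWriter != null) {
            long t0 = System.currentTimeMillis();
            indexWriter.prepareCommit();
            System.out.println("prepareCommit: " + (System.currentTimeMillis() - t0) + " ms");

            long t1 = System.currentTimeMillis();
            indexWriter.commit();
            System.out.println("commit: " + (System.currentTimeMillis() - t1) + " ms");

            long t2 = System.currentTimeMillis();
            indexWriter.close();
            System.out.println("close: " + (System.currentTimeMillis() - t2) + " ms");
        }
    }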
Wonderful information on what happens during indexWriter.close(), thank you
very much! I've got some testing to do as a result.
We are on Lucene 3.0.0 right now.
One other detail that I neglected to mention is that the batch size does not
seem to have any relation to the time it takes to close
When you close IndexWriter, it performs several operations that might have a
connection to the problem you describe:
* Commit all the pending updates -- if your update batch size is more or
less the same (i.e., comparable # of docs and total # bytes indexed), then
you should not see a performance
Hello,
One of our Lucene indexes has started misbehaving on indexWriter.close() and I'm
searching for ideas about what may have happened and how to fix it.
Here's our scenario:
- We have seven Lucene indexes that contain different sets of data from a web
application and are indexed for searching by