, "Poppe, Thomas (IP&Science)"
Subject: Re: CompiledAutomaton performance issue
This is just an optimization; maybe we should expose an option to disable
it?
Or maybe we can find the common suffix on an NFA instead, to avoid
determinization?
Can you open a Jira issue so we can discuss options?
Thanks,
Mike McCandless
http://blog.mikemccandless.com
On Fri, Dec 15, 2017 at
Hello,
We're using the automaton package as part of Elasticsearch for doing regexp
queries. Our business requires us to process rather complex regular
expressions, for example (we have more complex examples, but this one
illustrates the problem):
(¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,1
Hi Team,
I am a new user of Lucene 4.8.1. I encountered a Lucene indexing
performance issue which slows down my application greatly. I tried several
approaches from Google searches but still couldn't resolve it. Any suggestions
from you experts would help me a lot.
One of my applications uses the l
Does your index fit fully in system memory - the OS file cache? If not,
there could be a lot of thrashing (I/O) as Lucene accesses the index.
-- Jack Krupansky
-Original Message-
From: Liviu Matei
Sent: Monday, May 19, 2014 4:21 PM
To: java-user@lucene.apache.org
Subject: Performance
Hi,
In order to achieve a somewhat "smarter" search that also takes the context
into consideration, I decided to use PhraseQuery. Now I create ~100 phrase
queries from the input text, combine them into one query with a BooleanQuery,
and issue a search against the index.
Now if the index size is big
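A minimal sketch of the setup described above, assuming the Lucene 4.x-era API; the field name "content" and the phrase list are illustrative placeholders, not the poster's actual code:

  import java.util.Arrays;
  import java.util.List;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;

  // Phrases extracted from the input text (illustrative placeholders).
  List<String[]> extractedPhrases = Arrays.asList(
      new String[] {"quick", "brown", "fox"},
      new String[] {"lazy", "dog"});

  // Combine ~100 PhraseQuery clauses under a single BooleanQuery.
  BooleanQuery combined = new BooleanQuery();
  for (String[] phrase : extractedPhrases) {
      PhraseQuery pq = new PhraseQuery();
      for (String word : phrase) {
          pq.add(new Term("content", word));
      }
      pq.setSlop(0);                              // exact adjacency; raise for looser matching
      combined.add(pq, BooleanClause.Occur.SHOULD);
  }

BooleanQuery.getMaxClauseCount() defaults to 1024, so ~100 phrase clauses is within the limit; the cost comes from each PhraseQuery having to read and align term positions.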
Hello,
We have a technical issue with our usage of Lucene that leaves us puzzled about
its possible source.
To be specific: we have an application with good response times on search, but
after a certain amount of time, anywhere from a few hours to a few days, searches
that were taking a few hundred
>>> On Fri, Mar 15, 2013 at 7:36 PM, Lin Ma wrote:
>>>
>>>> Hi lukai, thanks for the reply. Do you mean WAND is a way to resolve
>>>> this issue? For "native support", do you mean there is no built-
I have implemented WAND with Solr/Lucene. So far there is no performance
issue. There is no native support for this functionality; you need to
implement it yourself.
On Fri, Mar 15, 2013 at 10:09 AM, Lin Ma wrote:
> Hello guys,
>
> Supposing I have one million documents, and each
Hello guys,
Supposing I have one million documents, and each document has hundreds of
features. A given query also has hundreds of features. I want to fetch the
most relevant top 1,000 documents by the dot product of the query and document
features (query/document features are in the same feat
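Outside of any Lucene API, the ranking arithmetic itself is simple; a hedged sketch in plain Java (the sparse-map representation and all names are assumptions) of keeping the top 1,000 documents by dot product with a bounded min-heap:

  import java.util.Comparator;
  import java.util.Map;
  import java.util.PriorityQueue;

  class TopKByDotProduct {
      // Sparse dot product over featureId -> weight maps.
      static double dot(Map<Integer, Float> query, Map<Integer, Float> doc) {
          double sum = 0;
          for (Map.Entry<Integer, Float> e : query.entrySet()) {
              Float w = doc.get(e.getKey());
              if (w != null) sum += e.getValue() * w;
          }
          return sum;
      }

      // Keep the k highest-scoring docs; the smallest kept score sits at the heap's head.
      static PriorityQueue<double[]> topK(Map<Integer, Float> query,
                                          Map<Integer, Map<Integer, Float>> docs, int k) {
          PriorityQueue<double[]> heap =
              new PriorityQueue<>(k, Comparator.comparingDouble((double[] a) -> a[0]));
          for (Map.Entry<Integer, Map<Integer, Float>> d : docs.entrySet()) {
              double score = dot(query, d.getValue());
              if (heap.size() < k) {
                  heap.offer(new double[] {score, d.getKey()});
              } else if (score > heap.peek()[0]) {
                  heap.poll();
                  heap.offer(new double[] {score, d.getKey()});
              }
          }
          return heap;                             // entries are {score, docId}
      }
  }

WAND's contribution is to skip most of these full dot products by using per-term score upper bounds; the brute-force loop above is only the baseline it improves on.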
I'm resurrecting this old thread because this issue is now reaching a
critical point for us and I'm going to have to modify the Lucene source code
for it to continue to work for us.
Just a quick refresher: we have one index with several hundred thousand
unique field names and found that opening an
I've tried the suggestion below, but it really doesn't seem to have any impact.
I guess that's not surprising since 80% of the CPU time when I ran hprof was in
String.intern(), not in the StringHelper class.
Clearly, if I'm going to hack things up at this level, I've got some work to
do, inclu
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson
wrote:
> Here's the changes I made to org.apache.lucene.util.StringHelper:
>
> //public static StringInterner interner = new SimpleStringInterner(1024,8);
As Mike said, the real fix for trunk is to get rid of interning.
But for your version, you
Also, you'd have to synchronize access to the HashMap.
But it is surprising that intern is such a performance hog that you
can shave ~7 seconds off IR init time.
We've talked about removing the interning of field names, especially
with flexible indexing (4.0) where fields and term text are now
I actually think that the main reason for interning the field names in
Lucene is for comparison purposes and not to guarantee uniqueness (though
you get both). You will see many places in Lucene's code where the field
name is compared using the != operator instead of equals().
BTW, in your patch abo
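A tiny, self-contained illustration of that point (plain JDK behavior, not Lucene code): strings with equal contents return the same object from intern(), so a reference check with == is enough.

  String a = new String("title").intern();
  String b = new String("title").intern();
  System.out.println(a == b);       // true: both refer to the single interned instance
  System.out.println(a.equals(b));  // also true, but == is just a pointer comparison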
My findings from the hprof results, which showed 80% of the CPU time being in
String.intern(), led me to do some reading about String.intern(), and what I
found surprised me.
First, there are some very strong feelings about String.intern() and its value.
For one, there is this guy
(http://www.codeinstruc
I finally bucked up and made the change to CheckIndex to verify that I do not,
in fact, have any fields with norms in this index. The result is below - the
largest segment currently is #3, which has 300,000+ fields but no norms.
-Mark
Segments file=segments_acew numSegments=9 version=FORMAT_DIAGN
Sure,
There is only one stack trace (that seems to be how the output for this tool
works) for java.lang.String.intern:
TRACE 300165:
java.lang.String.intern(:Unknown line)
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74)
org.apache.lucene.
Lucene interns field names... since you have a truly enormous number
of unique fields, it's expected that intern will be called a lot.
But that said, it's odd that it's this costly.
Can you post the stack traces that call intern?
Mike
On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
wrote:
> Hmm...
After a week away, I'm back and still working to get to the bottom of this
issue. We run Lucene from the binaries, so making changes to the source code is
not something we are really set up to do right now.
I have, however, created a trivial Java app that just opens an IndexReader for
our proble
Hmm...
So, I was going on this output from your CheckIndex:
test: field norms.OK [296713 fields]
But in fact I just looked, and that number is bogus -- it's always
equal to the total number of fields, not the number of fields with norms
enabled. I'll open an issue to fix this, but in the mean
While most of our Lucene indexes are used for more traditional searching, this
index in particular is used more like a reporting repository. Thus, we really
do need to have that many fields indexed and they do need to be broken out into
separate fields. There may be another way to structure the
Likely what happened is you had a bunch of smaller segments, and then
suddenly they got merged into that one big segment (_aiaz) in your
index.
The representation for norms in particular is not sparse, so this
means the size of the norms file for a given segment will be
number-of-unique-indexed-fi
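To make the scale concrete with illustrative numbers (not figures from this thread): a non-sparse norms encoding costs one byte per document per indexed field with norms, so a segment with 100,000 documents and 300,000 such fields would need roughly 300,000 x 100,000 bytes, about 30 GB, for norms alone. That is why confirming norms are disabled on these fields matters so much here.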
Yes, we do have a large number of unique field names in that index, because
they are driven by user-named fields in our application (with some cleaning to
remove illegal chars).
This slowness problem has appeared very suddenly in the last couple of weeks
and the number of unique field names has
On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
wrote:
>
> I've run checkIndex against the index and the results are below. The net is
> that it's telling me nothing is wrong with the index.
Thanks.
> I did not have any instrumentation around the opening of the IndexSearcher
> (we don't use
I've run checkIndex against the index and the results are below. The net is
that it's telling me nothing is wrong with the index.
I did not have any instrumentation around the opening of the IndexSearcher (we
don't use an IndexReader), just around the actual query execution so I had to
add so
I'd even offer, if the index is small, perhaps you can post it
somewhere for us to download and debug trace commit()…
Also, though not very scientific, you can turn on debug messages by
setting an infoStream and observing which prints take the longest to appear.
Not very accurate, but if there's one oper
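A hedged sketch of turning that on, assuming the Lucene 3.0-era IndexWriter API (the path and analyzer choice are illustrative):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("/path/to/index")),
      new StandardAnalyzer(Version.LUCENE_30),
      IndexWriter.MaxFieldLength.UNLIMITED);
  writer.setInfoStream(System.out);   // prints flush/merge/commit progress as it happens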
Can you run CheckIndex (command line tool) and post the output?
How long does it take to open a reader on this same index, and perform
a simple query (eg TermQuery)?
Mike
On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson
wrote:
> I've successfully reproduced the issue in our lab with a copy from
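For reference, CheckIndex has a command-line entry point (roughly: java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index), and the open-plus-simple-query timing Mike asks about takes only a few lines. A hedged sketch against the Lucene 3.0-era API, with an illustrative field and term:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;

  public class TimeOpenAndQuery {
      public static void main(String[] args) throws Exception {
          long t0 = System.currentTimeMillis();
          IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true); // read-only
          long t1 = System.currentTimeMillis();
          IndexSearcher searcher = new IndexSearcher(reader);
          TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
          long t2 = System.currentTimeMillis();
          System.out.println("open: " + (t1 - t0) + " ms, query: " + (t2 - t1)
              + " ms, hits: " + hits.totalHits);
          searcher.close();
          reader.close();
      }
  }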
> It turns out that the prepareCommit() is the slow call here, taking several
> seconds to complete.
>
> I've done some reading about it, but have not found anything that might be
> helpful here. The fact that it is slow
> every single time, even when I'm adding exactly one document to the index,
I've successfully reproduced the issue in our lab with a copy from production
and have broken the close() call into parts, as suggested, with one addition.
Previously, the call was simply
...
} finally {
// Close
if (indexWriter != null) {
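One way to break the shutdown into timed steps, as suggested in this thread; a hedged sketch assuming the Lucene 3.0-era IndexWriter API (variable names are illustrative, not the poster's code):

  long t0 = System.currentTimeMillis();
  indexWriter.prepareCommit();   // flushes buffered docs and fsyncs the new files
  long t1 = System.currentTimeMillis();
  indexWriter.commit();          // writes the new segments_N file, making changes visible
  long t2 = System.currentTimeMillis();
  indexWriter.close();           // releases the write lock; waits for running merges by default
  long t3 = System.currentTimeMillis();
  System.out.println("prepareCommit=" + (t1 - t0) + "ms commit=" + (t2 - t1)
      + "ms close=" + (t3 - t2) + "ms");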
Wonderful information on what happens during indexWriter.close(), thank you
very much! I've got some testing to do as a result.
We are on Lucene 3.0.0 right now.
One other detail that I neglected to mention is that the batch size does not
seem to have any relation to the time it takes to close
When you close IndexWriter, it performs several operations that might have a
connection to the problem you describe:
* Commit all the pending updates -- if your update batch size is more or
less the same (i.e., comparable # of docs and total # bytes indexed), then
you should not see a performance
Hello,
One of our Lucene indexes has started misbehaving on indexWriter.close and I'm
searching for ideas about what may have happened and how to fix it.
Here's our scenario:
- We have seven Lucene indexes that contain different sets of data from a web
application and are indexed for searching by
>>> Regards,
>>> Sourabh Mittal
>>> Morgan Stanley | IDEAS Practice Areas
>>> Manikchand Ikon | South Wing 18 | Dhole Patil Road
>>> Pune, 411001
>>> Phone: +91 20 2620-7053
>>> sourabh-931.mit...@morganstanley.com
Do you NEED to be using 7 fields here?
Like Erick said, if you could give us an example of the types of data
you are trying to search against, it would be quite helpful.
It's possible that you might be able to, say, collapse your 7 fields down
to a single field, which would likely reduce the ove
Prefix queries are expensive here. The problem is
that each one forms a very large OR clause on all
the terms that start with those two letters. For instance,
if a field in your index contained
mine
milanta
mica
a prefix search on "mi" would form
mine OR milanta OR mica.
Doing this across seven f
Can you give us more info on what they are searching for w/ 2 letter
searches? Typically, prefix queries that short are going to have a
lot of terms to match. You might try having a field that you index
using a variation of ngrams that are anchored at the first character.
For example, en
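A hedged sketch of that approach, assuming the EdgeNGramTokenFilter from the contrib analyzers of that era; the tokenization and gram sizes are illustrative:

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

  // Index-time analyzer that emits leading-edge ngrams ("h", "ho", "hol", ...) for each
  // token, so a 2-letter search becomes a cheap TermQuery on this field instead of a
  // PrefixQuery expanded over thousands of matching terms.
  public class EdgePrefixAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = new LowerCaseFilter(new WhitespaceTokenizer(reader));
          return new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, 1, 10);
      }
  }

At query time the user's lowercased two letters are then looked up as a single term against this ngram field.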
Hi All,
We face serious performance issues when users do 2-letter searches, e.g. ho,
jo, pa ma, um ar, ma fi, etc. The time taken is between 10 and 15 secs.
Below is our implementation details:
1. Search performs on 7 fields.
2. PrefixQuery implementation on all fields
3. AND search.
4. Our indexer size is
@Erick: Yes I changed the default field, it is "bagofwords" now.
@Ian: Yes both indexes were optimized, and I didn't do any deletions.
version 2.4.0
I'll repeat the experiment, just to be sure.
Meanwhile, do you have any documentation on Lucene fields? What I need to know
is how Lucene is storing field
> ...
> I can say for sure that multiple copies are not indexed. But the number of
> fields across which the text is divided is large. Can that be a reason?
Not for that amount of difference. You may be sure that you are not
indexing multiple copies, but I'm not. Convince me - create 2 new
indexes via the
Note that your two queries are different unless you've
changed the default operator.
Also, your bagOfWords query is searching across your
default field for the second two terms.
Your bagOfWords is really something like
bagOfWords:Alexander OR :history OR :Macedon.
Best
Erick
On Wed, Jan 21, 20
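To see the parsing difference concretely, a hedged sketch against the Lucene 2.4-era QueryParser (the default field name "content" and the analyzer are assumptions; parse() throws ParseException):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;

  QueryParser qp = new QueryParser("content", new StandardAnalyzer());
  System.out.println(qp.parse("bagOfWords: Alexander history Macedon"));
  // -> bagOfWords:alexander content:history content:macedon   (only the first term targets bagOfWords)
  System.out.println(qp.parse("name:Alexander AND domain:history AND first_sentence:Macedon"));
  // -> +name:alexander +domain:history +first_sentence:macedon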
I agree with Ian that these times sound way too high. I'd
also ask whether you fire a few warmup searches at your
server before measuring the increased time, you might
just be seeing the cache being populated.
Best
Erick
On Wed, Jan 21, 2009 at 10:42 AM, Ian Lea wrote:
> Hi
>
>
> Space: 700Mb v
Hi,
thanks for the reply.
For the document, see my last mail.
multifieldQuery:
name: Alexander AND domain: history AND first_sentence: Macedon
Single field query:
bagOfWords: Alexander history Macedon
I can say for sure that multiple copies are not indexed. But the number of
fields across which the text
Hi
Space: 700Mb vs 4.5Gb sounds way too big a difference. Are you sure
you aren't loading multiple copies of the data or something like that?
Queries: a 20 times slowdown for a multi field query also sounds way
too big. What do the simple and multi field queries look like?
--
Ian.
On Wed,
Hi,
I've indexed around half a million XML documents. Here is the document
sample:
cogito:Name
Alexander the Great
cogito:domain
ancient history
cogito:first_sentence
Alexander the Great (Greek: or Megas Alexandros; July 20 356 BC June 10 323
BC), also known as Alexander III
spinergywmy wrote:
I have posted this question before and this time I found that it could be
pdfbox problem and this pdfbox I downloaded doesn't use the log4j.jar. To
index the app 2.13mb pdf file took me 17s and total time to upload a file is
18s.
Re: PDFBox.
I have a 2.5Mb test file that
Grant Ingersoll wrote:
On Nov 30, 2006, at 10:54 AM, spinergywmy wrote:
For my scenario, every time a user uploads a single file, I
need to index that particular file. Previously it was because the previous
version of PDFBox integrated with the log4j.jar file and I believe it is the
log4j.j
On Nov 30, 2006, at 10:54 AM, spinergywmy wrote:
Hi Grant,
Thanks for the tips. I will take your advice and look into the
link that you sent me.
For my scenario, every time a user uploads a single
file, I need to index that particular file. Previously it was because the
f I'm wrong.
Thanks
regards,
Wooi Meng
re any way or other software than PDFBox to solve the
performance issue.
Thanks.
regards,
Wooi Meng
spinergywmy wrote:
Hi,
I am having a performance issue indexing PDF files. It took me more
than 10 sec to index a PDF file of about 200 KB. Is it because I only have a
segment file? How can I make the indexing performance better?
If you're using the log4j PDFBox jar file, you must make
wrote:
> I am having a performance issue indexing PDF files. It took me more
> than 10 sec to index a PDF file of about 200 KB. Is it because I only have a
> segment file? How can I make the indexing performance better?
PDFBox (which I assume you are using) can be quite slow converting lar
Hi,
I am having a performance issue indexing PDF files. It took me more
than 10 sec to index a PDF file of about 200 KB. Is it because I only have a
segment file? How can I make the indexing performance better?
Thanks
regards,
Wooi Meng
Is this markedly faster than using an MMapDirectory? Copying all this
data into the Java heap (as RAMDirectory does) puts a tremendous burden
on the garbage collector. MMapDirectory should be nearly as fast, but
keeps the index out of the Java heap.
Doug
z shalev wrote:
I've rewritten
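For comparison, a hedged sketch of Doug's suggestion using a later-era API (Lucene 2.9+/3.x; the path is illustrative): the OS page cache holds the 3 GB index, and nothing is copied onto the Java heap.

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.MMapDirectory;

  MMapDirectory dir = new MMapDirectory(new File("/data/index"));  // memory-mapped, off-heap
  IndexReader reader = IndexReader.open(dir, true);                // read-only reader
  IndexSearcher searcher = new IndexSearcher(reader);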
I've rewritten the RAM dir to support 64 bit (still haven't had time to add this
to Lucene, hopefully in the coming months when I have a free second).
My question:
I have a machine with 4 GB RAM.
I have a 3 GB index file.
I successfully load the 3 GB index into memory,
the
Hi,
I think a Span query in general should do more work than a simple Phrase
query. A Phrase query, in its simplest form, just has to find all the
terms that are adjacent to each other. Meanwhile, a Span query's terms do not
necessarily have to be adjacent; there can be other words in between.
Therefore, I
Hi,
I'm comparing SpanNearQuery to PhraseQuery results and noticing about
an 8x difference on Linux. Is a SpanNearQuery doing 8x as much work?
I'm considering diving into the code if these results sound unusual to people.
But if it's really doing that much more work, I won't spend time optimiz
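For anyone reproducing the comparison, a hedged sketch of the two query shapes being measured (the field and terms are illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  PhraseQuery phrase = new PhraseQuery();
  phrase.add(new Term("body", "new"));
  phrase.add(new Term("body", "york"));

  SpanNearQuery near = new SpanNearQuery(
      new SpanQuery[] {
          new SpanTermQuery(new Term("body", "new")),
          new SpanTermQuery(new Term("body", "york")) },
      0,       // slop: no terms allowed in between
      true);   // require the terms in this order

Roughly speaking, the span version tracks the start and end position of every candidate span so it can be composed with other span queries, which is bookkeeping a plain PhraseQuery avoids.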