Re: CompiledAutomaton performance issue

2017-12-18 Thread
, "Poppe, Thomas (IP&Science)" Subject: Re: CompiledAutomaton performance issue This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? Can you open a Jira issue so we can d

Re: CompiledAutomaton performance issue

2017-12-17 Thread Michael McCandless
This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? Can you open a Jira issue so we can discuss options? Thanks, Mike McCandless http://blog.mikemccandless.com On Fri, Dec 15, 2017 at

CompiledAutomaton performance issue

2017-12-15 Thread
Hello, We're using the automaton package as part of Elasticsearch for doing regexp queries. Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem): (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,1
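For context, a minimal sketch (not from the thread) of how such a pattern reaches CompiledAutomaton through Lucene's automaton package. The pattern below is a simplified stand-in that uses 'x' in place of the poster's delimiter character, and the class names follow the Lucene 6/7-era API:

    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.CompiledAutomaton;
    import org.apache.lucene.util.automaton.RegExp;

    public class RegexpAutomatonDemo {
        public static void main(String[] args) {
            // Counted repetitions like {1,10} are what make the automaton large;
            // this is a simplified stand-in for the poster's pattern.
            Automaton a = new RegExp("(x.)*(x?[^x]){1,10}ab(x.)*").toAutomaton();
            // CompiledAutomaton determinizes the automaton and, for infinite
            // languages, computes the common suffix -- the step this thread
            // identifies as expensive.
            CompiledAutomaton compiled = new CompiledAutomaton(a);
            System.out.println("automaton states: " + a.getNumStates());
        }
    }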

Lucene Indexing performance issue

2014-10-22 Thread Jason Wu
Hi Team, I am a new user of Lucene 4.8.1. I encountered a Lucene indexing performance issue which slows down my application greatly. I tried several approaches from Google searches but still couldn't resolve it. Any suggestions from the experts here would help me a lot. One of my applications uses the l

Re: Performance issue when using multiple PhraseQueries against a 1+ million entries index

2014-05-19 Thread Liviu Matei
thrashing (I/O) as Lucene accesses the index. > > -- Jack Krupansky > > -Original Message- From: Liviu Matei > Sent: Monday, May 19, 2014 4:21 PM > To: java-user@lucene.apache.org > Subject: Performance issue when using multiple PhraseQueries against a 1+ > million

Re: Performance issue when using multiple PhraseQueries against a 1+ million entries index

2014-05-19 Thread Jack Krupansky
Does your index fit fully in system memory - the OS file cache? If not, there could be a lot of thrashing (I/O) as Lucene accesses the index. -- Jack Krupansky -Original Message- From: Liviu Matei Sent: Monday, May 19, 2014 4:21 PM To: java-user@lucene.apache.org Subject: Performance

Performance issue when using multiple PhraseQueries against a 1+ million entries index

2014-05-19 Thread Liviu Matei
Hi, In order to achieve a somewhat "smarter" search that also takes the context into consideration, I decided to use PhraseQuery. I create ~100 phrase queries from the input text, combine them with a boolean query into one query, and issue a search against the index. Now if the index size is big
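As an aside, a minimal sketch (assumed field name and phrases; Lucene 4.x-era mutable query API to match the timeframe) of combining many PhraseQuerys under one BooleanQuery, as described above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;

    public class PhraseCombinerDemo {
        public static void main(String[] args) {
            String[][] phrases = { {"quick", "brown", "fox"}, {"lazy", "dog"} };
            BooleanQuery bq = new BooleanQuery();           // pre-5.0 mutable API
            for (String[] phrase : phrases) {
                PhraseQuery pq = new PhraseQuery();         // pre-5.0 mutable API
                for (String word : phrase) {
                    pq.add(new Term("body", word));         // "body" is a made-up field
                }
                pq.setSlop(0);                              // exact phrase
                bq.add(pq, BooleanClause.Occur.SHOULD);     // any one phrase may match
            }
            System.out.println(bq);                         // pass bq to IndexSearcher.search(...)
        }
    }

Note that BooleanQuery's default maxClauseCount is 1024, so ~100 phrase clauses fit comfortably; the cost is in evaluating the positional clauses themselves.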

Performance issue with Lucene 3.6.2

2014-04-24 Thread Yannick Gérault
Hello, We have a technical issue with our usage of Lucene that leaves us puzzled about its possible source. To specify the issue: we have an application with good response times on search, but after a certain amount of time, from a few hours to a few days, the searches that were taking a few hundred

Re: potential query performance issue

2013-03-16 Thread Lin Ma
;>> >>> >>> >>> On Fri, Mar 15, 2013 at 7:36 PM, Lin Ma wrote: >>> >>>> Hi lukai, thanks for the reply. Do you mean WAND is a way to resolve >>>> this issue? For "native support", do you mean there is no built-

Re: potential query performance issue

2013-03-15 Thread Lin Ma
On Mar 16, 2013 at 2:49 AM, lukai wrote: > I had implemented WAND with Solr/Lucene. So far there is no performance > issue. There is no native support for this functionality; you need to > implement it yourself. > > On Fri, Mar 15, 2013 at 10:09 AM, Lin Ma wrote: > > > He

Re: potential query performance issue

2013-03-15 Thread lukai
I had implemented WAND with Solr/Lucene. So far there is no performance issue. There is no native support for this functionality; you need to implement it yourself. On Fri, Mar 15, 2013 at 10:09 AM, Lin Ma wrote: > Hello guys, > > Supposing I have one million documents, and each

potential query performance issue

2013-03-15 Thread Lin Ma
Hello guys, Supposing I have one million documents, and each document has hundreds of features. A given query also has hundreds of features. I want to fetch the most relevant top 1000 documents by the dot product of the query and document features (query/document features are in the same feat
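For illustration only, a plain-Java sketch (no Lucene API) of the brute-force version of this: score every document by the dot product of sparse feature vectors and keep the best k with a bounded heap. The WAND discussion in the replies is about avoiding scoring every document.

    import java.util.Map;
    import java.util.PriorityQueue;

    public class TopKDotProduct {
        /** Sparse dot product of query and document feature vectors. */
        static double dot(Map<Integer, Double> q, Map<Integer, Double> d) {
            double s = 0;
            for (Map.Entry<Integer, Double> e : q.entrySet()) {
                Double dv = d.get(e.getKey());
                if (dv != null) s += e.getValue() * dv;
            }
            return s;
        }

        /** Keep the k best (docId, score) pairs; the lowest score sits at the heap's head. */
        static PriorityQueue<double[]> topK(Map<Integer, Double> query,
                                            Map<Integer, Double>[] docs, int k) {
            PriorityQueue<double[]> heap =
                    new PriorityQueue<double[]>((a, b) -> Double.compare(a[1], b[1]));
            for (int docId = 0; docId < docs.length; docId++) {
                double score = dot(query, docs[docId]);
                if (heap.size() < k) {
                    heap.add(new double[] {docId, score});
                } else if (score > heap.peek()[1]) {
                    heap.poll();
                    heap.add(new double[] {docId, score});
                }
            }
            return heap;
        }
    }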

Re: IndexWriter.close() performance issue

2011-02-22 Thread Mark Kristensson
I'm resurrecting this old thread because this issue is now reaching a critical point for us and I'm going to have to modify the Lucene source code for it to continue to work for us. Just a quick refresher: we have one index with several hundred thousand unique field names and found that opening an

Re: IndexWriter.close() performance issue

2010-11-23 Thread Mark Kristensson
I've tried the suggestion below, but it really doesn't seem to have any impact. I guess that's not surprising since 80% of the CPU time when I ran hprof was in String.intern(), not in the StringHelper class. Clearly, if I'm going to hack things up at this level, I've got some work to do, inclu

Re: IndexWriter.close() performance issue

2010-11-20 Thread Yonik Seeley
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson wrote: > Here's the changes I made to org.apache.lucene.util.StringHelper: > >  //public static StringInterner interner = new SimpleStringInterner(1024,8); As Mike said, the real fix for trunk is to get rid of interning. But for your version, you

Re: IndexWriter.close() performance issue

2010-11-20 Thread Michael McCandless
Also, you'd have to synchronize access to the HashMap. But it is surprising intern is this much of a performance hog that you can shave ~7 seconds of IR init time. We've talked about removing the interning of field names, especially with flexible indexing (4.0) where fields and term text are now
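For reference, a minimal sketch (not Lucene's SimpleStringInterner, just the alternative being discussed) of a map-based interner that avoids String.intern() while staying thread-safe, using ConcurrentHashMap instead of synchronizing a plain HashMap:

    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical interner: thread-safe and unbounded, unlike the fixed-size SimpleStringInterner. */
    public class MapStringInterner {
        private final ConcurrentHashMap<String, String> map =
                new ConcurrentHashMap<String, String>();

        public String intern(String s) {
            String prev = map.putIfAbsent(s, s);   // atomic, no explicit locking needed
            return prev != null ? prev : s;
        }
    }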

Re: IndexWriter.close() performance issue

2010-11-19 Thread Shai Erera
I actually think that the main reason for interning the field names in Lucene is for comparison purposes and not to guarantee uniqueness (though you get both). You will see many places in Lucene's code where the field name is compared using the != operator instead of equals(). BTW, in your patch abo

Re: IndexWriter.close() performance issue

2010-11-19 Thread Mark Kristensson
My findings from the hprof results, which showed 80% of the CPU time being in String.intern(), led me to do some reading about String.intern(), and what I found surprised me. There are some very strong feelings about String.intern() and its value. First, there is this guy (http://www.codeinstruc

Re: IndexWriter.close() performance issue

2010-11-18 Thread Mark Kristensson
I finally bucked up and made the change to CheckIndex to verify that I do not, in fact, have any fields with norms in this index. The result is below - the largest segment currently is #3, which has 300,000+ fields but no norms. -Mark Segments file=segments_acew numSegments=9 version=FORMAT_DIAGN

Re: IndexWriter.close() performance issue

2010-11-17 Thread Mark Kristensson
Sure, There is only one stack trace (that seems to be how the output for this tool works) for java.lang.String.intern: TRACE 300165: java.lang.String.intern(:Unknown line) org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74) org.apache.lucene.

Re: IndexWriter.close() performance issue

2010-11-17 Thread Michael McCandless
Lucene interns field names... since you have a truly enormous number of unique fields it's expected intern will be called a lot. But that said, it's odd that it's this costly. Can you post the stack traces that call intern? Mike On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless wrote: > Hmm...

Re: IndexWriter.close() performance issue

2010-11-17 Thread Mark Kristensson
After a week away, I'm back and still working to get to the bottom of this issue. We run Lucene from the binaries, so making changes to the source code is not something we are really setup to do right now. I have, however, created a trivial Java app that just opens an IndexReader for our proble

Re: IndexWriter.close() performance issue

2010-11-05 Thread Michael McCandless
Hmm... So, I was going on this output from your CheckIndex: test: field norms.OK [296713 fields] But in fact I just looked and that number is bogus -- it's always equal to total number of fields, not number of fields with norms enabled. I'll open an issue to fix this, but in the mean

Re: IndexWriter.close() performance issue

2010-11-05 Thread Mark Kristensson
While most of our Lucene indexes are used for more traditional searching, this index in particular is used more like a reporting repository. Thus, we really do need to have that many fields indexed and they do need to be broken out into separate fields. There may be another way to structure the

Re: IndexWriter.close() performance issue

2010-11-04 Thread Michael McCandless
Likely what happened is you had a bunch of smaller segments, and then suddenly they got merged into that one big segment (_aiaz) in your index. The representation for norms in particular is not sparse, so this means the size of the norms file for a given segment will be number-of-unique-indexed-fi
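As a rough worked example of that non-sparse representation (the per-segment document count here is assumed, not taken from the thread): with roughly one byte of norm per indexed field per document, a segment with 296,713 unique indexed fields and 100,000 documents would need about 296,713 x 100,000 = ~29.7 GB just for norms, which is why a merge into one big segment can suddenly make commit/close very slow if those fields actually have norms enabled.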

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
Yes, we do have a large number of unique field names in that index, because they are driven by user named fields in our application (with some cleaning to remove illegal chars). This slowness problem has appeared very suddenly in the last couple of weeks and the number of unique field names has

Re: IndexWriter.close() performance issue

2010-11-03 Thread Michael McCandless
On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson wrote: > > I've run checkIndex against the index and the results are below. The net is > that it's telling me nothing is wrong with the index. Thanks. > I did not have any instrumentation around the opening of the IndexSearcher > (we don't use

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
I've run checkIndex against the index and the results are below. The net is that it's telling me nothing is wrong with the index. I did not have any instrumentation around the opening of the IndexSearcher (we don't use an IndexReader), just around the actual query execution, so I had to add so

Re: IndexWriter.close() performance issue

2010-11-03 Thread Shai Erera
I'd even offer: if the index is small, perhaps you can post it somewhere for us to download and debug/trace commit()… Also, though not very scientific, you can turn on debug messages by setting an infoStream and observe which prints take the most time to appear. Not very accurate, but if there's one oper
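A minimal sketch of the infoStream suggestion against the Lucene 3.0-era API mentioned later in the thread (newer releases configure this through IndexWriterConfig instead); the log path is made up:

    import java.io.FileOutputStream;
    import java.io.PrintStream;
    import org.apache.lucene.index.IndexWriter;

    public class InfoStreamDemo {
        /** Attach a debug log to an existing writer so slow phases show up by timestamp. */
        static void enableInfoStream(IndexWriter writer, String logPath) throws Exception {
            PrintStream infoStream = new PrintStream(new FileOutputStream(logPath), true);
            writer.setInfoStream(infoStream);   // flushes, merges and commit phases get logged
        }
    }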

Re: IndexWriter.close() performance issue

2010-11-03 Thread Michael McCandless
Can you run CheckIndex (command line tool) and post the output? How long does it take to open a reader on this same index, and perform a simple query (eg TermQuery)? Mike On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson wrote: > I've successfully reproduced the issue in our lab with a copy from
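CheckIndex ships in lucene-core and has a command-line main(); a typical invocation (jar name and index path are placeholders) looks like:

    java -cp lucene-core-3.0.0.jar org.apache.lucene.index.CheckIndex /path/to/index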

Re: IndexWriter.close() performance issue

2010-11-03 Thread Yonik Seeley
> It turns out that the prepareCommit() is the slow call here, taking several > seconds to complete. > > I've done some reading about it, but have not found anything that might be > helpful here. The fact that it is slow > every single time, even when I'm adding exactly one document to the index,

Re: IndexWriter.close() performance issue

2010-11-03 Thread Mark Kristensson
I've successfully reproduced the issue in our lab with a copy from production and have broken the close() call into parts, as suggested, with one addition. Previously, the call was simply ... } finally { // Close if (indexWriter != null) {
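A sketch of that kind of split (assumed names; Lucene 3.0-era API), timing each phase separately so the slow one stands out:

    import org.apache.lucene.index.IndexWriter;

    public class TimedClose {
        /** Replaces the bare indexWriter.close() inside the finally block. */
        static void closeWithTimings(IndexWriter indexWriter) throws Exception {
            if (indexWriter != null) {
                long t0 = System.currentTimeMillis();
                indexWriter.prepareCommit();    // phase 1 of the two-phase commit
                long t1 = System.currentTimeMillis();
                indexWriter.commit();           // phase 2: make the changes visible
                long t2 = System.currentTimeMillis();
                indexWriter.close();            // release files and the write lock
                long t3 = System.currentTimeMillis();
                System.out.println("prepareCommit=" + (t1 - t0) + "ms commit="
                        + (t2 - t1) + "ms close=" + (t3 - t2) + "ms");
            }
        }
    }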

Re: IndexWriter.close() performance issue

2010-11-02 Thread Mark Kristensson
Wonderful information on what happens during indexWriter.close(), thank you very much! I've got some testing to do as a result. We are on Lucene 3.0.0 right now. One other detail that I neglected to mention is that the batch size does not seem to have any relation to the time it takes to close

Re: IndexWriter.close() performance issue

2010-11-02 Thread Shai Erera
When you close IndexWriter, it performs several operations that might have a connection to the problem you describe: * Commit all the pending updates -- if your update batch size is more or less the same (i.e., comparable # of docs and total # bytes indexed), then you should not see a performance

IndexWriter.close() performance issue

2010-11-01 Thread Mark Kristensson
Hello, One of our Lucene indexes has started misbehaving on indexWriter.close and I'm searching for ideas about what may have happened and how to fix it. Here's our scenario: - We have seven Lucene indexes that contain different sets of data from a web application and are indexed for searching by

Re: Performance issue

2009-02-03 Thread mittals
t;>> Regards, >>> Sourabh Mittal >>> Morgan Stanley | IDEAS Practice Areas >>> Manikchand Ikon | South Wing 18 | Dhole Patil Road >>> Pune, 411001 >>> Phone: +91 20 2620-7053 >>> sourabh-931.mit...@morganstanley.com >>> >>> >>> >>>

Re: Performance issue

2009-02-02 Thread Matthew Hall
Do you NEED to be using 7 fields here? Like Erick said, if you could give us an example of the types of data you are trying to search against, it would be quite helpful. It's possible that you might be able to, say, collapse your 7 fields down to a single field, which would likely reduce the ove

Re: Performance issue

2009-02-02 Thread Erick Erickson
Prefix queries are expensive here. The problem is that each one forms a very large OR clause on all the terms that start with those two letters. For instance, if a field in your index contained mine, milanta, mica, a prefix search on "mi" would form mine OR milanta OR mica. Doing this across seven f

Re: Performance issue

2009-02-02 Thread Grant Ingersoll
Can you give us more info on what they are searching for w/ 2 letter searches? Typically, prefix queries that short are going to have a lot of terms to match. You might try having a field that you index using a variation of ngrams that are anchored at the first character. For example, en
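One way to realize the anchored-ngram idea without any contrib classes, shown here as a hedged sketch (the helper, field name and length cap are inventions for illustration): index every leading prefix of each term into a separate whitespace-analyzed field, so a two-letter search becomes a single TermQuery instead of a PrefixQuery expansion.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    public class EdgePrefixDemo {
        /** "house" -> "h ho hou hous house"; index this string into a prefix field. */
        static String leadingPrefixes(String term, int maxLen) {
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= Math.min(maxLen, term.length()); i++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(term, 0, i);
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(leadingPrefixes("house", 5));
            // At search time "ho" is a single term lookup -- no OR expansion:
            TermQuery q = new TermQuery(new Term("name_prefix", "ho"));  // hypothetical field
            System.out.println(q);
        }
    }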

Performance issue

2009-02-02 Thread Mittal, Sourabh (IDEAS)
Hi All, We face serious performance issues when users do a 2-letter search, e.g. ho, jo, pa ma, um ar, ma fi, etc. The time taken is between 10 and 15 secs. Below are our implementation details: 1. Search is performed on 7 fields. 2. PrefixQuery implementation on all fields. 3. AND search. 4. Our index size is

Re: Lucene Performance issue

2009-01-21 Thread Anshul jain
@Erick: Yes, I changed the default field; it is "bagofwords" now. @Ian: Yes, both indexes were optimized, and I didn't do any deletions. Version 2.4.0. I'll repeat the experiment, just to be sure. Meanwhile, do you have any documentation on Lucene fields? What I need to know is how Lucene stores field

Re: Lucene Performance issue

2009-01-21 Thread Ian Lea
> ... > I can for sure say that multiple copies are not index. But the number of > fields in which text is divided are many. Can that be a reason? Not for that amount of difference. You may be sure that you are not indexing multiple copies, but I'm not. Convince me - create 2 new indexes via the

Re: Lucene Performance issue

2009-01-21 Thread Erick Erickson
Note that your two queries are different unless you've changed the default operator. Also, your bagOfWords query is searching across your default field for the second two terms. Your bagOfWords is really something like bagOfWords:Alexander OR :history OR :Macedon. Best Erick On Wed, Jan 21, 20
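To make the parsing difference concrete, a small sketch using the pre-2.9 QueryParser constructor to match this era; the field names follow the poster's example and the analyzer choice is an assumption:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class DefaultFieldDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("bagOfWords", new StandardAnalyzer());
            // With the default OR operator, unqualified terms all hit the default field:
            Query q1 = parser.parse("Alexander history Macedon");
            // The multi-field variant names each field explicitly:
            Query q2 = parser.parse("name:Alexander AND domain:history AND first_sentence:Macedon");
            System.out.println(q1);  // bagOfWords:alexander bagOfWords:history bagOfWords:macedon
            System.out.println(q2);  // +name:alexander +domain:history +first_sentence:macedon
        }
    }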

Re: Lucene Performance issue

2009-01-21 Thread Erick Erickson
I agree with Ian that these times sound way too high. I'd also ask whether you fire a few warmup searches at your server before measuring the increased time, you might just be seeing the cache being populated. Best Erick On Wed, Jan 21, 2009 at 10:42 AM, Ian Lea wrote: > Hi > > > Space: 700Mb v

Re: Lucene Performance issue

2009-01-21 Thread Anshul jain
Hi, thanks for the reply. For the document in my last mail: multifieldQuery: name: Alexander AND domain: history AND first_sentence: Macedon; single field query: bagOfWords: Alexander history Macedon. I can say for sure that multiple copies are not indexed. But the number of fields in which text

Re: Lucene Performance issue

2009-01-21 Thread Ian Lea
Hi Space: 700Mb vs 4.5Gb sounds way too big a difference. Are you sure you aren't loading multiple copies of the data or something like that? Queries: a 20 times slowdown for a multi field query also sounds way too big. What do the simple and multi field queries look like? -- Ian. On Wed,

Lucene Performance issue

2009-01-21 Thread Anshul jain
Hi, I've indexed around half a million XML documents. Here is the document sample: cogito:Name Alexander the Great cogito:domain ancient history cogito:first_sentence Alexander the Great (Greek: or Megas Alexandros; July 20 356 BC June 10 323 BC), also known as Alexander III

Re: indexing performance issue

2006-11-30 Thread Antony Bowesman
spinergywmy wrote: I have posted this question before, and this time I found that it could be a PDFBox problem, and the PDFBox I downloaded doesn't use the log4j.jar. Indexing the approx. 2.13 MB PDF file took me 17s, and the total time to upload a file is 18s. Re: PDFBox. I have a 2.5 MB test file that

Re: indexing performance issue

2006-11-30 Thread Antony Bowesman
Grant Ingersoll wrote: On Nov 30, 2006, at 10:54 AM, spinergywmy wrote: For my scenario, every time a user uploads a single file, I need to index that particular file. Previously it was because the previous version of PDFBox integrated with the log4j.jar file, and I believe it is the log4j.j

Re: indexing performance issue

2006-11-30 Thread Grant Ingersoll
On Nov 30, 2006, at 10:54 AM, spinergywmy wrote: Hi Grant, Thanks for the tips. I will take your advice and look into the link that you sent me. For my scenario, every time a user uploads a single file, I need to index that particular file. Previously it was because the

Re: indexing performance issue

2006-11-30 Thread spinergywmy
if I'm wrong. Thanks, regards, Wooi Meng

Re: indexing performance issue

2006-11-30 Thread Grant Ingersoll
Is there any way or other software than PDFBox to solve the performance issue? Thanks. Regards, Wooi Meng

indexing performance issue

2006-11-30 Thread spinergywmy
than PDFBox to solve the performance issue. Thanks. Regards, Wooi Meng

Re: Indexing Performance issue

2006-11-16 Thread Antony Bowesman
spinergywmy wrote: Hi, I am having a performance issue indexing PDF files. It took me more than 10 sec to index a PDF file of about 200 KB. Is it because I only have a segment file? How can I make the indexing performance better? If you're using the log4j PDFBox jar file, you must make
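A minimal sketch of quieting that logging programmatically with the log4j API; the "org.pdfbox" logger name assumes the old PDFBox package layout of this era and may need adjusting:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class QuietPdfBox {
        public static void main(String[] args) {
            // PDFBox of this era logs heavily at DEBUG through log4j, which can
            // dominate parse time; raising the level removes that overhead.
            Logger.getLogger("org.pdfbox").setLevel(Level.ERROR);
        }
    }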

Re: Indexing Performance issue

2006-11-10 Thread Ioan Cocan
file performance issue. It took me more than 10 sec to index a PDF file of about 200 KB. Is it because I only have a segment file? How can I make the indexing performance better? Thanks, regards, Wooi Meng

Re: Indexing Performance issue

2006-11-10 Thread Erick Erickson
wrote: > I having this indexing the pdf file performance issue. It took me more > than 10 sec to index a pdf file about 200kb. Is it because I only have a > segment file? How can I make the indexing performance better? PDFBox (which I assume you are using) can be quite slow converting lar

Re: Indexing Performance issue

2006-11-10 Thread Daniel Naber
On Friday 10 November 2006 12:18, spinergywmy wrote: > I am having a performance issue indexing PDF files. It took me more > than 10 sec to index a PDF file of about 200 KB. Is it because I only have a > segment file? How can I make the indexing performance better? PDFBox (which I assum

Indexing Performance issue

2006-11-10 Thread spinergywmy
Hi, I am having a performance issue indexing PDF files. It took me more than 10 sec to index a PDF file of about 200 KB. Is it because I only have a segment file? How can I make the indexing performance better? Thanks, regards, Wooi Meng

Re: RAM Directory / querying Performance issue

2006-04-26 Thread Doug Cutting
Is this markedly faster than using an MMapDirectory? Copying all this data into the Java heap (as RAMDirectory does) puts a tremendous burden on the garbage collector. MMapDirectory should be nearly as fast, but keeps the index out of the Java heap. Doug z shalev wrote: I've rewritten
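For reference, a sketch of searching through MMapDirectory using a current Lucene API (class and method names differ in the 2006-era release discussed here, where the FSDirectory implementation was selected differently); the index path is a placeholder:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapSearchDemo {
        public static void main(String[] args) throws Exception {
            // The index lives in the OS page cache rather than the Java heap,
            // which is Doug's point about garbage-collection pressure.
            MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("maxDoc=" + reader.maxDoc());
            reader.close();
            dir.close();
        }
    }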

RAM Directory / querying Performance issue

2006-04-26 Thread zzzzz shalev
I've rewritten the RAM DIR to support 64 bit (still haven't had time to add this to Lucene, hopefully in the coming months when I have a free second). My question: I have a machine with 4 GB RAM and a 3 GB index file. I successfully load the 3 GB index into memory, the

Re: Span query performance issue

2005-06-25 Thread Paul Elschot
On Saturday 25 June 2005 04:26, jian chen wrote: > Hi, > > I think Span query in general should do more work than simple Phrase > query. Phrase query, in its simplest form, should just try to find all > terms that are adjacent to each other. Meanwhile, Span query does not > necessary be adjacent t

Re: Span query performance issue

2005-06-24 Thread jian chen
Hi, I think Span query in general should do more work than a simple Phrase query. Phrase query, in its simplest form, just tries to find all terms that are adjacent to each other. Meanwhile, Span query terms do not necessarily have to be adjacent to each other; they can have other words in between. Therefore, I
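To make the comparison concrete, a sketch constructing the two query types side by side (field and terms are made up; the mutable PhraseQuery API shown here matches pre-5.0 releases):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class PhraseVsSpanDemo {
        public static void main(String[] args) {
            // Sloppy phrase query: terms within 2 positions of each other.
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("body", "quick"));
            phrase.add(new Term("body", "fox"));
            phrase.setSlop(2);

            // Span query with the same slop: also matches with words in between,
            // but it tracks the actual match positions, which is extra work.
            SpanQuery[] clauses = {
                new SpanTermQuery(new Term("body", "quick")),
                new SpanTermQuery(new Term("body", "fox"))
            };
            SpanNearQuery spanNear = new SpanNearQuery(clauses, 2, true);  // inOrder = true

            System.out.println(phrase);
            System.out.println(spanNear);
        }
    }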

Span query performance issue

2005-06-24 Thread yahootintin . 11533894
Hi, I'm comparing SpanNearQuery to PhraseQuery results and noticing about an 8x difference on Linux. Is a SpanNearQuery doing 8x as much work? I'm considering diving into the code if the results sound unusual to people. But if it's really doing that much more work, I won't spend time optimiz