Re: Lucene performance benchmark | search throughput

2017-01-17 Thread Michael McCandless
In your 2nd test, the number of hits was still 25K, even though you added another 1M docs to the "general" data set? If not, then the query needed to do more work and will run slower. If so, the query still does need to do more work in order to skip over the "gold" documents: that skipping (the a

Re: Lucene performance benchmark | search throughput

2017-01-17 Thread Rajnish kamboj
Hi, We have modified our search query around the most restrictive dataset, and as expected the search performance increases. BUT if we increase the total data volume, our search performance decreases, despite the same query and restrictive dataset. Example: Total Dataset: 3 Million 25K

Re: Lucene performance benchmark | search throughput

2017-01-06 Thread Michael McCandless
The cost() method on DocIdSetIterator is responsible for telling BooleanQuery how costly that clause is, and how cost() is implemented varies by query. For the multi-term queries, like WildcardQuery, Lucene will first visit all matched terms (during the Query.rewrite phase), and rewrite the query
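
Editorial sketch of the idea described above: each clause reports an estimated cost, and the cheapest clause is the natural candidate to lead the iteration. The class and method names below (ClauseCostSketch, leadClause) are hypothetical and are not part of Lucene; the costs are assumed to be rough upper bounds on the number of documents each clause may visit.

```java
final class ClauseCostSketch {
    /**
     * Returns the index of the clause with the smallest estimated cost.
     * Assumption: costs[i] is an estimate of how many documents clause i
     * would have to visit, similar in spirit to what a cost() method reports.
     */
    static int leadClause(long[] costs) {
        if (costs.length == 0) {
            throw new IllegalArgumentException("at least one clause is required");
        }
        int lead = 0;
        for (int i = 1; i < costs.length; i++) {
            if (costs[i] < costs[lead]) {
                lead = i; // cheaper clause found, it becomes the new lead
            }
        }
        return lead;
    }
}
```

With estimated costs of 1,000,000, 25,000 and 400,000, the second clause (index 1) would be picked to lead.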

Re: Lucene performance benchmark | search throughput

2017-01-05 Thread Rajnish kamboj
OK, got it. One thing I still need to know (which is not clear to me): How does Lucene calculate the most restrictive clause? Correct me if I am wrong in my understanding (in abstract): 1. During indexing, Lucene keeps a document count for every indexed item. 2. During sear

Re: Lucene performance benchmark | search throughput

2017-01-03 Thread Michael McCandless
When you add MUST sub-clauses to a BooleanQuery (AND to the query parsers) it can make the search run faster because Lucene will take the most restrictive clause and use that to "drive" the iteration of matching documents to the other clauses, allowing those other clauses to iterate much faster th
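
Editorial sketch of the "most restrictive clause drives the iteration" idea, under simplifying assumptions: each clause is modeled as a sorted array of unique document IDs, and skipping is done with a binary search. LeapfrogSketch, advance and intersect are hypothetical names, and this is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LeapfrogSketch {
    /** Returns the first index >= from whose doc ID is >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        return idx >= 0 ? idx : -idx - 1; // a negative result encodes the insertion point
    }

    /**
     * Intersects two sorted doc ID lists. The (short) lead list drives the loop;
     * the other list only skips forward to each candidate instead of being
     * scanned document by document.
     */
    static List<Integer> intersect(int[] lead, int[] other) {
        List<Integer> hits = new ArrayList<>();
        int pos = 0;
        for (int doc : lead) {
            pos = advance(other, pos, doc);
            if (pos == other.length) {
                break; // the other clause is exhausted, no further matches possible
            }
            if (other[pos] == doc) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

The work is bounded by the size of the lead list times the cost of one skip, which is why a restrictive MUST clause can make the whole query cheaper. The skip itself still gets more expensive as the other list grows.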

Re: Lucene performance benchmark | search throughput

2017-01-03 Thread Rajnish kamboj
The answer is not clear. Suppose I have the following query and I want 10 records: Condition1 AND Condition2 AND Condition3. As per my understanding, Lucene will first evaluate all conditions separately and then merge the documents as per the AND/OR clauses. Finally, it will return 10 records to me. So, if I

Re: Lucene performance benchmark | search throughput

2017-01-03 Thread Michael Wilkowski
My guess: more conditions = fewer documents to score and sort to return. On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj wrote: > Hi > > Is there any Lucene performance benchmark against certain set of data? > [i.e Is there any stats for search throughput which Lucene can provide for > a certain d

Re: Lucene performance

2014-01-27 Thread Hamed Ghavamnia
Thanks, I've put some time checks on the different parts of my search, it seems like the directory opening part is taking most of the response time. I'm using MMapDirectory, but it doesn't seem to speed up my directory opening process. I've split my indexes during creation into different folders, a

Re: Lucene performance

2014-01-25 Thread Erick Erickson
You'll have to do some tuning with that kind of ingestion rate, and you're talking about a significant size cluster here. At 172M documents/day or so, you're not going to store very many days per node. Storing doesn't make much of any difference as far as search speed is concerned, the raw data is

Re: Lucene performance in 64 Bit

2012-03-01 Thread Ganesh
Thanks Li Li. Please share your experience in 64 bit. How big are your indexes? Regards Ganesh - Original Message - From: "Li Li" To: Sent: Thursday, March 01, 2012 3:03 PM Subject: Re: Lucene performance in 64 Bit >I think many users of lucene use large memory

Re: Lucene performance in 64 Bit

2012-03-01 Thread Li Li
I think many users of Lucene use large memory because a 32-bit system's memory is too limited (Windows 1.5GB, Linux 2-3GB). The only noticeable thing is *Compressed oops*. Some say it's useful, some not. You should give it a try. On Thu, Mar 1, 2012 at 4:59 PM, Ganesh wrote: > Hello all, > > Is

Re: Lucene performance: is search time linear to the index size?

2009-06-19 Thread Joel Halbert
> >> Sent: Thursday, June 18, 2009 12:44 AM > >> To: java-user@lucene.apache.org > >> Subject: Re: Lucene performance: is search time linear to the > >> index size? > >> > >> Opening a searcher and doing the first query incurs a > >> significan

RE: Lucene performance: is search time linear to the index size?

2009-06-18 Thread Teruhiko Kurosaka
> From: Jay Booth [mailto:jbo...@wgen.net] > Are you fetching all of the results for your search? No, I'm not doing anything on the search results. This is essentially what I do: searcher = new IndexSearcher(IndexReader.open(indexFileDir)); query = new TermQuery(new Term(fieldNam

Re: Lucene performance: is search time linear to the index size?

2009-06-18 Thread Yonik Seeley
ous clauses of the query. -Yonik http://www.lucidimagination.com > -kuro > >> -Original Message- >> From: Erick Erickson [mailto:erickerick...@gmail.com] >> Sent: Thursday, June 18, 2009 12:44 AM >> To: java-user@lucene.apache.org >> Subject: Re: Lucene p

RE: Lucene performance: is search time linear to the index size?

2009-06-18 Thread Jay Booth
disk, not the search time. -Original Message- From: Teruhiko Kurosaka [mailto:k...@basistech.com] Sent: Thursday, June 18, 2009 2:55 PM To: java-user@lucene.apache.org Subject: RE: Lucene performance: is search time linear to the index size? Erik, The way I test this program is by is

RE: Lucene performance: is search time linear to the index size?

2009-06-18 Thread Teruhiko Kurosaka
ne Document that can matches with a query, the search time remains constant no matter how large the index is. -kuro > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Thursday, June 18, 2009 12:44 AM > To: java-user@lucene.apache.org >

Re: Lucene performance: is search time linear to the index size?

2009-06-18 Thread Erick Erickson
Opening a searcher and doing the first query incurs a significant amount of overhead, cache loading, etc. Inferring search times relative to index size with a program like you describe is unreliable. Try firing a few queries at the index without measuring, *then* measure the time it takes for subs
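
Editorial sketch of the measurement approach suggested above: run the query a few times untimed, then time only the subsequent runs. WarmupTimingSketch and averageNanos are hypothetical names, and the Runnable stands in for whatever search call is being measured.

```java
final class WarmupTimingSketch {
    /**
     * Runs the query warmupRuns times without timing it (so caches and the JIT
     * can settle), then returns the average wall-clock nanoseconds per call
     * over measuredRuns timed calls.
     */
    static long averageNanos(Runnable query, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            query.run(); // untimed warm-up
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            query.run(); // timed run
        }
        return (System.nanoTime() - start) / measuredRuns;
    }
}
```

Reusing one already-open searcher inside the Runnable keeps the cost of opening the index out of the timed section.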

RE: Lucene performance: is search time linear to the index size?

2009-06-17 Thread Teruhiko Kurosaka
I've written a test program that uses the simplest form of search, TermQuery, and measures the time it takes to search for a term in a field on indices of various sizes. The result is a very linear growth of search time vs. the index size in terms of # of Documents, not # of unique terms in that field.

Re: Lucene performance: is search time linear to the index size?

2009-06-17 Thread Peter Keegan
:erickerick...@gmail.com] > > Sent: Wednesday, June 17, 2009 9:09 AM > > To: java-user@lucene.apache.org > > Subject: Re: Lucene performance: is search time linear to the > > index size? > > > > Are you measuring search time *only* or are you measuring > >

RE: Lucene performance: is search time linear to the index size?

2009-06-17 Thread Teruhiko Kurosaka
June 17, 2009 9:09 AM > To: java-user@lucene.apache.org > Subject: Re: Lucene performance: is search time linear to the > index size? > > Are you measuring search time *only* or are you measuring > total response time including assembling whatever you > assemble? If y

Re: Lucene performance: is search time linear to the index size?

2009-06-17 Thread Erick Erickson
Are you measuring search time *only* or are you measuring total response time including assembling whatever you assemble? If you're measuring total response time, everything from network latency to what you're doing with each hit may affect response time. This is especially true if you're iteratin

Re: Lucene performance: is search time linear to the index size?

2009-06-17 Thread Ian Lea
It depends on lots of things, but the time to execute a search would not typically grow linearly with the number of documents. But the time to retrieve data from all the hits might, if the number of hits is growing in line with the number of documents. Are you doing that by any chance, as opposed

Re: Lucene Performance issue

2009-01-21 Thread Anshul jain
@Erick: Yes I changed the default field, it is "bagofwords" now. @Ian: Yes both indexes were optimized, and I didn't do any deletions. version 2.4.0 I'll repeat the experiment, just to be sure. Meanwhile, do you have any documentation on Lucene fields? What I need to know is how Lucene is storing field

Re: Lucene Performance issue

2009-01-21 Thread Ian Lea
> ... > I can for sure say that multiple copies are not index. But the number of > fields in which text is divided are many. Can that be a reason? Not for that amount of difference. You may be sure that you are not indexing multiple copies, but I'm not. Convince me - create 2 new indexes via the

Re: Lucene Performance issue

2009-01-21 Thread Erick Erickson
Note that your two queries are different unless you've changed the default operator. Also, your bagOfWords query is searching across your default field for the second two terms. Your bagOfWords is really something like bagOfWords:Alexander OR :history OR :Macedon. Best Erick On Wed, Jan 21, 20

Re: Lucene Performance issue

2009-01-21 Thread Erick Erickson
I agree with Ian that these times sound way too high. I'd also ask whether you fire a few warmup searches at your server before measuring the increased time, you might just be seeing the cache being populated. Best Erick On Wed, Jan 21, 2009 at 10:42 AM, Ian Lea wrote: > Hi > > > Space: 700Mb v

Re: Lucene Performance issue

2009-01-21 Thread Anshul jain
Hi, thanks for the reply. For the document, in my last mail.. multifieldQuery: name: Alexander AND domain: history AND first_sentence: Macedon Single field query: bagOfWords: Alexander history Macedon I can for sure say that multiple copies are not indexed. But the number of fields in which text

Re: Lucene Performance issue

2009-01-21 Thread Ian Lea
Hi Space: 700Mb vs 4.5Gb sounds way too big a difference. Are you sure you aren't loading multiple copies of the data or something like that? Queries: a 20 times slowdown for a multi field query also sounds way too big. What do the simple and multi field queries look like? -- Ian. On Wed,

Re: Lucene performance issues..

2008-07-28 Thread Michael McCandless
:59pm To: java-user@lucene.apache.org Subject: Re: Lucene performance issues.. On Sonntag, 27. Juli 2008, Mazhar Lateef wrote: We have also tried upgrading the lucene version to 2.3 in hope to improve performance but the results were quite the opposite. but from my research on the internet the Lucene

Re: Lucene performance issues..

2008-07-28 Thread Toke Eskildsen
On Sun, 2008-07-27 at 21:38 +0100, Mazhar Lateef wrote: > * email searching > o We are creating very large indexes for emails we are > processing, the size is upto +150GB for indexes only (not > including data content), this we thought would improve > search

Re: Lucene performance issues..

2008-07-28 Thread ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
o change for a while. > > > -Original Message- > From: "Daniel Naber" <[EMAIL PROTECTED]> > Sent: Sunday, July 27, 2008 4:59pm > To: java-user@lucene.apache.org > Subject: Re: Lucene performance issues.. > > On Sonntag, 27. Juli 2008, Mazhar Lateef wrote:

Re: Lucene performance issues..

2008-07-27 Thread Stu Hood
nt: Sunday, July 27, 2008 4:59pm To: java-user@lucene.apache.org Subject: Re: Lucene performance issues.. On Sonntag, 27. Juli 2008, Mazhar Lateef wrote: > We have also tried upgrading the lucene version to 2.3 in hope to > improve performance but the results were quite the opposite. but fro

Re: Lucene performance issues..

2008-07-27 Thread Daniel Naber
On Sonntag, 27. Juli 2008, Mazhar Lateef wrote: > We have also tried upgrading the lucene version to 2.3 in hope to > improve performance but the results were quite the opposite. but from my > research on the internet the Lucene version 2.3 is much faster and > better so why are we seeing such inc

Re: Lucene performance: benchmarktemplate.xml

2008-04-18 Thread Glen Newton
Hi Anshum, A reasonable question. Answer: 64 bit architecture running 64 bit Java VM. It is great! :-) > Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode) > OS Version: Linux OpenSUSE 10.2 (64-bit X86-64) If you have any other questions, please let me know. :-) -Glen On 18/0

Re: Lucene performance: benchmarktemplate.xml

2008-04-17 Thread Anshum
Hi Glen, I am not too clear about it, but isn't there a limit to the memory consumption specified for the JVM? The limit being 1.3Gigs of resident and 2 Gigs of memory in all? You just mentioned the Memory consumption: -Xms4000m -Xmx6000m. Could someone please help me with the same. -- Anshum O

Re: Lucene performance: benchmarktemplate.xml

2008-04-16 Thread Glen Newton
On 16/04/2008, Michael McCandless <[EMAIL PROTECTED]> wrote: > These are great results! Thanks for posting. Thanks! > > I'd be curious if you'd get better indexing throughput by using a single > IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer, > instead of 8 IndexWriter

Re: Lucene performance: benchmarktemplate.xml

2008-04-16 Thread Michael McCandless
These are great results! Thanks for posting. I'd be curious if you'd get better indexing throughput by using a single IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer, instead of 8 IndexWriters that merge in the end. How long does that final merge take now? Also, 6

Re: Lucene performance: benchmarktemplate.xml

2008-04-16 Thread Glen Newton
Cass, Thanks for converting it. I've posted it to my blog: http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html Sorry for the XML tags: I guess I followed the instructions on the Lucene performance benchmarks page too literally ("Post these figures to the lucene-user maili

Re: Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Cass Costello
I just did that so I could read it. :) I'll leave it up until Glen resends or posts it somewhere... http://www.casscostello.com/?page_id=28 On Tue, Apr 15, 2008 at 5:18 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > Hi Glen. > can you resend this in plain text? > or put the HTML up on a server s

Re: Lucene performance: benchmarktemplate.xml

2008-04-15 Thread Ian Holsman
Hi Glen. can you resend this in plain text? or put the HTML up on a server somewhere and point to it with a brief summary in the post? I'd love to look and read it, all those tags are making me go blind. Glen Newton wrote: Hardware Environment Dedicated machine for indexing: yes CPU: D

Re: Lucene Performance

2008-01-28 Thread Thibaut Britz
Thanks for your answer, I will look into this in more detail. Paul Elschot wrote: > > On Friday 18 January 2008 17:52:27 Thibaut Britz wrote: >> >> Hi, >> > ... >> >> Another thing I noticed is that we append a lot of queries, so we have a >> lot >> of duplicate phrases like (A and B or C)

Re: Lucene Performance

2008-01-19 Thread Paul Elschot
On Friday 18 January 2008 17:52:27 Thibaut Britz wrote: > > Hi, > ... > > Another thing I noticed is that we append a lot of queries, so we have a lot > of duplicate phrases like (A and B or C) and ... and (A and B or C) (more > nested than that). Is lucene doing any internal query optimization

Re: lucene performance issues

2008-01-05 Thread Andrew Huntwork
Your grinder output seems to indicate clearly that your bottleneck is in your database code, not in lucene. It seems that the threads are all blocked trying to get a connection from a connection pool. Maybe you're leaking connections, or maybe you need to increase the size of the pool. On 1/3/

Re: lucene performance issues

2008-01-04 Thread Otis Gospodnetic
Oscar, Here are some ideas: - Optimize your index (won't help with synchronization, but will help with search performance) - Consider using the non-compound index format (cca 10% faster) - Wait for 2.3 (soon!) that has some performance improvements (though this synchronization bit is still there

Re: Lucene performance using a solid state disk (SSD)

2007-07-28 Thread Otis Gospodnetic
Kent, I have not seen anyone do this, but I know Kevin Burton of TailRank (BCCed) has been drooling over the same idea (check his blog(s)). :) Otis -- Lucene Consulting -- http://lucene-consulting.com/ - Original Message From: Kent Fitch <[EMAIL PROTECTED]> To: java-user@lucene.apache

Re: Lucene Performance Issues

2006-03-28 Thread thomasg
Thanks v. much for your thoughts, a lot to think about. I'm currently doing some benchmark tests on typical usage scenarios with lucene. I'm actually using lucene through its integration with Jackrabbit dms so may not be easy/possible to use a different search engine anyway. Of course I'd rather b

Re: Lucene Performance Issues

2006-03-28 Thread Otis Gospodnetic
Hi Thomas, Sounds like FUD to me. No concrete numbers, and the benchmark they mention eh, haven't we all seen "funny" benchmarks before? Lucene is used in many large operations (e.g. Technorati, Simpy) that involve a LOT of indexing and searching, large indices, etc. I suggest you try bot

Re: Lucene Performance Issues

2006-03-28 Thread Doug Cutting
thomasg wrote: Hi, we are currently intending to implement a document storage / search tool using Jackrabbit and Lucene. We have been approached by a commercial search and indexing organisation called ISYS who are suggesting the following problems with using Lucene. We do have a requirement to st

Re: Lucene Performance Issues

2006-03-28 Thread Eric Jain
thomasg wrote: 1) By default, Lucene only indexes the first 10,000 words from each document. When increasing this default out-of-memory errors can occur. This implies that documents, or large sections thereof, are loaded into memory. ISYS has a very small memory footprint which is not affected by

Re: Lucene performance question

2006-03-09 Thread DanielFeinstein
I'm using the following java options: JAVA_OPTS='-Xmx1524m -Xms1524m -Djava.awt.headless=true' --- Grant Ingersoll <[EMAIL PROTECTED]> wrote: > What is your Java max heap size set to? This is the > -Xmx Java option. > > Daniel Feinstein wrote: > > Hi, > > > > My lucene index is not big (about

Re: Lucene performance question

2006-03-09 Thread Grant Ingersoll
What is your Java max heap size set to? This is the -Xmx Java option. Daniel Feinstein wrote: Hi, My lucene index is not big (about 150M). My computer has 2G RAM but for some reason when I'm trying to store my index using org.apache.lucene.store.RAMDirectory it fails with java out of memory
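
Editorial sketch for checking which maximum heap size the running JVM actually picked up, which can help confirm that an -Xmx setting took effect. HeapCheckSketch and maxHeapMegabytes are hypothetical names; Runtime.maxMemory() is the standard JDK call.

```java
final class HeapCheckSketch {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

If the reported value is well below the size of the index plus working memory, loading the whole index into RAM is likely to fail regardless of how much physical memory the machine has.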

Re: Lucene performance bottlenecks

2005-12-12 Thread Chris Hostetter
: Oh, BTW: I just found the DisjunctionMaxQuery class, recently added it : seems. Do you think this query structure could benefit from using it : instead of the BooleanQuery? DisjunctionMaxQuery kicks ass (in my opinion), and it certainly seems like (from your query structure) it's something you

Re: Lucene performance bottlenecks

2005-12-12 Thread Andrzej Bialecki
Paul Elschot wrote: There is one indexing parameter that might help performance for BooleanScorer2, it is the skip interval in Lucene's TermInfosWriter. The current value is 16, and there was a question about it on 16 Oct 2005 on java-dev with title "skipInterval". I don't know how the value of

Re: Lucene performance bottlenecks

2005-12-11 Thread Paul Elschot
On Wednesday 07 December 2005 10:51, Andrzej Bialecki wrote: > Paul Elschot wrote: > >On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote: > >>Paul Elschot wrote: > >> ... > >>>This is one of the cases in which BooleanScorer2 can be faster > >>>than the 1.4 BooleanScorer because the 1.4 Boo

RE: Lucene performance bottlenecks

2005-12-08 Thread Dalton, Jeffery
Andrzej, I think you did a great job elucidating my thoughts as well. I heartily concur with everything you said. Andrzej Bialecki Wrote: > Hmm... Please define what "adequate" means. :-) IMHO, > "adequate" is when for any query the response time is well > below 1 second. Otherwise the serv

Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
(Moving the discussion to nutch-dev, please drop the cc: when responding) Doug Cutting wrote: Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by som

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too compl

Re: Lucene performance bottlenecks

2005-12-07 Thread Andrzej Bialecki
Yonik Seeley wrote: if (b>0) return b; Doing an 'and' of two bytes and checking if the result is 0 probably requires masking operations on >8 bit processors... Sometimes you can get a peek into how a JVM would optimize things by looking at the asm output of the code from a C compiler. Bot

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Paul Elschot wrote: Querying the host field like this in a web page index can be dangerous business. For example when term1 is "wikipedia" and term2 is "org", the query will match at least all pages from wikipedia.org. Note that if you search for wikipedia.org in Nutch this is interpreted as a

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
> if (b>0) return b; > Doing an 'and' of two bytes and checking if the result is 0 probably > requires masking operations on >8 bit processors... Sometimes you can get a peek into how a JVM would optimize things by looking at the asm output of the code from a C compiler. Both (b>=0) and ((b&0x80)!

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
On 12/7/05, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote: > Since 'byte' is signed in Java, can't the first test be simply written > as > if (b>0) return b; > Doing an 'and' of two bytes and checking if the result is 0 probably > requires masking operations on >8 bit processors... Yep, that was my

RE: Lucene performance bottlenecks

2005-12-07 Thread Vanlerberghe, Luc
all operators use int's... Luc -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: woensdag 7 december 2005 16:11 To: java-user@lucene.apache.org Subject: Re: Lucene performance bottlenecks I checked out readVInt() to see if I could optimize it any... For a random

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
I checked out readVInt() to see if I could optimize it any... For a random distribution of integers <200 I was able to speed it up a little bit, but nothing to write home about:
               old    new    percent
Java14-client: 13547  12468  8%
Java14-server: 6047   5266   14%
Java1
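
Editorial sketch of a variable-length integer decoder, to make the sign-test discussion in this thread concrete. It assumes the commonly described encoding (7 payload bits per byte, low-order group first, high bit set when more bytes follow) and reads from a plain byte array; VIntSketch is a hypothetical class and this is not the actual Lucene source.

```java
final class VIntSketch {
    /**
     * Decodes one variable-length int starting at offset.
     * Because byte is signed in Java, "b >= 0" is equivalent to testing that
     * the continuation (high) bit is clear, with no explicit masking needed.
     */
    static int readVInt(byte[] buf, int offset) {
        byte b = buf[offset++];
        if (b >= 0) {
            return b; // fast path: the value fits in a single byte
        }
        int value = b & 0x7F;
        for (int shift = 7; ; shift += 7) {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
            if (b >= 0) {
                return value; // high bit clear: this was the last byte
            }
        }
    }
}
```

Small values dominate in practice, so the single-byte fast path is where a cheap comparison pays off.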

Re: Lucene performance bottlenecks

2005-12-07 Thread Andrzej Bialecki
Paul Elschot wrote: On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote: Paul Elschot wrote: In somewhat more readable layout: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:

Re: Lucene performance bottlenecks

2005-12-03 Thread Paul Elschot
On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote: > Paul Elschot wrote: > > >In somewhat more readable layout: > > > >+(url:term1^4.0 anchor:term1^2.0 content:term1 > > title:term1^1.5 host:term1^2.0) > >+(url:term2^4.0 anchor:term2^2.0 content:term2 > > title:term2^1.5 host:term2^

Re: Lucene performance bottlenecks

2005-12-03 Thread Andrzej Bialecki
Paul Elschot wrote: In somewhat more readable layout: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 t

Re: Lucene performance bottlenecks

2005-12-03 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Ar

Re: Lucene performance bottlenecks

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Are you specifying -server

Re: Lucene performance bottlenecks

2005-12-02 Thread Paul Elschot
Andrzej, On Friday 02 December 2005 12:55, Andrzej Bialecki wrote: > Hi, > > I'm doing some performance profiling of a Nutch installation, working > with relatively large individual indexes (10 mln docs), and I'm puzzled > with the results. > > Here's the listing of the index: > -rw-r--r-- 1