In your 2nd test, was the number of hits still 25K, even though you
added another 1M docs to the "general" data set? If not, the
query needed to do more work and would run slower.
If so, the query still does need to do more work in order to skip over
the "gold" documents: that skipping (the a…
Hi
We have modified our search query around the most restrictive dataset, and
as expected the search performance increased.
BUT, if we increase the total data volume our search performance decreases,
despite the same query and restrictive dataset.
Example:
Total dataset: 3 million; restrictive dataset: 25K
The cost() method on DocIdSetIterator is responsible for telling
BooleanQuery how costly that clause is, and how cost() is implemented
varies by query.
For the multi-term queries, like WildcardQuery, Lucene will first
visit all matched terms (during the Query.rewrite phase), and rewrite
the query
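The rewrite step described above can be illustrated with a toy term dictionary. This is a minimal pure-Java sketch of the idea, not Lucene's actual MultiTermQuery code; the class, dictionary, and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardRewriteSketch {
    // Toy term dictionary, sorted the way Lucene stores terms per field.
    static final String[] TERMS = {"gold", "golden", "goldfish", "platinum", "silver"};

    // Expand a trailing-wildcard pattern ("gold*") into concrete terms,
    // roughly what Query.rewrite does before a WildcardQuery can run as
    // a disjunction over every matching term.
    static List<String> expand(String pattern) {
        String prefix = pattern.substring(0, pattern.indexOf('*'));
        List<String> matched = new ArrayList<>();
        for (String t : TERMS) {
            if (t.startsWith(prefix)) matched.add(t);
        }
        return matched;
    }

    public static void main(String[] args) {
        // Visiting every matched term is why a broad wildcard is costly.
        System.out.println(expand("gold*")); // prints [gold, golden, goldfish]
    }
}
```

A pattern that matches many terms expands into a correspondingly large disjunction, which is where the extra cost comes from.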
OK, got it.
One thing I still need to know (which is not clear to me):
How does Lucene calculate the most restrictive clause?
Correct me if I am wrong in my understanding (in the abstract):
1. During indexing, Lucene keeps a count of documents for every indexed
term.
2. During sear…
When you add MUST sub-clauses to a BooleanQuery (AND to the query
parsers) it can make the search run faster because Lucene will take
the most restrictive clause and use that to "drive" the iteration of
matching documents to the other clauses, allowing those other clauses
to iterate much faster th
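The "drive the iteration" idea can be sketched outside Lucene with posting lists as sorted int arrays. This is a simplified stand-in for Lucene's conjunction scoring, not its real implementation; advance() only mirrors DocIdSetIterator.advance conceptually:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConjunctionSketch {
    // Index of the first doc >= target (like DocIdSetIterator.advance).
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) i++;
        return i;
    }

    // Intersect posting lists, letting the shortest (most restrictive,
    // lowest cost()) list drive: every candidate doc comes from it, and
    // the other lists only skip forward to confirm or reject it.
    static List<Integer> intersect(int[][] lists) {
        int[][] sorted = lists.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a.length, b.length));
        int[] lead = sorted[0];
        int[] pos = new int[sorted.length];
        List<Integer> hits = new ArrayList<>();
        outer:
        for (int i = 0; i < lead.length; i++) {
            int cand = lead[i];
            for (int j = 1; j < sorted.length; j++) {
                pos[j] = advance(sorted[j], pos[j], cand);
                if (pos[j] == sorted[j].length) break outer;   // a clause is exhausted
                if (sorted[j][pos[j]] != cand) continue outer; // candidate rejected
            }
            hits.add(cand);
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] rare = {7, 42, 99};                         // restrictive clause
        int[] common = {1, 2, 3, 7, 8, 42, 50, 99, 100};  // broad clause
        System.out.println(intersect(new int[][] {common, rare})); // prints [7, 42, 99]
    }
}
```

The total work is proportional to the rare list plus the skips in the broad one, which is why adding a restrictive MUST clause can speed a query up.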
The answer is not clear.
Suppose I have the following query and I want 10 records:
Condition1 AND Condition2 AND Condition3
As per my understanding, Lucene will first evaluate all conditions
separately and then merge the documents as per the AND/OR clauses.
At last it will return 10 records.
So, if I…
My guess: more conditions = fewer documents to score and sort before returning.
On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj
wrote:
> Hi
>
> Is there any Lucene performance benchmark against a certain set of data?
> [i.e. are there any stats for the search throughput Lucene can provide for
> a certain d…
Thanks, I've put some time checks on the different parts of my search; it
seems like the directory-opening part is taking most of the response time.
I'm using MMapDirectory, but it doesn't seem to speed up my directory
opening process.
I've split my indexes during creation into different folders, a…
You'll have to do some tuning with that kind of ingestion rate, and
you're talking about a significant size cluster here. At 172M
documents/day or so, you're not going to store very many days per
node.
Storing doesn't make much of any difference as far as search
speed is concerned; the raw data is…
Thanks Li Li.
Please share your experience in 64 bit. How big your indexes are?
Regards
Ganesh
- Original Message -
From: "Li Li"
To:
Sent: Thursday, March 01, 2012 3:03 PM
Subject: Re: Lucene performance in 64 Bit
>I think many users of lucene use large memory
I think many users of Lucene use large memory because a 32-bit system's
memory is too limited (Windows ~1.5GB, Linux 2-3GB). The only noticeable
thing is *CompressedOops*; some say it's useful, some not. You should give
it a try.
On Thu, Mar 1, 2012 at 4:59 PM, Ganesh wrote:
> Hello all,
>
> Is
> >> Sent: Thursday, June 18, 2009 12:44 AM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Lucene performance: is search time linear to the
> >> index size?
> >>
> >> Opening a searcher and doing the first query incurs a
> >> significan
> From: Jay Booth [mailto:jbo...@wgen.net]
> Are you fetching all of the results for your search?
No, I'm not doing anything on the search results.
This is essentially what I do:
searcher = new IndexSearcher(IndexReader.open(indexFileDir));
query = new TermQuery(new Term(fieldName, termText));
ous clauses
of the query.
-Yonik
http://www.lucidimagination.com
> -kuro
>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Thursday, June 18, 2009 12:44 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene p
disk, not the search time.
-Original Message-
From: Teruhiko Kurosaka [mailto:k...@basistech.com]
Sent: Thursday, June 18, 2009 2:55 PM
To: java-user@lucene.apache.org
Subject: RE: Lucene performance: is search time linear to the index
size?
Erik,
The way I test this program is by is…
…one Document that can match a query, the search time remains constant no
matter how large the index is.
-kuro
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, June 18, 2009 12:44 AM
> To: java-user@lucene.apache.org
>
Opening a searcher and doing the first query incurs a significant amount of
overhead, cache loading, etc. Inferring search times relative to index size
with a program like you describe is unreliable.
Try firing a few queries at the index without measuring, *then* measure the
time it takes for subs
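A measurement harness along these lines - fire untimed warmup queries first, then time repeated runs and take the median - could look like the sketch below. The class name and the dummy search stand-in are invented for illustration; in a real test the supplier would wrap searcher.search():

```java
import java.util.Arrays;
import java.util.function.IntSupplier;

public class WarmupTiming {
    // Time a search, discarding the first few runs so cache loading and
    // JIT compilation don't dominate the measurement.
    static long medianNanosAfterWarmup(IntSupplier search, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) search.getAsInt(); // untimed warmup queries
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            search.getAsInt();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples[runs / 2]; // median is more robust than the first (cold) run
    }

    public static void main(String[] args) {
        // Stand-in for a real searcher.search(query, 10) call; any
        // repeatable piece of work demonstrates the harness.
        IntSupplier dummySearch = () -> {
            int acc = 0;
            for (int i = 0; i < 100_000; i++) acc += i;
            return acc;
        };
        System.out.println("median ns: " + medianNanosAfterWarmup(dummySearch, 5, 11));
    }
}
```

Comparing medians taken this way across indexes of different sizes avoids attributing one-time startup cost to index size.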
I've written a test program that uses the simplest form of search, a
TermQuery, and measures the time it takes to search for a term in a field
on indices of various sizes.
The result is very linear growth of search time vs. the index size in
terms of # of Documents, not # of unique terms in that field.
:erickerick...@gmail.com]
> > Sent: Wednesday, June 17, 2009 9:09 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene performance: is search time linear to the
> > index size?
> >
> > Are you measuring search time *only* or are you measuring
> >
June 17, 2009 9:09 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene performance: is search time linear to the
> index size?
>
> Are you measuring search time *only* or are you measuring
> total response time including assembling whatever you
> assemble? If y
Are you measuring search time *only* or are you measuring total response
time
including assembling whatever you assemble? If you're measuring total
response
time, everything from network latency to what you're doing with each hit may
affect response time.
This is especially true if you're iteratin
It depends on lots of things, but the time to execute a search would not
typically grow linearly with the number of documents. But the time to
retrieve data from all the hits might, if the number of hits is growing in
line with the number of documents. Are you doing that by any chance, as
opposed
@Erick: Yes, I changed the default field; it is "bagofwords" now.
@Ian: Yes, both indexes were optimized, and I didn't do any deletions.
Version 2.4.0.
I'll repeat the experiment, just to be sure.
Meanwhile, do you have any document on Lucene fields? What I need to know
is how Lucene is storing field…
> ...
> I can for sure say that multiple copies are not indexed. But the number of
> fields in which the text is divided are many. Can that be a reason?
Not for that amount of difference. You may be sure that you are not
indexing multiple copies, but I'm not. Convince me - create 2 new
indexes via the…
Note that your two queries are different unless you've
changed the default operator.
Also, your bagOfWords query is searching across your
default field for the second two terms.
Your bagOfWords is really something like
bagOfWords:Alexander OR :history OR :Macedon.
Best
Erick
On Wed, Jan 21, 20
I agree with Ian that these times sound way too high. I'd
also ask whether you fire a few warmup searches at your
server before measuring the increased time, you might
just be seeing the cache being populated.
Best
Erick
On Wed, Jan 21, 2009 at 10:42 AM, Ian Lea wrote:
> Hi
>
>
> Space: 700Mb v
Hi,
thanks for the reply.
For the document, in my last mail:
multifieldQuery:
name: Alexander AND domain: history AND first_sentence: Macedon
Single field query:
bagOfWords: Alexander history Macedon
I can for sure say that multiple copies are not indexed. But the number of
fields in which the text…
Hi
Space: 700Mb vs 4.5Gb sounds way too big a difference. Are you sure
you aren't loading multiple copies of the data or something like that?
Queries: a 20 times slowdown for a multi field query also sounds way
too big. What do the simple and multi field queries look like?
--
Ian.
On Wed,
Sent: Sunday, July 27, 2008 4:59pm
To: java-user@lucene.apache.org
Subject: Re: Lucene performance issues..
On Sonntag, 27. Juli 2008, Mazhar Lateef wrote:
We have also tried upgrading the lucene version to 2.3 in hope to
improve performance but the results were quite the opposite. but
from my
research on the internet the Lucene
On Sun, 2008-07-27 at 21:38 +0100, Mazhar Lateef wrote:
> * email searching
> o We are creating very large indexes for emails we are
> processing, the size is upto +150GB for indexes only (not
> including data content), this we thought would improve
> search
o change for a while.
>
>
> -Original Message-
> From: "Daniel Naber" <[EMAIL PROTECTED]>
> Sent: Sunday, July 27, 2008 4:59pm
> To: java-user@lucene.apache.org
> Subject: Re: Lucene performance issues..
>
> On Sonntag, 27. Juli 2008, Mazhar Lateef wrote:
Sent: Sunday, July 27, 2008 4:59pm
To: java-user@lucene.apache.org
Subject: Re: Lucene performance issues..
On Sonntag, 27. Juli 2008, Mazhar Lateef wrote:
> We have also tried upgrading the lucene version to 2.3 in hope to
> improve performance but the results were quite the opposite. but fro
On Sonntag, 27. Juli 2008, Mazhar Lateef wrote:
> We have also tried upgrading the lucene version to 2.3 in hope to
> improve performance but the results were quite the opposite. but from my
> research on the internet the Lucene version 2.3 is much faster and
> better so why are we seeing such inc
Hi Anshum,
A reasonable question. Answer: a 64-bit architecture running a 64-bit Java
VM. It is great! :-)
> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
> OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
If you have any other questions, please let me know. :-)
-Glen
On 18/0
Hi Glenn,
I am not too clear about it, but isn't there a limit to the memory
consumption specified for the JVM? The limit being 1.3 GB resident
and 2 GB of memory in all? You just mentioned the memory settings
-Xms4000m -Xmx6000m.
Could someone please help me with this.
--
Anshum
O
On 16/04/2008, Michael McCandless <[EMAIL PROTECTED]> wrote:
> These are great results! Thanks for posting.
Thanks!
>
> I'd be curious if you'd get better indexing throughput by using a single
> IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer,
> instead of 8 IndexWriter
These are great results! Thanks for posting.
I'd be curious if you'd get better indexing throughput by using a
single IndexWriter, fed by all 8 indexing threads, with an 8X bigger
RAM buffer, instead of 8 IndexWriters that merge in the end.
How long does that final merge take now?
Also, 6
Cass,
Thanks for converting it. I've posted it to my blog:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
Sorry for the XML tags: I guess I followed the instructions on the
Lucene performance benchmarks page too literally ("Post these figures
to the lucene-user maili…
I just did that so I could read it. :) I'll leave it up until Glen resends
or posts it somewhere...
http://www.casscostello.com/?page_id=28
On Tue, Apr 15, 2008 at 5:18 PM, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Hi Glen.
> can you resend this in plain text?
> or put the HTML up on a server s
Hi Glen.
can you resend this in plain text?
or put the HTML up on a server somewhere and point to it with a brief
summary in the post?
I'd love to look and read it, all those tags are making me go blind.
Glen Newton wrote:
Hardware Environment
Dedicated machine for indexing: yes
CPU: D
Thanks for your answer,
I will look into this in more detail.
Paul Elschot wrote:
>
> On Friday 18 January 2008 17:52:27 Thibaut Britz wrote:
>>
>> Hi,
>>
> ...
>>
>> Another thing I noticed is that we append a lot of queries, so we have a
>> lot
>> of duplicate phrases like (A and B or C)
On Friday 18 January 2008 17:52:27 Thibaut Britz wrote:
>
> Hi,
>
...
>
> Another thing I noticed is that we append a lot of queries, so we have a lot
> of duplicate phrases like (A and B or C) and ... and (A and B or C) (more
> nested than that). Is lucene doing any internal query optimization
Your grinder output seems to indicate clearly that your bottleneck is
in your database code, not in lucene. It seems that the threads are
all blocked trying to get a connection from a connection pool. Maybe
you're leaking connections, or maybe you need to increase the size of
the pool.
On 1/3/
Oscar,
Here are some ideas:
- Optimize your index (won't help with synchronization, but will help with
search performance)
- Consider using the non-compound index format (cca 10% faster)
- Wait for 2.3 (soon!) that has some performance improvements (though this
synchronization bit is still there
Kent,
I have not seen anyone do this, but I know Kevin Burton of TailRank (BCCed) has
been drooling over the same idea (check his blog(s)). :)
Otis
--
Lucene Consulting -- http://lucene-consulting.com/
- Original Message
From: Kent Fitch <[EMAIL PROTECTED]>
To: java-user@lucene.apache
Thanks v. much for your thoughts, a lot to think about. I'm currently doing
some benchmark tests on typical usage scenarios with lucene. I'm actually
using lucene through its integration with Jackrabbit dms so may not be
easy/possible to use a different search engine anyway. Of course I'd rather
b
Hi Thomas,
Sounds like FUD to me. No concrete numbers, and the benchmark they mention
- eh, haven't we all seen "funny" benchmarks before? Lucene is used in many
large operations (e.g. Technorati, Simpy) that involve a LOT of indexing and
searching, large indices, etc. I suggest you try bot
thomasg wrote:
Hi, we are currently intending to implement a document storage / search tool
using Jackrabbit and Lucene. We have been approached by a commercial search
and indexing organisation called ISYS who are suggesting the following
problems with using Lucene. We do have a requirement to st
thomasg wrote:
1) By default, Lucene only indexes the first 10,000 words from each
document. When increasing this default, out-of-memory errors can occur. This
implies that documents, or large sections thereof, are loaded into memory.
ISYS has a very small memory footprint which is not affected by
I'm using the following java options:
JAVA_OPTS='-Xmx1524m -Xms1524m
-Djava.awt.headless=true'
--- Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> What is your Java max heap size set to? This is the
> -Xmx Java option.
>
> Daniel Feinstein wrote:
> > Hi,
> >
> > My lucene index is not big (about
What is your Java max heap size set to? This is the -Xmx Java option.
Daniel Feinstein wrote:
Hi,
My lucene index is not big (about 150M). My computer has 2G RAM but for some reason when I'm trying to store my index
using org.apache.lucene.store.RAMDirectory it fails with java out of memory
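As a rough sanity check before loading an index into RAMDirectory, the on-disk index size can be compared against the heap ceiling. A small sketch (the 2x headroom factor is a rule of thumb for object overhead, not a Lucene guarantee):

```java
public class HeapCheck {
    public static void main(String[] args) {
        long indexBytes = 150L * 1024 * 1024; // ~150M index on disk
        long maxHeap = Runtime.getRuntime().maxMemory(); // bounded by -Xmx
        // RAMDirectory keeps the whole index (plus per-object overhead)
        // on the heap, so leave generous headroom over the on-disk size.
        System.out.println("max heap: " + maxHeap + " bytes");
        System.out.println("fits with 2x headroom: " + (maxHeap > 2 * indexBytes));
    }
}
```

If the check fails, either raise -Xmx or stay with a file-based directory.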
: Oh, BTW: I just found the DisjunctionMaxQuery class, recently added it
: seems. Do you think this query structure could benefit from using it
: instead of the BooleanQuery?
DisjunctionMaxQuery kicks ass (in my opinion), and it certainly seems like
(from your query structure) it's something you…
Paul Elschot wrote:
There is one indexing parameter that might help performance
for BooleanScorer2, it is the skip interval in Lucene's TermInfosWriter.
The current value is 16, and there was a question about it
on 16 Oct 2005 on java-dev with title "skipInterval".
I don't know how the value of
On Wednesday 07 December 2005 10:51, Andrzej Bialecki wrote:
> Paul Elschot wrote:
> >On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
> >>Paul Elschot wrote:
> >>
...
> >>>This is one of the cases in which BooleanScorer2 can be faster
> >>>than the 1.4 BooleanScorer because the 1.4 Boo
Andrzej, I think you did a great job elucidating my thoughts as well. I
heartily concur with everything you said.
Andrzej Bialecki Wrote:
> Hmm... Please define what "adequate" means. :-) IMHO,
> "adequate" is when for any query the response time is well
> below 1 second. Otherwise the serv
(Moving the discussion to nutch-dev, please drop the cc: when responding)
Doug Cutting wrote:
Andrzej Bialecki wrote:
It's nice to have these couple percent... however, it doesn't solve
the main problem; I need 50 or more percent increase... :-) and I
suspect this can be achieved only by som
Andrzej Bialecki wrote:
It's nice to have these couple percent... however, it doesn't solve the
main problem; I need 50 or more percent increase... :-) and I suspect
this can be achieved only by some radical changes in the way Nutch uses
Lucene. It seems the default query structure is too compl
Yonik Seeley wrote:
if (b>0) return b;
Doing an 'and' of two bytes and checking if the result is 0 probably
requires masking operations on >8 bit processors...
Sometimes you can get a peek into how a JVM would optimize things by
looking at the asm output of the code from a C compiler.
Bot
Paul Elschot wrote:
Querying the host field like this in a web page index can be dangerous
business. For example when term1 is "wikipedia" and term2 is "org",
the query will match at least all pages from wikipedia.org.
Note that if you search for wikipedia.org in Nutch this is interpreted
as a
> if (b>0) return b;
> Doing an 'and' of two bytes and checking if the result is 0 probably
> requires masking operations on >8 bit processors...
Sometimes you can get a peek into how a JVM would optimize things by
looking at the asm output of the code from a C compiler.
Both (b>=0) and ((b&0x80)!
On 12/7/05, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote:
> Since 'byte' is signed in Java, can't the first test be simply written
> as
> if (b>0) return b;
> Doing an 'and' of two bytes and checking if the result is 0 probably
> requires masking operations on >8 bit processors...
Yep, that was my
all operators use int's...
Luc
-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: woensdag 7 december 2005 16:11
To: java-user@lucene.apache.org
Subject: Re: Lucene performance bottlenecks
I checked out readVInt() to see if I could optimize it any...
For a random
I checked out readVInt() to see if I could optimize it any...
For a random distribution of integers <200 I was able to speed it up a
little bit, but nothing to write home about:
               old    new    percent
Java14-client: 13547  12468  8%
Java14-server: 6047   5266   14%
Java1…
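For context, Lucene's VInt format stores 7 payload bits per byte with the high bit as a continuation flag, which is why the (b >= 0) fast path discussed above works: a signed Java byte is non-negative exactly when the continuation bit is clear. A self-contained sketch of the encode/decode roundtrip (illustrative code, not Lucene's actual readVInt source):

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Write v in VInt form: 7 payload bits per byte, high bit set on
    // every byte except the last.
    static byte[] writeVInt(int v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    // Read it back. Because byte is signed in Java, (b >= 0) is the same
    // test as ((b & 0x80) == 0), so the one-byte case needs no masking.
    static int readVInt(byte[] buf) {
        byte b = buf[0];
        if (b >= 0) return b; // common one-byte case
        int v = b & 0x7F;
        int pos = 1;
        int shift = 7;
        do {
            b = buf[pos++];
            v |= (b & 0x7F) << shift;
            shift += 7;
        } while (b < 0);
        return v;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 100, 300, 16384, Integer.MAX_VALUE}) {
            byte[] encoded = writeVInt(v);
            // e.g. 300 -> 2 bytes -> 300
            System.out.println(v + " -> " + encoded.length + " bytes -> " + readVInt(encoded));
        }
    }
}
```

Since most doc-id deltas fit in one byte, the sign-test fast path is taken on the hot path, which is what the micro-optimization above was targeting.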
Paul Elschot wrote:
On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
Paul Elschot wrote:
In somewhat more readable layout:
+(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2
title:term2^1.5 host:
On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
> Paul Elschot wrote:
>
> >In somewhat more readable layout:
> >
> >+(url:term1^4.0 anchor:term1^2.0 content:term1
> > title:term1^1.5 host:term1^2.0)
> >+(url:term2^4.0 anchor:term2^2.0 content:term2
> > title:term2^1.5 host:term2^
Paul Elschot wrote:
In somewhat more readable layout:
+(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0)
+(url:term2^4.0 anchor:term2^2.0 content:term2
title:term2^1.5 host:term2^2.0)
url:"term1 term2"~2147483647^4.0
anchor:"term1 term2"~4^2.0
content:"term1 t
Doug Cutting wrote:
Andrzej Bialecki wrote:
For a simple TermQuery, if the DF(term) is above 10%, the response
time from IndexSearcher.search() is around 400ms (repeatable, after
warm-up). For such complex phrase queries the response time is around
1 sec or more (again, after warm-up).
Ar
Andrzej Bialecki wrote:
For a simple TermQuery, if the DF(term) is above 10%, the response time
from IndexSearcher.search() is around 400ms (repeatable, after warm-up).
For such complex phrase queries the response time is around 1 sec or
more (again, after warm-up).
Are you specifying -server
Andrzej,
On Friday 02 December 2005 12:55, Andrzej Bialecki wrote:
> Hi,
>
> I'm doing some performance profiling of a Nutch installation, working
> with relatively large individual indexes (10 mln docs), and I'm puzzled
> with the results.
>
> Here's the listing of the index:
> -rw-r--r-- 1