Re: Performance of never optimizing

2008-11-05 Thread Paul Smith
I don't believe our large users have enough memory for their Lucene indexes to fit in RAM. (Especially given we use quite a bit of RAM for other stuff.) I think we also close readers pretty frequently (whenever any user updates a JIRA issue, which I am assuming happens nearly constantly
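
Paul's point about closing readers on every issue update suggests an alternative pattern: keep a shared reader and reopen it only when the index has actually changed (later Lucene exposes this as IndexReader.reopen). A minimal sketch of the idea, using an illustrative version counter rather than any Lucene API:

```python
# Illustrative sketch (not Lucene API): reopen a shared reader only when
# the index version has advanced, instead of closing it on every search.

class Index:
    def __init__(self):
        self.version = 0
    def update(self):
        self.version += 1  # e.g. a JIRA issue was edited

class ReaderCache:
    def __init__(self, index):
        self.index = index
        self.reader_version = None
        self.opens = 0  # how many (expensive) opens we actually performed
    def get_reader(self):
        if self.reader_version != self.index.version:
            self.reader_version = self.index.version  # simulate a (re)open
            self.opens += 1
        return self.reader_version

idx = Index()
cache = ReaderCache(idx)
for _ in range(100):  # 100 searches against an unchanged index
    cache.get_reader()
idx.update()          # one issue edited
cache.get_reader()
print(cache.opens)    # 2: one initial open, one reopen after the update
```

Even with near-constant updates, the reopen cost is paid once per index change rather than once per search.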

Re: Performance of never optimizing

2008-11-05 Thread Michael McCandless
Otis Gospodnetic wrote: Our current default behaviour is a merge factor of 4. We perform an optimization on the index every 4000 additions. We also perform an optimize at midnight. I wouldn't optimize every 4000 additions - you are killing I/O, rewriting the whole index, while trying
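
McCandless's I/O objection can be made concrete with a back-of-the-envelope model. Assuming each optimize rewrites the entire index (which is what optimize does), optimizing every 4000 additions rewrites a total volume that grows quadratically in the index size:

```python
# Back-of-the-envelope model: optimize rewrites the whole index, so
# optimizing every `optimize_every` additions rewrites far more data
# than was ever added. doc_size is an arbitrary unit.

def bytes_rewritten_by_optimizes(total_docs, optimize_every, doc_size=1):
    rewritten = 0
    for n in range(optimize_every, total_docs + 1, optimize_every):
        rewritten += n * doc_size  # this optimize rewrites all n docs so far
    return rewritten

total = 100_000
ratio = bytes_rewritten_by_optimizes(total, 4_000) / total
print(ratio)  # 13.0: optimizes alone rewrite ~13x the final index size
```

By contrast, letting the merge policy run in the background amortizes merging cost and never rewrites the whole index at once.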

Re: Performance of never optimizing

2008-11-05 Thread Michael McCandless
Justus Pendleton wrote: On 05/11/2008, at 4:36 AM, Michael McCandless wrote: If possible, you should try to use a larger corpus (eg Wikipedia) rather than multiply Reuters by N, which creates unnatural term frequency distribution. I'll replicate the tests with the wikipedia corpus over th

Re: Performance of never optimizing

2008-11-05 Thread Michael McCandless
Tomer Gabel wrote: Since you're using an 8-core Mac Pro I also assume you have some sort of RAID setup, which means your storage subsystem can physically handle more than one concurrent request, which can only come into play with multiple segments. This is an important point: a multi-seg

Re: Performance of never optimizing

2008-11-05 Thread Yonik Seeley
On Wed, Nov 5, 2008 at 9:47 AM, Tomer Gabel <[EMAIL PROTECTED]> wrote: > 1. Higher merge factor => more segments. Right, and it's also important to note that it's only "on average" more segments. The number of segments goes up and down with merging, so at particular points in time, an index with a h
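
Yonik's "on average" caveat is easy to see with a toy simulation of a logarithmic merge policy (a simplified model, not Lucene's actual LogMergePolicy): with merge factor M, once M same-sized segments accumulate they merge into one larger segment, so the segment count oscillates rather than growing steadily.

```python
# Toy log-merge simulation: when merge_factor segments of the same level
# accumulate, they merge into one segment at the next level up.
# Tracks the segment count after each flush.

def segment_counts(num_flushes, merge_factor):
    segments = []  # each entry is a segment's "level" (size ~ M**level)
    counts = []
    for _ in range(num_flushes):
        segments.append(0)  # flush one new small (level-0) segment
        while True:         # cascade merges as long as any level is full
            for level in set(segments):
                if segments.count(level) == merge_factor:
                    for _ in range(merge_factor):
                        segments.remove(level)
                    segments.append(level + 1)
                    break
            else:
                break
        counts.append(len(segments))
    return counts

counts = segment_counts(64, 4)
print(min(counts), max(counts))  # 1 9: the count swings widely over time
```

So immediately after a cascading merge, a merge-factor-4 index can briefly have fewer segments than a merge-factor-2 index measured at an unlucky moment.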

Re: Performance of never optimizing

2008-11-05 Thread Tomer Gabel
-- http://www.tomergabel.com Tomer Gabel

Re: Performance of never optimizing

2008-11-04 Thread Justus Pendleton
On 05/11/2008, at 4:36 AM, Michael McCandless wrote: If possible, you should try to use a larger corpus (eg Wikipedia) rather than multiply Reuters by N, which creates unnatural term frequency distribution. I'll replicate the tests with the wikipedia corpus over the next few days and regen

Re: Performance of never optimizing

2008-11-04 Thread Michael McCandless
If possible, you should try to use a larger corpus (eg Wikipedia) rather than multiply Reuters by N, which creates unnatural term frequency distribution. The graphs are hard to read because of the spline interpolation. Maybe you could overlay X's where there is a real datapoint? After

Re: Performance of never optimizing

2008-11-04 Thread Toke Eskildsen
On Mon, 2008-11-03 at 23:37 +0100, Justus Pendleton wrote: > What constitutes a "proper warm up before measuring"? The simplest way is to do a number of searches before you start measuring. The first searches are always very slow, compared to later searches. If you look at http://wiki.statsbiblio
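
Toke's warm-up advice translates into a simple harness shape: run a batch of throwaway searches first so cold OS and index caches don't pollute the numbers, then time only the later runs. A minimal, library-agnostic sketch:

```python
# Minimal warm-up harness (illustrative): discard the first `warmups`
# searches, then time the real runs with a monotonic clock.
import time

def benchmark(search_fn, query, warmups=100, runs=100):
    for _ in range(warmups):
        search_fn(query)  # results and timings discarded: just warms caches
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return timings

calls = []  # stand-in "search" that just records invocations
timings = benchmark(calls.append, "lucene", warmups=5, runs=10)
print(len(calls), len(timings))  # 15 calls total, only 10 measured
```

The choice of warm-up count matters less than having one at all; the first few searches against a freshly opened index are reliably the slowest.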

Re: Performance of never optimizing

2008-11-03 Thread Mark Miller
Been a while since I've been in the benchmark stuff, so I am going to take some time to look at this when I get a chance, but off the cuff I think you are opening and closing the reader for each search. Try using the OpenReader task before the 100 searches and then the CloseReader task. That will
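
The cost difference Mark is pointing at can be modeled with two arbitrary illustrative constants: treat each reader open as a fixed expensive step and compare opening inside the search loop against opening once around it (the OpenReader/CloseReader-task pattern).

```python
# Sketch of open-per-search vs. open-once cost. The constants are
# arbitrary illustrative units, not measured Lucene numbers.

OPEN_COST, SEARCH_COST = 50, 1

def per_search_open(searches):
    # reader opened and closed inside the loop, once per search
    return searches * (OPEN_COST + SEARCH_COST)

def open_once(searches):
    # reader opened before the loop and shared by every search
    return OPEN_COST + searches * SEARCH_COST

print(per_search_open(100), open_once(100))  # 5100 150
```

With the open cost dominating, a benchmark that reopens per search is mostly measuring reader-open overhead, not search speed.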

Re: Performance of never optimizing

2008-11-03 Thread Justus Pendleton
On 03/11/2008, at 11:07 PM, Mark Miller wrote: Am I missing your benchmark algorithm somewhere? We need it. Something doesn't make sense. I thought I had included it at [1] before but apparently not; my apologies for that. I have updated that wiki page. I'll also reproduce it here: { "Ro

Re: Performance of never optimizing

2008-11-03 Thread Toke Eskildsen
On Mon, 2008-11-03 at 04:42 +0100, Justus Pendleton wrote: > 1. Why does the merge factor of 4 appear to be faster than the merge > factor of 2? Because you alternate between updating the index and searching? With 4 segments, chances are that most of the segment-data will be unchanged between sear

Re: Performance of never optimizing

2008-11-03 Thread Mark Miller
Am I missing your benchmark algorithm somewhere? We need it. Something doesn't make sense. - Mark Justus Pendleton wrote: Howdy, I have a couple of questions regarding some Lucene benchmarking and what the results mean[3]. (Skip to the numbered list at the end if you don't want to read the

RE: Performance of never optimizing

2008-11-03 Thread Ard Schrijvers
Hello Justus, Chris and Otis, IIRC Ocean [1] by Jason Rutherglen addresses the issue for real-time searches on large data sets. A conceptually comparable implementation is done for Jackrabbit, where you can see an enlightening picture over here [2]. In short: 1) IndexReaders are opened only once

Re: Performance of never optimizing

2008-11-02 Thread Chris Lu
Hi, Justus, I have run into very similar problems to JIRA's: a high modification rate on a large data volume. It's a pretty common use case for Lucene. The way I dealt with the high rate of modification is to create a secondary in-memory index, and only persist documents older than a per
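
Chris's two-tier approach can be sketched as follows. The class and method names here are hypothetical illustrations of the pattern (fresh documents buffered in memory, documents older than a cutoff persisted, queries consulting both tiers), not Lucene or Ocean APIs:

```python
# Hypothetical sketch of a secondary in-memory index: new documents live
# in a small memory tier; documents older than max_age are moved to the
# persistent tier. Searches must consult both tiers.

class TwoTierIndex:
    def __init__(self, max_age):
        self.max_age = max_age
        self.memory = []  # (timestamp, doc) pairs: the freshest data
        self.disk = []    # persisted docs

    def add(self, now, doc):
        self.memory.append((now, doc))

    def persist_old(self, now):
        old = [(t, d) for t, d in self.memory if now - t >= self.max_age]
        self.memory = [(t, d) for t, d in self.memory if now - t < self.max_age]
        self.disk.extend(d for _, d in old)

    def search(self, predicate):
        hits = [d for _, d in self.memory if predicate(d)]
        hits += [d for d in self.disk if predicate(d)]
        return hits

idx = TwoTierIndex(max_age=60)
idx.add(0, "ISSUE-1")
idx.add(100, "ISSUE-2")
idx.persist_old(now=100)  # ISSUE-1 is 100s old -> moved to the disk tier
print(len(idx.memory), len(idx.disk))  # 1 1
```

The memory tier absorbs the high update rate cheaply, while the large disk index is rewritten only at the persistence interval.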

Re: Performance of never optimizing

2008-11-02 Thread Justus Pendleton
On 03/11/2008, at 4:27 PM, Otis Gospodnetic wrote: Why are you optimizing? Trying to make the search faster? I would try to avoid optimizing during high usage periods. I assume that the original, long-ago, decision to optimize was made to improve searching performance. One thing that you

Re: Performance of never optimizing

2008-11-02 Thread Otis Gospodnetic
Hello, Very quick comments. - Original Message > From: Justus Pendleton <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Sunday, November 2, 2008 10:42:52 PM > Subject: Performance of never optimizing > > Howdy, > > I have a couple of q

Performance of never optimizing

2008-11-02 Thread Justus Pendleton
Howdy, I have a couple of questions regarding some Lucene benchmarking and what the results mean[3]. (Skip to the numbered list at the end if you don't want to read the lengthy exegesis :) I'm a developer for JIRA[1]. We are currently trying to get a better understanding of Lucene, and ou