Re: performance implications for an index with large number of documents.

Otis Gospodnetic Wed, 25 Jan 2006 11:31:02 -0800

Hi,
Quick reactions:
- Do use -server option, it makes a difference, and I don't think there is much 
to test there (I've never run a daemon-like service without the -server option, 
and have seen the improvement in performance due to HotSpot with my own eyes)
- Optimizing every hour sounds like a bad idea.  Instead of re-optimizing so 
often and rewriting the whole index to disk (slow), consider changing your 
mergeFactor.
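
To make the trade-off behind mergeFactor concrete, here is a minimal sketch in plain Java. It does not call Lucene. It assumes a deliberately simplified model in which every added document starts as its own one-document segment and any run of mergeFactor equal-sized segments is merged into one. The class and method names (MergeFactorSketch, simulate) are hypothetical, and real Lucene merging also depends on buffering settings that this model ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of segment merging; NOT Lucene code.
class MergeFactorSketch {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final int segments;       // segments left at the end of the run
        final long docsRewritten; // documents copied by merges along the way

        Result(int segments, long docsRewritten) {
            this.segments = segments;
            this.docsRewritten = docsRewritten;
        }
    }

    /**
     * Adds numDocs documents one at a time. Whenever the newest mergeFactor
     * segments all have the same size, they are merged into one segment,
     * and the merge may cascade to the next level.
     */
    static Result simulate(int numDocs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> sizes = new ArrayList<>(); // segment sizes, oldest first
        long rewritten = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            sizes.add(1); // assumption: each document starts as its own segment
            while (sizes.size() >= mergeFactor) {
                int n = sizes.size();
                int newest = sizes.get(n - 1);
                boolean sameSize = true;
                for (int j = n - mergeFactor; j < n; j++) {
                    if (sizes.get(j) != newest) {
                        sameSize = false;
                        break;
                    }
                }
                if (!sameSize) {
                    break;
                }
                // Replace the newest mergeFactor segments with one merged segment.
                for (int j = 0; j < mergeFactor; j++) {
                    sizes.remove(sizes.size() - 1);
                }
                int merged = newest * mergeFactor;
                sizes.add(merged);
                rewritten += merged; // every document in the merge is written again
            }
        }
        return new Result(sizes.size(), rewritten);
    }

    public static void main(String[] args) {
        for (int mf : new int[] {2, 10, 50}) {
            Result r = simulate(999, mf);
            System.out.println("mergeFactor=" + mf + " segments=" + r.segments
                    + " docsRewritten=" + r.docsRewritten);
        }
    }
}
```

In this model a larger mergeFactor rewrites fewer documents during indexing but leaves more segments for searches to visit, while a smaller one does the opposite. A full optimize, by contrast, rewrites every document in the index each time it runs.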


Otis

----- Original Message ----
From: Ori Schnaps <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tue 24 Jan 2006 11:39:11 AM EST
Subject: Re: performance implications for an index with large number of 
documents.

hi,

Thank you for all the quick and pertinent responses.

The index is being optimized every hour due to the number of updates. 
The JVM has a 2 GB heap and the machine has 4 GB in total.
Currently the JVM is not configured with the -server parameter or
parallel garbage collection (we are testing that configuration).
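
One way to confirm which settings a running JVM actually picked up is to ask it directly. The sketch below uses only standard JDK calls. The class name JvmSettingsCheck is hypothetical, and checking the VM name for the word "Server" is a heuristic assumption rather than a guaranteed contract.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports the JVM settings discussed in this thread.
class JvmSettingsCheck {

    /** VM name as reported by the runtime, e.g. a string containing "Server VM". */
    static String vmName() {
        return System.getProperty("java.vm.name");
    }

    /** Heuristic: HotSpot-style server VMs usually include "Server" in their name. */
    static boolean looksLikeServerVm() {
        String name = vmName();
        return name != null && name.contains("Server");
    }

    /** Names of the garbage collectors the JVM is currently using. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("VM name:        " + vmName());
        System.out.println("Server VM?      " + looksLikeServerVm());
        System.out.println("Collectors:     " + collectorNames());
        System.out.println("Max heap (MB):  " + maxHeapMb());
    }
}
```

Running this with and without the options under test shows whether the flags took effect before any benchmark numbers are compared.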

The high ratio of unique terms in the documents is mainly due to two
sets of unique identifiers.  The larger set does not need to be indexed,
since that key is not used in any query, and as such we are going
to change that field from Keyword to UnIndexed.

The queries are ad hoc, i.e. from users.  There is one primary field
that is used for the initial query.  The other fields are used as
filters on the data.

The initial query can return several thousand results.  Out of the
total hits we typically use the top 100 - 200.

As for the need for the aggressive updates, shrug, a business decision.

thanks much,
ori

On 1/24/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
> Hi Ori,
>
> Before taking drastic rehosting measures, and introducing the associated
> software complexity of splitting your application into pieces running
> on separate machines, I'd recommend looking at the way your document
> data is distributed and the way you're searching them.  Here are some
> questions that may help you find a less-complex solution:
>
> -   Is your high ratio of unique terms to documents due to a unique
> identifier in the documents?  If so, are you performing wildcard or
> range searches on that field?
>
> -   Are your queries "canned", i.e. hard-coded in form, or are they "ad
> hoc", coming from users?
>
> -   Do your queries refer to every field you've indexed?  On a similar
> note, does your application use every field you've indexed or stored in
> Lucene?
>
> -   How many documents do your queries hit typically?  How many of those
> hits do you typically use?
>
> -   How important is it that queries are run on up-to-the-second data?
> In other words, would the hits be pretty much as useful if the updates
> were batched up for a few runs per day, instead of continuous?
>
>
> One of the things I really like about Lucene is that one can quickly
> whip up an application and it basically works.  But, like most
> databases, small differences in organization can produce
> disproportionately large differences in performance when there are
> millions of rows/records/entries.  A little time spent examining data
> distribution and access patterns can go a long way.
>
> Good luck!
>
> --MDC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
