Chris,

 

Thanks for your comments -- it's great to hear that people have had success 
with very large indexes.

 

I'll be running on a 4-CPU (3.8GHz, 2GB RAM) Windows 2000 box, so hopefully 
I'll see some benefit from the ParallelMultiSearcher... If anyone has metrics 
to post on using ParallelMultiSearcher on a multi-CPU box, I'd really 
appreciate it.
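
For context, here's roughly how I'm planning to wire it up -- a minimal 
sketch, with made-up index paths and a placeholder default field:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class ParallelSearchDemo {
        public static void main(String[] args) throws Exception {
            // One IndexSearcher per physical index; ParallelMultiSearcher
            // fans the query out to the sub-searchers on separate threads,
            // which is where the extra CPUs should pay off.
            Searchable[] searchables = new Searchable[] {
                new IndexSearcher("/indexes/objectA"),  // paths made up
                new IndexSearcher("/indexes/objectB"),
                new IndexSearcher("/indexes/objectC")
            };
            ParallelMultiSearcher searcher =
                new ParallelMultiSearcher(searchables);

            Query query = QueryParser.parse("some terms", "SearchAll100",
                                            new StandardAnalyzer());
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " total hits");
        }
    }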

 

More assorted questions:

*       To summarize from below: I've got lots of distinct objects, each with 
lots of different fields (often with different boosts).  I tried having the 
server be aware of all the indexes and construct large OR queries, but running 
those queries takes a huge amount of memory.  As an alternative, I've 
concatenated all the core searchable fields into groups based on their boosts, 
so I end up with the same five fields in every index and adjust the boost 
accordingly (SearchAll130, SearchAll120, SearchAll100, SearchAll90, 
SearchAll75).  This works pretty well in terms of result quality and is much 
easier on memory (a sketch of the query side follows this list).  Has anybody 
else tried anything like this -- how did it work out?  Is there a better 
strategy for searching lots of distinct fields across lots of indexes?
*       I've been reading the posts on using Filter vs. BooleanQuery.  For a 
search-within-a-search, the Filter seems advantageous because it can be 
cached, but are there other pros or cons to consider (memory, speed, etc.)?  
(A sketch of what I mean follows this list.)
*       I'm interested in implementing a "dynamic filter" component that walks 
the hits[] object and pulls out the distinct values of certain fields to 
display as search-within-a-search options (each one is guaranteed to return at 
least one result, since it came from the hits[]).  Has anyone implemented 
something like this -- how did it work out?  (Sketch after the list.)
*       When using a ParallelMultiSearcher, is there an easy way to know how 
many matches came from each index searched?  I'd like to display how many of 
each object type are in the combined hits[]; since I'm storing one object type 
per index, the per-index hit count gives me exactly that.  (Sketch after the 
list.)
*       Does anyone know when 1.9/2.0 will be released?
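
Here's the query side of the boost-group scheme from the first bullet -- just 
a sketch; the field names are mine, and the boosts mirror the field suffixes:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class BoostGroupQuery {
        static final String[] FIELDS = { "SearchAll130", "SearchAll120",
                "SearchAll100", "SearchAll90", "SearchAll75" };
        static final float[] BOOSTS = { 1.30f, 1.20f, 1.00f, 0.90f, 0.75f };

        // One optional (non-required, non-prohibited) clause per
        // boost-group field, with the boost set at query time.
        public static Query build(String userInput) throws ParseException {
            BooleanQuery combined = new BooleanQuery();
            for (int i = 0; i < FIELDS.length; i++) {
                Query clause = QueryParser.parse(userInput, FIELDS[i],
                                                 new StandardAnalyzer());
                clause.setBoost(BOOSTS[i]);
                combined.add(clause, false, false);
            }
            return combined;
        }
    }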
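
For the Filter question, this is the shape I had in mind -- a sketch assuming 
a simple term-based narrowing (the category field and value are made up).  As 
I understand it, QueryFilter caches its BitSet per IndexReader, so reusing the 
filter object against the same searcher means the narrowing query only runs 
once:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TermQuery;

    public class SearchWithinSearch {
        // Build the filter once and hold onto it so its cached bits
        // can be reused across searches on the same reader.
        static final QueryFilter CATEGORY_FILTER =
            new QueryFilter(new TermQuery(new Term("category", "books")));

        public static Hits refine(Searcher searcher, Query refinedQuery)
                throws IOException {
            return searcher.search(refinedQuery, CATEGORY_FILTER);
        }
    }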
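
And the "dynamic filter" idea, sketched (the field name is a placeholder, and 
the field has to be stored for this to work):

    import java.io.IOException;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;

    public class DrillDownValues {
        // Walk the hits and collect the distinct values of a stored
        // field; each value is guaranteed to yield at least one result
        // when offered as a search-within-a-search option.
        public static Set distinctValues(Hits hits, String fieldName)
                throws IOException {
            Set values = new TreeSet();
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                String value = doc.get(fieldName);
                if (value != null) {
                    values.add(value);
                }
            }
            return values;
        }
    }

One concern with this approach: it fetches every stored document in the result 
set, which could get slow against the big indexes, so I may cap it at the 
first N hits.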
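
For the per-index counts, I was thinking MultiSearcher.subSearcher() might do 
it -- a sketch, assuming it works the way I read the javadocs:

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.MultiSearcher;

    public class PerIndexCounts {
        // subSearcher(docNumber) maps a merged document number back to
        // the position of the Searchable it came from in the array the
        // MultiSearcher was built with (ParallelMultiSearcher inherits
        // it).  With one object type per index, counts[k] is the hit
        // count for object type k.
        public static int[] countPerIndex(MultiSearcher searcher, Hits hits,
                                          int numIndexes) throws IOException {
            int[] counts = new int[numIndexes];
            for (int i = 0; i < hits.length(); i++) {
                counts[searcher.subSearcher(hits.id(i))]++;
            }
            return counts;
        }
    }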

Sorry for the long post....

Thanks,

Zach

 

 

        -----Original Message----- 
        From: Chris Lamprecht [mailto:[EMAIL PROTECTED] 
        Sent: Tue 8/30/2005 8:01 PM 
        To: java-user@lucene.apache.org 
        Cc: 
        Subject: Re: Ideal Index Fragmentation
        
        

        Zach,
        It probably won't help performance to split the index and then search
        it on the same machine unless you search the indexes in parallel (on
        a multiprocessor or multi-core machine).  Even then, the disk is
        often a bottleneck, essentially preventing the search from really
        running in parallel -- though if your index is in the filesystem
        cache, this may be fast.
        
        Notice I said "probably", "often", "may be" in the above paragraph --
        you just have to performance-test and measure it to see.  Start
        with a simple, one-index setup and see if that works.  A 2GB index
        isn't itself a problem (under Linux, at least); your queries and
        any pre/post-query processing then become the bigger factors.  I
        haven't run into any file-size limits under Linux with indexes up
        to 12GB (and others here have gone even larger).  The one exception
        is Lucene 1.9's MMapDirectory -- it's limited to (I believe) 4GB on
        a 32-bit platform.
        
        
        On 8/30/05, Friedland, Zachary (EDS - Strategy)
        <[EMAIL PROTECTED]> wrote:
        > Does anyone have experience using lots of indexes simultaneously with
        > the multisearcher?  I'm looking to index 15 distinct objects for
        > searching, and was thinking of creating 15 distinct indexes for better
        > manageability & performance (for certain searches when I know which
        > index to search).
        >
        > Certain indexes will be very large (2-3 million documents), but most
        > will be 50,000-500,000 documents.  Each document will contain a good
        > number of fields (20-50).
        >
        > I know there are issues with file handles, but what happens to
        > search performance (parallel vs. sequential) over lots of indexes?
        > Am I better off combining them into one huge index (the 2GB file
        > limit may be an issue), or is some fragmentation in between best?
        >
        > Any advice that you have had with a similar project would be greatly
        > appreciated.
        >
        > Thanks,
        > Zach