Re: Index Sizes

2005-05-13 Thread Vince Taluskie
Yes, you'll be fine with 100 million. I've got a couple of non-performance-sensitive indexes that are more than double that (280M) with about 20 searchable fields as well. We get results back in the 10-20 second range, which is fine for our end users. Vince On 5/13/05, Richard Krenek <[EMAIL PRO

Index Sizes

2005-05-13 Thread Richard Krenek
Hypothetically I have 100 million records. Each record has 100+ fields. Only 20 of those fields need to be searched on; the rest (including the 20) are just for display purposes. Would it be best to just add the 20 fields to the index and keep the rest in a relational database? What effect does all

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-05-13 Thread Luke Francl
On Tue, 2005-03-01 at 19:23, Chris Hostetter wrote: > I don't really consider reading/writing to an NFS mounted FSDirectory to > be viable for the very reasons you listed; but I haven't really found any > evidence of problems if you take the approach that a single "writer" > node indexes to local

RE: Search Theory Book

2005-05-13 Thread Monsur Hossain
> -Original Message- > From: Ian Soboroff [mailto:[EMAIL PROTECTED] > > Grossman and Frieder's book, "Information Retrieval, Algorithms and > Heuristics", is out in a second (and much cheaper, too!) edition, > probably the most up-to-date textbook. Much along the same lines, I'm curio

Re: Search Theory Book

2005-05-13 Thread Ian Soboroff
Gary Moore <[EMAIL PROTECTED]> writes: > Salton, Gerald and McGill, Michael J. /Introduction to Modern > Information Retrieval/. McGraw-Hill, 1983. Not only hard to get ahold of these days, but really really really out of date. This book should be of historical interest only. Frakes and Baez

Re: No HighLights for Phrase Query

2005-05-13 Thread mark harwood
Are you sure that 1) Your tokenStream emits terms identical to those produced by the query - a difference in choice of analyzer will emit tokens which don't correspond for the same text, e.g. "dog"!="Dog" 2) Your "body" string represents the same text of the field in the exact document which matched.
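The first point can be illustrated without any Lucene classes. The following is a minimal plain-Java sketch, assuming a toy whitespace tokenizer and a hypothetical `highlight` helper (neither is part of the Highlighter API): a token is marked only when its text exactly equals a query term, so a case-preserving token stream never matches a lower-cased query term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene's Highlighter.
class TermMatchSketch {
    // Wraps each whitespace-separated token in <b>...</b> when the token,
    // optionally lower-cased first, exactly equals one of the query terms.
    static String highlight(String text, Set<String> queryTerms, boolean lowerCaseTokens) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = lowerCaseTokens ? token.toLowerCase() : token;
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(queryTerms.contains(term) ? "<b>" + token + "</b>" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> queryTerms = new HashSet<>(Arrays.asList("dog"));
        // Tokens keep their case: "Dog" does not equal "dog", so nothing is marked.
        System.out.println(highlight("The Dog barked", queryTerms, false));
        // Tokens are lower-cased the same way as the query term: "Dog" is marked.
        System.out.println(highlight("The Dog barked", queryTerms, true));
    }
}
```

In a real setup the equivalent check would be that the token stream passed to the highlighter comes from the same analyzer that was used to parse the query.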

No HighLights for Phrase Query

2005-05-13 Thread Andrew Boyd
Hi, When I do a Phrase Query I do not get any highlights. Here is my call: highlighter = new Highlighter(new QueryScorer(query.rewrite(indexReader))); highlighter.getBestFragments(tokenStream, body, numPreviews, ELIPSE); I tried it without the rewrite but that didn't help. Thanks, Andrew --

Re: Lucene Search Capabilities.

2005-05-13 Thread Erik Hatcher
On May 12, 2005, at 10:24 AM, Goel, Nikhil wrote: 1) Lucene does the inverted indexing, by which we mean it keeps how many times a particular token is used. Is there a way to find out the list of most frequently used words in descending order? Have a look at Luke's code to see how it does th
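As a rough illustration of the question (listing terms by descending frequency), here is a plain-Java sketch. It does not use Lucene's API and is not how Luke does it: it simply counts lower-cased, whitespace-separated tokens across a few in-memory strings. The class name `TopTerms` and method `topTerms` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a real index would be walked via its term enumeration.
class TopTerms {
    // Counts each lower-cased, whitespace-separated token across all documents and
    // returns up to `limit` entries ordered by descending count (ties alphabetical).
    static List<Map.Entry<String, Integer>> topTerms(List<String> docs, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick fox", "the lazy dog", "the fox");
        for (Map.Entry<String, Integer> e : topTerms(docs, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For a large index, keeping only the top N terms in a bounded priority queue while scanning would avoid sorting every term; the full sort here is just the simplest form.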

finding potential duplicate documents

2005-05-13 Thread Marco Dissel
Hello, I've got many documents that are potentially duplicates (merging several external systems). Any tips on how to find documents that are potential duplicates (using a variable ranking like >0.5 match)? I can use the similarity (MoreLikeThis) method from Sandbox, but that's always comparing
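One way to picture a ">0.5 match" threshold is a pairwise token-overlap score. The sketch below is an assumption for illustration, not the MoreLikeThis approach and not something proposed in this thread: it uses Jaccard similarity over lower-cased word sets, and the names `NearDuplicates`, `jaccard`, and `candidatePairs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; compares every pair, so it is O(n^2) in the number of documents.
class NearDuplicates {
    // Splits on runs of non-word characters and lower-cases, returning the distinct tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity: |intersection| / |union| of the two token sets.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    // Returns index pairs {i, j} with i < j whose similarity is strictly above the threshold.
    static List<int[]> candidatePairs(List<String> docs, double threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                if (jaccard(docs.get(i), docs.get(j)) > threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Acme Corp 12 Main Street Springfield",
                "ACME Corporation, 12 Main Street, Springfield",
                "Globex Inc 99 Elm Road");
        for (int[] pair : candidatePairs(docs, 0.5)) {
            System.out.println(pair[0] + " ~ " + pair[1]);
        }
    }
}
```

Comparing all pairs will not scale to a large collection; in practice each document would be used as a query against the index so that only the top-scoring hits need to be compared.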
