Re: Announcement: Lucene powering CNET.com Product Category Listings

2005-08-30 Thread Chris Hostetter
: How large is the index? I'm not sure if I'm permitted to give out that info, but I do happen to recall seeing this page before... http://64.233.179.104/search?q=cache:qkHzwrcO1AAJ:www.cnetchannel.com/products/datasource.aspx+%22SKUs+in+production%22&hl=en ...so, yeah... you can draw whatever

Re: Search Results Clustering

2005-08-30 Thread Ray Tsang
I had similar requirements for "count" and "group by" on over 130 million records; it's really a pain. It's currently usable but not satisfactory. Currently it's grouping at run-time by iterating through ungrouped items. It collects matching documents into a BitSet, so subsequent queries can use BitSet
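The run-time grouping described above can be sketched without any Lucene dependency. This is only an illustration, not the poster's code: the class `GroupCounter`, the method `countByGroup`, and the in-memory `groupOfDoc` array (one group value per document id) are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

final class GroupCounter {
    /**
     * Counts matching documents per group value by walking the set bits.
     * groupOfDoc is a hypothetical lookup table: index = document id.
     */
    static Map<String, Integer> countByGroup(BitSet matches, String[] groupOfDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            counts.merge(groupOfDoc[doc], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] groupOfDoc = {"laptop", "camera", "laptop", "phone", "camera"};
        BitSet matches = new BitSet(groupOfDoc.length);
        matches.set(0);
        matches.set(2);
        matches.set(4);
        // Documents 0, 2 and 4 match: two laptops and one camera.
        System.out.println(countByGroup(matches, groupOfDoc));
    }
}
```

Because the BitSet is kept after the first pass, a later query over the same matches only repeats the counting loop, which is the reuse the message mentions.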

Re: Search Results Clustering

2005-08-30 Thread kapilChhabra (sent by Nabble.com)
Thanks a lot for your suggestion. I'll try it and get back if need be. Meanwhile, I gave it some thought and concluded that the best time to do the categorization/clustering should be when Lucene calculates Hits/in the Scorer. I am not sure if I am right. In addition to the current functionality can w

Re: Announcement: Lucene powering CNET.com Product Category Listings

2005-08-30 Thread Chris Lu
Very nice implementation and a great write up. How large is the index? And when you keep posting new content to the index, will you optimize the index? -- Chris Lu Lucene Search RAD on Any Database http://www.dbsight.net On 8/30/05, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > I

Announcement: Lucene powering CNET.com Product Category Listings

2005-08-30 Thread Chris Hostetter
I'm pleased to announce that for about a month now, CNET's "Product Listing" pages are powered by Lucene 1.4.3. These pages not only allow users to browse CNET's catalog of tech products by category, but also to "Filter" the lists according to category specific Attribute Filters which are display

RE: Ideal Index Fragmentation

2005-08-30 Thread Friedland, Zachary (EDS - Strategy)
Chris, Thanks for your comments -- it's great to hear that people have had success with very large indexes. I'll be running on a 4-CPU (3.8GHz, 2GB RAM) Windows 2000 box, so hopefully I'll get some advantages with the ParallelMultiSearcher... If anyone has some metrics to post on using t

Re: permission control or category-wise search with Lucene

2005-08-30 Thread David Medinets
I explored the idea of Role-Based Access Control using Lucene at http://affy.blogspot.com/2003/04/using-lucene-for-role-based-access.html.

Re: Ideal Index Fragmentation

2005-08-30 Thread Chris Lamprecht
Zach, It probably won't help performance to split the index and then search it on the same machine unless you search the indexes in parallel (with a multiprocessor or multi-core machine). Even in this case, the disk is often a bottleneck, essentially preventing the search from really running in pa
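The point about searching split indexes in parallel can be illustrated with plain `java.util.concurrent`. This is a rough sketch under stated assumptions, not Lucene's ParallelMultiSearcher: each "shard" is just a list of strings, the match test is a substring check, and the names `ParallelShardSearch` and `search` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelShardSearch {
    /** Searches every shard on its own thread and concatenates hits in shard order. */
    static List<String> search(List<List<String>> shards, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (List<String> shard : shards) {
                tasks.add(() -> {
                    List<String> hits = new ArrayList<>();
                    for (String doc : shard) {
                        if (doc.contains(term)) {
                            hits.add(doc);
                        }
                    }
                    return hits;
                });
            }
            List<String> merged = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> shards = Arrays.asList(
                Arrays.asList("lucene index", "hibernate mapping"),
                Arrays.asList("index optimize", "token filter"));
        System.out.println(search(shards, "index"));
    }
}
```

In memory the threads really do run side by side; with on-disk indexes sharing one drive, the disk contention mentioned above can cancel most of the gain.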

Ideal Index Fragmentation

2005-08-30 Thread Friedland, Zachary (EDS - Strategy)
Does anyone have experience using lots of indexes simultaneously with the multisearcher? I'm looking to index 15 distinct objects for searching, and was thinking of creating 15 distinct indexes for better manageability & performance (for certain searches when I know which index to search). Certai

Re: custom sort

2005-08-30 Thread Chris Hostetter
: You can just assign the field B some weight when creating the index? that implies that the field "A" being sorted on is SCORE ... which isn't always the case. : Is it possible to write a custom sort for a query such that the first : N documents that match a certain additional criteria get pus

RE: Lucene + Persistence

2005-08-30 Thread Friedland, Zachary (EDS - Strategy)
Peter, Check out Compass: http://compass.sourceforge.net/ It is a layer that can integrate Hibernate and Lucene for you... Thanks, Zach -Original Message- From: Peter Gelderbloem [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 10:52 AM To: java-user@lucene.apache.org Subject:

RE: Index files in jar

2005-08-30 Thread Chris Hostetter
As discussed in the past... > The problem is that a jar file entry becomes an InputStream, but > InputStream is not random access, and Lucene requires random access. So > you need to extract the index either to disk or RAM in order to get http://mail-archives.apache.org/mod_mbox/lucene-java-use
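Extracting to disk, as the quoted advice suggests, might look like the sketch below. It uses only the JDK; the class `IndexExtractor`, the method `extract`, and the resource path `/index/segments` are hypothetical, and a real index consists of several files that would each be copied the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class IndexExtractor {
    /** Copies one stream (for example a jar entry) into a fresh temp directory. */
    static Path extract(String fileName, InputStream in) {
        try {
            Path dir = Files.createTempDirectory("extracted-index");
            Path target = dir.resolve(fileName);
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical resource name, used only for illustration.
        try (InputStream in = IndexExtractor.class.getResourceAsStream("/index/segments")) {
            if (in == null) {
                System.out.println("resource not found");
                return;
            }
            System.out.println("extracted to " + extract("segments", in));
        }
    }
}
```

Once the files are on disk they are ordinary files that support seeking, so the directory that contains them can be handed to whatever opens the index.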

Re: query performance behavior not as expected

2005-08-30 Thread Chris Hostetter
: The obvious answer here might be to use a filter for the first : (required) clause and then query again using that filter for the other : terms. The problem I foresee with that solution is that I can't easily : re-use the filters because of the sheer number of combinations of terms : and the nee

Re: Search Results Clustering

2005-08-30 Thread Chris Hostetter
: Suppose I cluster the results only on the 1st field i.e. I do not show : the constituent clusters. Even in this case, i'll require around 900 : Filters[i have 900 unique terms] in memory and will have to run the same : query 900 times, 1 on each Filter. I am sitting at a situation where I : get

Empty index after building?

2005-08-30 Thread Dan Quaroni
I didn't notice any exceptions and unfortunately I built these 2 long enough ago that I have no logs left. Anyway, I built 2 indexes using a process that I've built hundreds of indexes successfully with, and these two indexes seem to contain no documents despite being pretty large (about a gig)

Re: Did you mean?

2005-08-30 Thread markharw00d
The "did you mean" implementation should ideally use all of the other words in a query as context to guide the selection of spelling alternatives. Google appear to do this - not sure if they use the doc content or user queries to suggest the alternatives. I've got some colocation finding code wh

Re: Did you mean?

2005-08-30 Thread Otis Gospodnetic
I wonder if it would further help for the spell checker to make use of something like WordNet (for English only), where low-frequency words are "double-checked" against WordNet before being considered correct. Otis --- Tom White <[EMAIL PROTECTED]> wrote: > On 8/29/05, Chris Lu <[EMAIL PROTECTED]> wro

Lucene + Persistence

2005-08-30 Thread Peter Gelderbloem
Hi, First off, I would just like to thank everyone who has contributed to the gem we all know as Lucene. I am thinking of using Lucene purely for text indexing and using a persistence mechanism like Hibernate to search structured data. Would it be a good idea to use filters that do hibernate quer

Re: Lucene in clustered environment

2005-08-30 Thread Erik Hatcher
Seema - please stop cross-posting your mails to those three e-mail lists. java-user is the most appropriate list for your posts. Erik On Aug 30, 2005, at 8:07 AM, seema pai wrote: How to use Lucene with File system Indexing on WebSphere application server deployed in a cluster ? On

Re: Inconsistent tokenizing of words containing underscores.

2005-08-30 Thread Erik Hatcher
Another solution would be for you to create a custom TokenFilter that split tokens at "_" characters and then a custom Analyzer that used that filter after the StandardTokenizer. Erik On Aug 30, 2005, at 6:52 AM, Is, Studcio wrote: Hello, first of all thanks to everyone for replies a
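The splitting step itself is simple; a library-independent sketch follows. It deliberately does not use Lucene's TokenFilter API: the class `UnderscoreSplitter` and its `split` method are invented for illustration and only show the transformation such a filter would apply to each token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UnderscoreSplitter {
    /** Splits every token at "_" and drops the empty pieces. */
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.split("_")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "foo_bar" becomes two tokens; "_x_" keeps only "x".
        System.out.println(split(Arrays.asList("foo_bar", "baz", "_x_")));
    }
}
```

Whatever form the real filter takes, the same splitting has to be applied at both index time and query time, otherwise the terms will not line up.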

Re: Books about Lucene?

2005-08-30 Thread Karl Koch
Hello group, thank you for all your discussion, suggestions and help. I thought I would run some investigations on that source code with Lucene 1.2 and document them. With the help of chen I might be able to create a version that can do the job. Perhaps we can then create some small footprint solution

RE: custom sort

2005-08-30 Thread Mordo, Aviran (EXP N-NANNATEK)
When using sort there is no meaning for weight. Aviran http://www.aviransplace.com -Original Message- From: Chris Lu [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 12:35 AM To: java-user@lucene.apache.org; raymondcreel Subject: Re: custom sort You can just assign the field B

Re: Lucene in clustered environment

2005-08-30 Thread seema pai
How to use Lucene with File system Indexing on WebSphere application server deployed in a cluster ? On 8/30/05, seema pai <[EMAIL PROTECTED]> wrote: > > Hi > > My site has large database of Television and Movie titles, in English, > Spanish language. The movie data starts from year 1928 ti

RE: Corrupted indexes

2005-08-30 Thread Eric Bressler
I was going to send out the answer to this problem this morning. I found it around 2am last night. I was mistaken that I had corrupted the indexes. The real problem was that I had forgotten the constructor I was using for the index writer and had it like writer = new IndexWriter(indexlocation, new

RE: Inconsistent tokenizing of words containing underscores.

2005-08-30 Thread Is, Studcio
Hello, first of all thanks to everyone for replies and suggestions. I solved my problem by adapting the StandardTokenizer.jj and compiling it using javacc. I replaced line 90: <ALPHANUM: (<LETTER>|<DIGIT>)+ > with <ALPHANUM: (<LETTER>|<DIGIT>|"_")+ > so that underscore is treated like alphanumeric characters. In my first tests, it seems to work

RE: Corrupted indexes

2005-08-30 Thread Peter Veentjer - Anchor Men
What kind of corruption do you get? Do the files get corrupted (unusable/unreadable), or do you get multiple items in the index? -Original Message- From: Eric Bressler [mailto:[EMAIL PROTECTED] Sent: Monday, August 29, 2005 23:18 To: java-user@lucene.apache.org Subject: Cor

Re: index files in jar file

2005-08-30 Thread Miles Barr
On Fri, 2005-08-26 at 16:31 -0400, Thomas Lepkowski wrote: > I have a set of index files that I'd like to distribute with my Java > application. The only way this seems practical is to place the index files > in a jar file. I tried this, but the search choked when I told IndexSearcher > the inde

Re: Did you mean?

2005-08-30 Thread Tom White
On 8/29/05, Chris Lu <[EMAIL PROTECTED]> wrote: > > > Two approaches I can think of: > * Use a word list(it may not be the word list you want, but it is just > a compromise). > * Analyze your original index, listing out all words inside. > > Using a word list suffers from two problems: 1. (Cove

Re: IndexReader delete(int i)

2005-08-30 Thread dozean
Hi Yonik, thank you very much!! Now it works very well!! The formula "numDocs() == maxDoc() - number_of_deleted_docs" should be stated in the API! :) Thank you again! Bye Derya > --- Original Message --- > From: Yonik Seeley <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Subject:
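The relationship in that formula can be shown with a tiny stand-alone model. This is not Lucene code: the class `DocCounts` is invented for the example, and a `BitSet` of deleted document ids (all assumed to be below the total slot count) stands in for the reader's internal deletion flags.

```java
import java.util.BitSet;

final class DocCounts {
    /** Live documents = total document slots minus slots flagged as deleted. */
    static int numDocs(int maxDoc, BitSet deleted) {
        return maxDoc - deleted.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 5;              // five slots were ever allocated
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(1);              // documents 1 and 3 are marked deleted
        deleted.set(3);
        System.out.println(numDocs(maxDoc, deleted)); // prints 3
    }
}
```

The takeaway is that a delete only flags a slot, so the slot count stays the same while the live count drops.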