RE: Simple Search Question.

2005-03-14 Thread Kyong Kwak
Thanks, works like a charm. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Monday, March 14, 2005 11:05 AM To: java-user@lucene.apache.org Subject: Re: Simple Search Question. On Monday 14 March 2005 19:59, Kyong Kwak wrote: > > I looked and didn't find anything

Re: Removing similar documents from search results

2005-03-14 Thread Dawid Weiss
I think what they do at Google is a fancy heuristic -- as David Spencer mentioned, suburls of a given page, identical snippets, or titles... My idea was more towards providing a 'realistic overview' of subjects in pages. So you could pick, say, the first document from each cluster and show them
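
The "first document from each cluster" idea can be sketched roughly as follows. This is only an illustration: it assumes a clustering step has already assigned a cluster label to every ranked hit, and the class and method names (`ClusterCollapse`, `firstPerCluster`) are hypothetical, not part of Lucene or of any clustering library.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ClusterCollapse {
    /**
     * Keeps only the highest-ranked hit of each cluster.
     * rankedDocIds[i] is the i-th hit in score order and clusterOf[i]
     * is the (assumed, precomputed) cluster label of that hit.
     */
    static List<Integer> firstPerCluster(int[] rankedDocIds, int[] clusterOf) {
        Set<Integer> seenClusters = new HashSet<>();
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rankedDocIds.length; i++) {
            // Set.add returns true only the first time a cluster label is seen.
            if (seenClusters.add(clusterOf[i])) {
                kept.add(rankedDocIds[i]);
            }
        }
        return kept;
    }
}
```

Because the hits are walked in rank order, the surviving document of each cluster is its best-scoring member, and the original ordering is preserved.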

Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
Otis Gospodnetic wrote: The problem with 2c is that scores are currently relative, and not absolute. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one. Good point. If the orig MoreLikeThis query allows the source doc to be re

Re: Simple Search Question.

2005-03-14 Thread Otis Gospodnetic
This will help: http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermEnum.html Otis --- Kyong Kwak <[EMAIL PROTECTED]> wrote: > > I looked and didn't find anything and wanted to know what the best > way > might be for getting a unique list of values in a given field? > so if I hav
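
The idea behind the TermEnum suggestion is that the index keeps its terms sorted by field and then by text, so the unique values of one field can be read off by seeking to the first term of that field and walking forward until the field changes. The sketch below imitates that walk with a plain `TreeMap`; it does not call Lucene, and `UniqueFieldValues`, `addTerm`, and `uniqueValues` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class UniqueFieldValues {
    // Stand-in for a sorted term dictionary: key is field + '\0' + text,
    // value is the number of times the term was added.
    private final TreeMap<String, Integer> termDict = new TreeMap<>();

    void addTerm(String field, String text) {
        termDict.merge(field + '\u0000' + text, 1, Integer::sum);
    }

    /** Returns each distinct value of the field once, in sorted order. */
    List<String> uniqueValues(String field) {
        String prefix = field + '\u0000';
        List<String> values = new ArrayList<>();
        // Seek to the first term of the field, then stop when the field changes.
        for (String key : termDict.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            values.add(key.substring(prefix.length()));
        }
        return values;
    }
}
```

With a real index the same loop shape applies: position the enumeration at the first term of "category", collect term texts, and stop as soon as a term from another field appears.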

Re: Simple Search Question.

2005-03-14 Thread Paul Elschot
On Monday 14 March 2005 19:59, Kyong Kwak wrote: > > I looked and didn't find anything and wanted to know what the best way > might be for getting a unique list of values in a given field? > so if I have a field named "category" ( it's a keyword ) and I wanted to > get all the unique values for th

Re: Luceneweb.war

2005-03-14 Thread Otis Gospodnetic
Nutch is a full-blown search engine (fetcher/crawler, web links databases, etc.). luceneweb.war is simply a web-app with a Lucene demo. Lucene is only a toolkit, not a full-blown application. Otis --- Hasan Diwan <[EMAIL PROTECTED]> wrote: > I just checked out a copy of the svn sources a

Re: Removing similar documents from search results

2005-03-14 Thread Otis Gospodnetic
The problem with 2c is that scores are currently relative, and not absolute. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one. Otis --- David Spencer <[EMAIL PROTECTED]> wrote: > Miles Barr wrote: > > > Has anyone tried

Simple Search Question.

2005-03-14 Thread Kyong Kwak
I looked and didn't find anything and wanted to know what the best way might be for getting a unique list of values in a given field? so if I have a field named "category" ( it's a keyword ) and I wanted to get all the unique values for that, how would I go about it? thanks!

Luceneweb.war

2005-03-14 Thread Hasan Diwan
I just checked out a copy of the svn sources and was wondering what the difference is between luceneweb.war and nutch. I'm certain there must be differences, else there wouldn't be two different projects. -- Cheers, Hasan Diwan <[EMAIL PROTECTED]> -

Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see: "In order to show you the most relevant results, we have omitted some entrie

Re: Removing similar documents from search results

2005-03-14 Thread Miles Barr
Hi Dawid, On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote: > I can imagine if you apply clustering to search results anyway then the > information about clusters can help you determine 'similar' results and > reorder the output list. That's an interesting idea. How easy is it to 'tighten'

Re: Removing similar documents from search results

2005-03-14 Thread Dawid Weiss
Hi Miles :) I can imagine if you apply clustering to search results anyway then the information about clusters can help you determine 'similar' results and reorder the output list. Just a thought. D. Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It loo

Removing similar documents from search results

2005-03-14 Thread Miles Barr
Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see: "In order to show you the most relevant results, we have omitted some entries very similar to

WildCard search replacement

2005-03-14 Thread Volodymyr Bychkoviak
Hi all. I have a large index of documents (about 1.6 million). One field (for example called “number”) contains a string of digits. I need to do a wildcard search on this field such as “*expression*” (i.e. all documents that contain “expression” in this field). When I run such a search with very short e

RE: SPECIFIC HIT

2005-03-14 Thread Karthik N S
Hi Guys, The process is correct, but it is impossible to have the optional terms. The documents we index number in the millions, with similar word trailers. Any other ideas? Please advise. Thx in advance, Karthik -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Monday

Re: SPECIFIC HIT

2005-03-14 Thread sergiu gordea
Karthik N S wrote: Hi Guys, Is there a way around for which the query parser would have something like this (+digital +camera +optics) -(All other Default variables) But at run time one cannot determine the default values. I am stuck in between for this cause :(D You can ask the u

RE: SPECIFIC HIT

2005-03-14 Thread Karthik N S
Hi Guys, Is there a way around for which the query parser would have something like this (+digital +camera +optics) -(All other Default variables) But at run time one cannot determine the default values. I am stuck in between for this cause :(D -Original Message- From: ser

RE: SPECIFIC HIT

2005-03-14 Thread Bram Kouwenberg
Well, if I understand the workings of the TF/IDF model used by Lucene correctly, then doc 6 should score lower than 3 because of the extra noise caused by 'CABEL ACCESSORIES', and setting the threshold high enough for feedback of the highest score should do the trick. Right? Bram Kouwenberg
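
The "extra noise" effect can be illustrated with a deliberately simplified score. The sketch assumes a length normalization of 1/sqrt(number of terms in the field), which to my understanding is what Lucene's default similarity uses, and it ignores idf, coord, query normalization, and norm quantization entirely. `LengthNormSketch` and its methods are hypothetical names, not Lucene API.

```java
class LengthNormSketch {
    // Assumed length normalization: longer fields are scaled down.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Counts query terms present in the document and scales by lengthNorm. */
    static double score(String[] docTerms, String[] queryTerms) {
        int matches = 0;
        for (String q : queryTerms) {
            for (String t : docTerms) {
                if (t.equals(q)) {
                    matches++;
                    break; // count each query term at most once
                }
            }
        }
        return matches * lengthNorm(docTerms.length);
    }
}
```

For the query terms DIGITAL and CAMERA, a document containing only ELECTRONICS DIGITAL CAMERA scores higher under this sketch than one containing ELECTRONICS DIGITAL CAMERA BATTERY ACCESSORIES, because the same two matches are divided by a larger field length.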

Re: SPECIFIC HIT

2005-03-14 Thread sergiu gordea
Karthik N S wrote: Hi Guys Apologies... I have indexed documents successfully and they would be Document 1 contains = ELECTRONICS DIGITAL CAMERA Document 2 contains = ELECTRONICS DIGITAL CAMERA BATTERY ACCESSORIES Document 3 contains = ELECTRONICS DIGITAL CAME

SPECIFIC HIT

2005-03-14 Thread Karthik N S
Hi Guys Apologies... I have indexed documents successfully and they would be Document 1 contains = ELECTRONICS DIGITAL CAMERA Document 2 contains = ELECTRONICS DIGITAL CAMERA BATTERY ACCESSORIES Document 3 contains = ELECTRONICS DIGITAL CAMERA OPTICS Document 4 conta