Re: Any way to ignore repeated terms in TF calculation?

2009-01-15 Thread Israel Tsadok
Hi Umesh, > I am trying to put the problem more concisely. > 1. Fields where term frequency is very relevant. E.g. > Body: > Example: > if TF of badger in Body of doc 1 > TF of badger in Body of doc 2 > doc 1 scores higher. > > 2. Fields where term frequency is irrelevant >
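
To make the two field types in the quoted message concrete, here is a minimal sketch in plain Java, independent of Lucene. It contrasts a square-root TF factor (the shape Lucene's DefaultSimilarity uses for tf) with a presence-only factor that ignores repeats. The class and method names are hypothetical, and how such a factor would be applied per field is an assumption, not something the thread snippet shows.

```java
// Hypothetical sketch, not Lucene code: two ways to turn a raw
// within-document term count into a TF scoring factor.
class TfFactors {
    // Square-root damping: more occurrences still score higher,
    // but with diminishing returns.
    static float dampedTf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Presence-only factor: any number of occurrences counts the same,
    // so repeated terms are effectively ignored.
    static float binaryTf(int freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        System.out.println(dampedTf(4)); // 2.0
        System.out.println(binaryTf(4)); // 1.0
    }
}
```

With the presence-only factor, a field containing "badger" once and a field containing it seven times contribute the same TF value.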

Re: Google finance-like suggestible search field

2009-01-15 Thread Asbjørn A . Fellinghaug
Hi. Such 'autocompletion' features with Lucene could be provided with n-gram tokenizers, as Erick states. I made a 'Bigram' analyzer for my master's thesis, when I was doing some research on how to enhance phrase searching. This Analyzer considers pairs of words as single terms. Basically, what the
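
As a rough illustration of the word-pair idea described in this message (not the author's actual analyzer, which the snippet does not show), the following plain-Java sketch turns a whitespace-separated string into word bigrams. The class and method names are hypothetical, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: emits each adjacent
// pair of words as a single term.
class WordBigrams {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out; // nothing to pair
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of two words across the input.
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [google finance, finance search]
        System.out.println(bigrams("Google finance search"));
    }
}
```

A single-word input produces no bigrams, so a real analyzer would probably also need to emit the single words themselves; that choice is left open here.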

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread Murat Yakici
Let's please not forget the scoring function. Yes, *query* is important, however, everyone in IR knows that two different scoring functions may return two different sets of results for the same query! David, I think you have to be more explicit here. What exactly are you trying to do? Are you g

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread david muchangi
Dear All, Thanks for your feedback. I want to do research on how Lucene performs compared to Latent Semantic Analysis in terms of recall and precision. I welcome ideas on this. Does anyone know of a software tool using latent semantic analysis that I could download and try? At the moment I am

Term Frequency and IndexSearcher

2009-01-15 Thread Paul Lynch
Hi, I know it is very easy to get the frequency of a given term using the IndexReader, but I am looking to perform an index search and would like to get the frequency of the given term in the result set. Is this possible? Thanks in advance, Paul

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread Donna L Gresh
I don't think this question makes a whole lot of sense in isolation: precision and recall are all about the *query*, and that is the art of the developer; what is the appropriate query for your particular application. Lucene does just great telling you which documents had which terms and which t

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread Murat Yakici
I am not aware of any open source LSA framework out there. If you are interested in PLSA, Lemur has got an implementation. In a "simplest" sense Lucene is using a type of TFIDF scoring mechanism. If you are not really concerned with Lucene's particular implementation, then just use Lemur for your

Re: Term Frequency and IndexSearcher

2009-01-15 Thread Murat Yakici
Hi Paul, I am tempted to suggest the following (I am assuming here that the document and the particular fields are TFVed when indexing): For every doc in the result set: - get the doc id - using the doc id, get the TermFreqVector of this document from the index reader (tfv=ireader.getTermFr
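
The loop outlined in this message can be sketched roughly as follows. This is plain Java with no Lucene calls: the per-document term vector lookup is replaced by an ordinary map from document id to term counts, which is a stand-in assumed for illustration, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Arrays;
import java.util.Map;

// Hypothetical sketch: sums the frequency of one term over the
// documents in a result set, using stored per-document term counts.
class ResultSetTermFrequency {
    // termVectors maps docId -> (term -> frequency); it stands in for
    // whatever the index reader would return for each document.
    static int frequencyInHits(List<Integer> hitDocIds,
                               Map<Integer, Map<String, Integer>> termVectors,
                               String term) {
        int total = 0;
        for (Integer docId : hitDocIds) {
            Map<String, Integer> vector = termVectors.get(docId);
            if (vector == null) {
                continue; // no term vector stored for this document
            }
            Integer freq = vector.get(term);
            if (freq != null) {
                total += freq;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Integer>> vectors =
                new HashMap<Integer, Map<String, Integer>>();
        Map<String, Integer> doc0 = new HashMap<String, Integer>();
        doc0.put("badger", 3);
        Map<String, Integer> doc1 = new HashMap<String, Integer>();
        doc1.put("badger", 2);
        vectors.put(0, doc0);
        vectors.put(1, doc1);
        // Only document 0 is in the result set, so this prints 3.
        System.out.println(frequencyInHits(Arrays.asList(0), vectors, "badger"));
    }
}
```

Only documents in the hit list contribute, which is what distinguishes this from an index-wide term frequency.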

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread Grant Ingersoll
On Jan 15, 2009, at 8:43 AM, Murat Yakici wrote: I am not aware of any open source LSA framework out there. If you are interested in PLSA, Lemur has got an implementation. In a "simplest" sense Lucene is using a type of TFIDF scoring mechanism. If you are not really concerned with Lucene's

clustering with compass & terracotta

2009-01-15 Thread Angel, Eric
I just ran into this http://www.compass-project.org/docs/2.0.0/reference/html/needle-terracotta.html and was wondering if any of you had tried anything like this and if so, what your experience was like. Eric

Re: clustering with compass & terracotta

2009-01-15 Thread Glen Newton
There is a discussion here: http://www.terracotta.org/web/display/orgsite/Lucene+Integration Also of interest: "Katta - distribute lucene indexes in a grid" http://katta.wiki.sourceforge.net/ -glen http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html http://zzzoot.blo

RE: Google finance-like suggestible search field

2009-01-15 Thread Hayes, Peter
>First, it's a legitimate question whether matching on single-letter >prefixes is useful for the user. If you're running into TooManyClauses, >that means (if you haven't changed the defaults) that there are more >than 1024 possibilities. Which is far too many for the user to scan through. That is

RE: Google finance-like suggestible search field

2009-01-15 Thread Hayes, Peter
Thanks for your input. I will try and apply your suggestion. Thanks, Peter -Original Message- From: Asbjørn A. Fellinghaug [mailto:asbj...@fellinghaug.com] Sent: Thursday, January 15, 2009 3:25 AM To: java-user@lucene.apache.org Subject: Re: Google finance-like suggestible search field

Re: ORs and Ranks

2009-01-15 Thread Chris Hostetter
: The question I'm trying to phrase is: Is there a way to make the rank of : SHOULD term conditional? : : In the example, I'm trying to express "If the term MEDICAL is found, the : term CAT ranks high; if the term ANIMAL is found, the term CAT ranks low." except that there is an ambiguous si

Re: Any way to ignore repeated terms in TF calculation?

2009-01-15 Thread Chris Hostetter
: This is not quite what I was talking about. I was talking about documents : with a single field. I want the text "Badgers are mammals. Badgers are cute" : to score higher than the text "Badger Badger" for the term query : "text:badger". : Ideally, what I want is to add another factor to the scor

Re: lucene nicking my memory ?

2009-01-15 Thread Magnus Rundberget
Hi, I forgot to thank everyone who replied. It seems that caching the IndexSearcher (properly :-) did the trick in terms of more deterministic memory usage... and more importantly giving a substantial performance boost. Did lots of other optimization of the queries (using rangefilter ra