Re: two fields, the first important than the second

2012-04-27 Thread Akos Tajti
Thanks for the detailed explanation. But as I understand it, this query will still match only documents that contain both terms (either in the same field or in different fields). What if there's a document that contains only "hello"? This query will not find it, am I right? But what we want to achieve is thi

Re: lucene algorithm ?

2012-04-27 Thread Li Li
On Thu, Apr 26, 2012 at 5:13 AM, Yang wrote: > > I read the paper by Doug "Space optimizations for total ranking", > > since it was written a long time ago, I wonder what algorithms lucene uses > (regarding postings list traversal and score calculation, ranking) > > > particularly the total rankin

Re: two fields, the first important than the second

2012-04-27 Thread Li Li
+(title:hello title:world desc:hello desc:world) (+title:hello +title:world)^100 (+desc:hello +desc:world)^50 (+title:hello +desc:world)^10 (+desc:hello +title:world)^10

The boost values (100, 50, 10, 10) should be carefully adjusted. If the tf of a document is very large, 10 may not be enough. You can mo
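A minimal sketch of how the query string above could be assembled for any two fields and two terms. The class and method names (`BoostedQuerySketch`, `buildQuery`, `both`) are hypothetical helpers, not Lucene API; the result is plain text meant to be handed to a query parser. Assuming the parser's default OR operator, the first required group is satisfied by any one of its four terms, and the boosted groups only add to the score.

```java
// Hypothetical helper: builds the boosted query string from the message above.
// Nothing here calls Lucene; the output is only the query text.
final class BoostedQuerySketch {

    // One optional group that requires both field:term pairs, with a boost.
    static String both(String fieldA, String termA,
                       String fieldB, String termB, int boost) {
        return "(+" + fieldA + ":" + termA + " +" + fieldB + ":" + termB + ")^" + boost;
    }

    // f1 is the more important field, f2 the less important one.
    static String buildQuery(String f1, String f2, String t1, String t2,
                             int bothInFirst, int bothInSecond, int mixed) {
        // Required group: at least one term in at least one field.
        String any = "+(" + f1 + ":" + t1 + " " + f1 + ":" + t2 + " "
                          + f2 + ":" + t1 + " " + f2 + ":" + t2 + ")";
        return any
                + " " + both(f1, t1, f1, t2, bothInFirst)   // both terms in f1
                + " " + both(f2, t1, f2, t2, bothInSecond)  // both terms in f2
                + " " + both(f1, t1, f2, t2, mixed)         // terms split across fields
                + " " + both(f2, t1, f1, t2, mixed);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("title", "desc", "hello", "world", 100, 50, 10));
    }
}
```

The boosts are parameters rather than constants because, as noted above, they need tuning against the real index.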

Storing same field twice (analyzed+not-analyzed), sorting

2012-04-27 Thread Francisco A. Lozano
Hi, I'm storing a field twice, once analyzed and once non-analyzed, so that I can query both for terms and for the exact keyword:

// Analyzed version
d.add(new Field(key, value, Store.NO, Index.ANALYZED, T

Reverse keyword search?

2012-04-27 Thread Uncle
Hello, I am relatively new to Lucene, this might be a noob question, if so please redirect me. I'd like some guidance on how to use Lucene to address a problem. I have a set of a few hundred (and growing) user-defined keywords such as "spain" and "volkswagen" and each of which is associated to

Re: Storing same field twice (analyzed+not-analyzed), sorting

2012-04-27 Thread Erick Erickson
Hmmm, putting analyzed and unanalyzed values in the same field seems like it'd be difficult to get right. In the Solr world, two separate fields are usually used. Sorting is right out, the results are unpredictable. What does it mean to sort on a field with multiple tokens? For a doc with "aardva

RE: Storing same field twice (analyzed+not-analyzed), sorting

2012-04-27 Thread Vinaya Kumar Thimmappa
Why don't you store the keyword-related data in a keywords field, which can be analyzed, and leave the other field as it is now? Then every field that needs keywords moves to the keywords section. -v -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, April 27,

Re: Storing same field twice (analyzed+not-analyzed), sorting

2012-04-27 Thread Francisco A. Lozano
I cannot do that, I need to query for specific fields, both for the whole value in a term (keyword) and for fuzzy/phrase... For the sorting I will probably take Erick Erickson's suggestion - use a separate non-analyzed field for sorting. Makes sense. The other problem (querying both by whole key

Re: Reverse keyword search?

2012-04-27 Thread Ahmet Arslan
> This appears to be somewhat the reverse of the typical > Lucene use case -- rather than having a set of say 1000 of > articles which are indexed, then issuing a query using a few > keywords to search on those articles, I have a set of say > 1000 keywords, and a single article, and I want to deter

Similarity coefficient for more exact matching

2012-04-27 Thread Maxim Terletsky
Hi guys, I have a field, Analyzed, Store.NO. Suppose one Document has the value "Hello" in this field, and another one has "Hello world , one, two, three, four". Since the field is Analyzed (with norms), the "one, two, three, four" will definitely affect the resulting rating in case we search for "Hello wor

Re: Similarity coefficient for more exact matching

2012-04-27 Thread Ian Lea
You can override org.apache.lucene.search.Similarity/DefaultSimilarity to tweak quite a lot of stuff. computeNorm() may be the method you are interested in. Called at indexing time so be sure to use the same implementation at index and query time, using IndexWriterConfig.setSimilarity() and Index
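To make the effect of the norm concrete, here is a small standalone sketch. It assumes the length norm of Lucene 3.x's DefaultSimilarity is 1/sqrt(numTerms) (times any boost), which is my understanding of the default; the class `LengthNormSketch` and its tokenizer are illustrative stand-ins, not Lucene classes.

```java
import java.util.Locale;

// Standalone arithmetic only; this does not extend or call Lucene's Similarity.
final class LengthNormSketch {

    // Crude stand-in for an analyzer: lowercase, split on anything but a-z.
    static int countTokens(String text) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Assumed default: the more terms a field has, the smaller its norm.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        String shortDoc = "Hello";
        String longDoc = "Hello world , one, two, three, four";
        System.out.println(lengthNorm(countTokens(shortDoc))); // 1 term
        System.out.println(lengthNorm(countTokens(longDoc)));  // 6 terms
    }
}
```

With these two example values the one-term field gets a norm of 1.0 and the six-term field about 0.41, so the extra terms do pull the longer document's score down. An overridden computeNorm() would change exactly this factor.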

Re: lucene algorithm ?

2012-04-27 Thread Yang
Thanks Ralf. basically you are talking about selectivity of columns in a JOIN, right? but in my above example, "yellow dog", both terms are very common, and both have long postings lists. Yang On Thu, Apr 26, 2012 at 12:17 AM, Ralf Heyde wrote: > Hi, > > i dont know the correct implementati

Re: lucene algorithm ?

2012-04-27 Thread Yang
yes, that's why many search engines will not allow user visit page > number greater than a threshold. for most application, users usually > only visit top results. That's why ranking algorithm is important. if > you found your users always turn to next page, I think you should > consider your appli

Calculating IDF value more efficiently

2012-04-27 Thread Kasun Perera
This is my program to calculate the TF-IDF value for a document in a collection of documents. This is working fine, but takes a lot of time when calculating the "IDF" values (finding the number of documents which contain a particular term). Is there a more efficient way of finding the no of documents which c
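One common way to avoid rescanning the collection for every term is to count document frequencies once, in a single pass, and look them up afterwards. The sketch below is plain Java with hypothetical names (`IdfSketch`, `documentFrequencies`, `idf`); it assumes documents are already tokenized and uses the formula 1 + ln(numDocs / (docFreq + 1)), which as far as I recall is what DefaultSimilarity uses. Inside Lucene itself the index already stores this count, available through `IndexReader.docFreq(Term)`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative only: document frequencies computed once, not once per term.
final class IdfSketch {

    // One pass over the collection; each document counts a term at most once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

After the single pass, each IDF lookup is a hash map access instead of a scan over all documents.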

Indexing with Semantics

2012-04-27 Thread Kasun Perera
I'm using Lucene's Term Freq vector to calculate cosine similarity between documents. Say my documents have these 3 terms, "owe" "owed" "owing". Lucene takes these as 3 separate terms, but all 3 of them mean the same, "owe". Is there any functionality in Lucene that can be used to index by semantics? so that

Re: Indexing with Semantics

2012-04-27 Thread Li Li
Use a stemmer. "Semantic" is a "large" word, so take care using it. On Sat, Apr 28, 2012 at 11:02 AM, Kasun Perera wrote: > I'm using Lucene's Term Freq vector to calculate cosine similarity between > documents, Say my documents has these 3 terms, "owe" "owed" "owing". Lucene > takes this as 3 separate terms, but
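To show what stemming does to the term frequencies, here is a toy sketch. The suffix rules are a deliberately crude assumption made up for this example, not the Porter algorithm and not Lucene code; in a real index you would put a stemming token filter such as PorterStemFilter in the analyzer chain. The names `ToyStemSketch`, `stem` and `termFrequencies` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a few hard-coded suffix rules, far from a real stemmer.
final class ToyStemSketch {

    static String stem(String word) {
        String s = word.toLowerCase(Locale.ROOT);
        if (s.length() > 4 && s.endsWith("ing")) {
            s = s.substring(0, s.length() - 3);   // owing -> ow
        } else if (s.length() > 3 && s.endsWith("ed")) {
            s = s.substring(0, s.length() - 2);   // owed -> ow
        }
        if (s.length() > 2 && s.endsWith("e")) {
            s = s.substring(0, s.length() - 1);   // owe -> ow
        }
        return s;
    }

    // Term frequencies counted on stems instead of surface forms.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(stem(token), 1, Integer::sum);
        }
        return tf;
    }
}
```

With these rules "owe", "owed" and "owing" all collapse to the single stem "ow", so the term vector holds one term with frequency 3 instead of three terms with frequency 1. The stem does not have to be a dictionary word; it only has to be the same for all variants.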