stored field norm

2012-04-23 Thread Akos Tajti
Dear All,

When indexing an object, I create a document that contains a field called
title, and I set the boost of that field to 60. After indexing was complete,
I checked the document using Luke. The norm field for it contained 40.
Shouldn't this column (the field norm) contain the boost that was set at
indexing time?

Thanks in advance,
Ákos Tajti


Re: stored field norm

2012-04-23 Thread Ian Lea
Look at norm(t,d) in the javadocs for Similarity.  Note use of the
word "encapsulates".  Also note the stuff on loss of precision.


--
Ian.


On Mon, Apr 23, 2012 at 12:11 PM, Akos Tajti  wrote:
> Dear All,
>
> when indexing an object I create a document that contains a field called
> title. I set the boost of that field to 60. After the indexing was complete
> I checked the document using luke. The norm field for it contained 40.
> Shouldn't this column (the field norm) contain the boost that was set at
> indexing time?
>
> Thanks in advance,
> Ákos Tajti

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: stored field norm

2012-04-23 Thread Akos Tajti
Thanks, Ian,

I checked the documentation and it turned out that the length normalization
made the norm so small. I started using SweetSpotSimilarity for that field
and now the scores are OK.
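
For reference, a minimal sketch of where the precision goes, assuming a
Lucene 3.x-era DefaultSimilarity (the encodeNormValue/decodeNormValue names
vary between versions): the field boost is multiplied by the length norm and
then squeezed into a single byte, so a boost of 60 on a short title can come
back as 40.

import org.apache.lucene.search.DefaultSimilarity;

// Sketch only (Lucene 3.x-era API; method names differ in later versions):
// the norm is boost * lengthNorm, encoded into a single lossy byte.
public class NormPrecisionDemo {
    public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();

        float boost = 60f;
        int numTerms = 2;                                        // e.g. a two-term title (assumption)
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));  // default length norm
        float norm = boost * lengthNorm;                         // ~42.4

        byte encoded = sim.encodeNormValue(norm);                // lossy one-byte encoding
        float decoded = sim.decodeNormValue(encoded);            // 40.0, not 60

        System.out.println("raw norm = " + norm + ", decoded = " + decoded);
    }
}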

Ákos



On Mon, Apr 23, 2012 at 1:33 PM, Ian Lea  wrote:

> Look at norm(t,d) in the javadocs for Similarity.  Note use of the
> word "encapsulates".  Also note the stuff on loss of precision.
>
>
> --
> Ian.
>
>
> On Mon, Apr 23, 2012 at 12:11 PM, Akos Tajti  wrote:
> > Dear All,
> >
> > when indexing an object I create a document that contains a field called
> > title. I set the boost of that field to 60. After the indexing was
> complete
> > I checked the document using luke. The norm field for it contained 40.
> > Shouldn't this column (the field norm) contain the boost that was set at
> > indexing time?
> >
> > Thanks in advance,
> > Ákos Tajti
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-23 Thread Earl Hood
On Mon, Apr 23, 2012 at 10:31 AM, Jong Kim wrote:

> Is there any good way to solve this design problem? Obviously, an
> alternative design would be to split the index into two, and maintain
> static (and large) data in one index and the other dynamic part in the
> other index. However, this approach is not acceptable due to our data
> pattern where the match on the first index yields a very large result set,
> and filtering them against the second index is very inefficient due to the
> high ratio of disjoint data. In other words, while the alternate approach
> significantly reduces the indexing-time overhead, the resulting search is
> unacceptably expensive.

Have you tested to verify it is expensive?  If the meta document is
identified with a unique ID (that can be stored with the main document
so you know which meta document to retrieve), accessing the meta
document should be fairly efficient.

In the project I'm on (we are using Lucene 3.0.3), we just use
IndexReader.termDocs() to retrieve a document based on a unique ID we
store in one of the document's fields.
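
A minimal sketch of that lookup, assuming a Lucene 3.x IndexReader and a
hypothetical indexed, non-analyzed field named "id":

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch (Lucene 3.x): fetch a single document by its application-level
// unique ID. The field name "id" is a placeholder; the field must be
// indexed and not analyzed for the exact-term lookup to work.
public static Document findByUniqueId(IndexReader reader, String uniqueId)
        throws java.io.IOException {
    TermDocs td = reader.termDocs(new Term("id", uniqueId));
    try {
        if (td.next()) {                       // at most one hit for a unique ID
            return reader.document(td.doc());  // load the stored fields
        }
        return null;
    } finally {
        td.close();
    }
}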

--ewh

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: delete entries from posting list Lucene 4.0

2012-04-23 Thread Zeynep P.
Hi,

Thanks for the fix. 

I also wonder if you know of any (free) collections for testing pruning
approaches. Almost all the papers use TREC collections, which I don't have.
For now, I use the Reuters-21578 collection and Carmel's Kendall's tau
extension to measure similarity. But I need a collection with relevance
judgements.

Thanks in advance,
Best Regards
ZP

--
View this message in context: 
http://lucene.472066.n3.nabble.com/delete-entries-from-posting-list-Lucene-4-0-tp3838649p3933206.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



PhoneticFilterFactory 's inject parameter

2012-04-23 Thread Elmer van Chastelet

Hi all,

(scroll to bottom for question)

I was setting up a simple web app to play around with phonetic filters.
The idea is simple: I just create a document for each word in the 
English dictionary, each document containing a single search field 
holding the value after it is preprocessed using the following analyzer 
def (in our own DSL syntax, which gets transformed to Java):


analyzer soundslike{
tokenizer = KeywordTokenizer
tokenfilter = LowerCaseFilter
tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
}

I can run the web app and I get results that indeed (in some way) sound 
like the original query term.
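
For reference, a rough plain-Java sketch of that chain, assuming the
DoubleMetaphoneFilter that backs PhoneticFilterFactory's DoubleMetaphone
encoder (package and constructor details vary across Solr/Lucene versions):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.DoubleMetaphoneFilter;

// Sketch of the "soundslike" chain in plain Java (Lucene 3.1-era classes).
// maxCodeLength=4 mirrors the usual factory default; inject=true keeps the
// original token in the stream alongside its phonetic code.
public class SoundsLikeAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new KeywordTokenizer(reader);    // whole value as one token
        ts = new LowerCaseFilter(Version.LUCENE_31, ts);
        ts = new DoubleMetaphoneFilter(ts, 4, true);
        return ts;
    }
}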


But what confuses me is the ranking of the results, given that I set 
the inject param to true. If I search for the query term 'compete', the 
parsed query becomes '(value:KMPT value:compete)', so I expect the word 
'compete' to be ranked higher than any other word, but this wasn't the 
case.


Looking further at the explanation of the results, I saw that the term 
'compete' from the parsed query is completely absent, and only the 
phonetic encoding seems to affect the ranking:


COMPETITOR
  4.368826 = (MATCH) sum of:
    4.368826 = (MATCH) weight(value:KMPT in 3174), product of:
      0.52838135 = queryWeight(value:KMPT), product of:
        8.26832 = idf(docFreq=150, maxDocs=216555)
        0.063904315 = queryNorm
      8.26832 = (MATCH) fieldWeight(value:KMPT in 3174), product of:
        1.0 = tf(termFreq(value:KMPT)=1)
        8.26832 = idf(docFreq=150, maxDocs=216555)
        1.0 = fieldNorm(field=value, doc=3174)

The next thing I did was to run our friend Luke. In Luke, I opened the 
Documents tab and started iterating over some terms for the field 
'value' until I found 'compete'. When I hit 'Show All Docs', the Search 
tab opens and displays the one and only document holding this value 
(i.e. the document representing the word 'compete'). It shows the query 
'value:compete '. But when I hit the search button again (the query is 
still 'value:compete '), it says that there are no results!?


Probably the 'Show All Docs' button does something different from 
performing a query via the Search tab in Luke.


Q: Can somebody explain why the injected original terms seem to get 
ignored at query time? Or could it be related to the name of the search 
field ('value'), or something else?


We use Lucene 3.1 with the Solr analyzers (via Hibernate Search 3.4.2).

-Elmer




Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-23 Thread Jong Kim
Thanks for the reply.

Our metadata is not stored in a single field, but is rather a collection of
fields. So, it requires a boolean search that spans multiple fields. My
understanding is that it is not possible to iterate over the matching
documents efficiently using termDocs() when the search involves multiple
terms and/or multiple fields, right?

/Jong

On Mon, Apr 23, 2012 at 11:58 AM, Earl Hood  wrote:

> On Mon, Apr 23, 2012 at 10:31 AM, Jong Kim wrote:
>
> > Is there any good way to solve this design problem? Obviously, an
> > alternative design would be to split the index into two, and maintain
> > static (and large) data in one index and the other dynamic part in the
> > other index. However, this approach is not acceptable due to our data
> > pattern where the match on the first index yields a very large result set,
> > and filtering them against the second index is very inefficient due to the
> > high ratio of disjoint data. In other words, while the alternate approach
> > significantly reduces the indexing-time overhead, the resulting search is
> > unacceptably expensive.
>
> Have you tested to verify it is expensive?  If the meta document is
> identified with a unique ID (that can be stored with the main document
> so you know which meta document to retrieve), accessing the meta
> document should be fairly efficient.
>
> In the project I'm on (we are using Lucene 3.0.3), we just use
> IndexReader.termDocs() to retrieve a document based on a unique ID we
> store in one of the document's fields.
>
> --ewh
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-23 Thread Brandon Mintern
On Mon, Apr 23, 2012 at 1:25 PM, Jong Kim  wrote:
> Thanks for the reply.
>
> Our metadata is not stored in a single field, but is rather a collection of
> fields. So, it requires a boolean search that spans multiple fields. My
> understanding is that it is not possible to iterate over the matching
> documents efficiently using termDocs() when the search involves multiple
> terms and/or multiple fields, right?
>
> /Jong

You can do this by defining your own hits Collector which simply pulls
the matching ID out of each result. Since searching the second index
returns fewer results, you could do something like this:

Two indexes:
LightWeight - stores metadata fields and document ID
HeavyWeight - stores static data and document ID

Search query:
1. Metadata portion: query LightWeight and retrieve all matching IDs
(NOT Lucene IDs, but your own stored document IDs) into a gnu.trove
TIntSet.

Now some queries won't even hit the second index, and you have your
full match. If you need to match against the 2nd index as well:

2. Pass in the TIntSet as an argument to another Collector.
3. For each match in the HeavyWeight index, if it is also in the
TIntSet, add it to the final TIntSet result set. Otherwise ignore it.
4. After the collector has seen every match, the final result set is
your hits.

You now have the set of document IDs for the complete match. Using
primitives and lightweight objects, this isn't much worse than letting
Lucene do the collection.
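
A minimal sketch of the LightWeight-side collector, assuming Lucene 3.x's
Collector API, Trove 3, and a hypothetical int field named "docId" exposed
through the FieldCache:

import gnu.trove.set.hash.TIntHashSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Sketch (Lucene 3.x): collects the application-level int ID of every hit
// into a TIntHashSet instead of building score docs. The field name "docId"
// and the use of FieldCache are assumptions, not part of the original post.
public class IdCollector extends Collector {
    private final TIntHashSet ids = new TIntHashSet();
    private int[] docIds;  // per-segment FieldCache view of the "docId" field

    @Override
    public void setScorer(Scorer scorer) {
        // scores are not needed; we only want the IDs
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws java.io.IOException {
        docIds = FieldCache.DEFAULT.getInts(reader, "docId");
    }

    @Override
    public void collect(int doc) {
        ids.add(docIds[doc]);  // doc is segment-relative, matching the FieldCache array
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;  // order doesn't matter for set membership
    }

    public TIntHashSet getIds() {
        return ids;
    }
}

The HeavyWeight pass would be a second collector of the same shape that
takes this set in its constructor and, in collect(), only keeps IDs that
are already present in it.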

Of course, this approach only works if the intersection between
metadata and big data is an AND relationship. If you need other logic,
step 3 above obviously changes.

Another caveat: if you are relying on Lucene to store and return the
full document for each query, this approach isn't the best for fetching
information out of Lucene. We use a standard relational database to
store our data, use Lucene to query for sets of document IDs, and then
fetch the remaining document fields from our DB (in some cases, some of
the information lives on S3, etc.).

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org