stored field norm
Dear All,

when indexing an object I create a document that contains a field called title. I set the boost of that field to 60. After the indexing was complete I checked the document using Luke. The norm field for it contained 40. Shouldn't this column (the field norm) contain the boost that was set at indexing time?

Thanks in advance,
Ákos Tajti
Re: stored field norm
Look at norm(t,d) in the javadocs for Similarity. Note use of the word "encapsulates". Also note the stuff on loss of precision.

--
Ian.

On Mon, Apr 23, 2012 at 12:11 PM, Akos Tajti wrote:
> Dear All,
>
> when indexing an object I create a document that contains a field called
> title. I set the boost of that field to 60. After the indexing was complete
> I checked the document using Luke. The norm field for it contained 40.
> Shouldn't this column (the field norm) contain the boost that was set at
> indexing time?
>
> Thanks in advance,
> Ákos Tajti

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
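The "loss of precision" Ian points at comes from the norm being squeezed into a single byte (3 mantissa bits, 5 exponent bits). A minimal self-contained sketch of that scheme, reimplemented here for illustration in the style of Lucene 3.x's SmallFloat.floatToByte315 / byte315ToFloat (not the library class itself):

```java
// Sketch of the lossy one-byte norm encoding: 3 mantissa bits,
// 5 exponent bits. Mirrors the byte315 scheme used by Lucene 3.x's
// Similarity via SmallFloat; reimplemented here for illustration only.
public class NormPrecision {
    // Compress a float into one byte by truncating the mantissa to 3 bits.
    static byte encodeNorm(float f) {
        int bits = Float.floatToIntBits(f);
        int smallfloat = bits >> (24 - 3);              // drop 21 low mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                  // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float decodeNorm(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // A boost of 60 does not survive the round trip intact:
        float roundTripped = decodeNorm(encodeNorm(60.0f));
        System.out.println(roundTripped);  // prints 56.0, not 60.0
    }
}
```

So even before length normalization enters the picture, a boost of 60 cannot be stored exactly.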
Re: stored field norm
Thanks, Ian,

I checked the documentation and it turned out that the length normalization made the norm so small. I started using SweetSpotSimilarity for that field and now the scores are ok.

Ákos

On Mon, Apr 23, 2012 at 1:33 PM, Ian Lea wrote:
> Look at norm(t,d) in the javadocs for Similarity. Note use of the
> word "encapsulates". Also note the stuff on loss of precision.
>
> --
> Ian.
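The numbers in the thread do line up with this explanation: in Lucene 3.x, DefaultSimilarity's lengthNorm is 1/sqrt(numTerms), and the stored norm is boost × lengthNorm, byte-encoded. A hedged sketch, assuming (hypothetically) a two-term title field, with the byte315 encoding reimplemented for illustration:

```java
// Why Luke could show 40 for a boost of 60: with DefaultSimilarity,
// norm = fieldBoost * (1 / sqrt(numTerms)), then squeezed into one byte.
// numTerms = 2 is an assumption for illustration; encode/decode below
// reimplement the 3.x byte315 scheme, they are not Lucene classes.
public class WhyForty {
    static byte encode(float f) {
        int bits = Float.floatToIntBits(f);
        int smallfloat = bits >> 21;                    // keep 3 mantissa bits
        if (smallfloat <= (48 << 3)) return (bits <= 0) ? (byte) 0 : (byte) 1;
        if (smallfloat >= (48 << 3) + 0x100) return -1;
        return (byte) (smallfloat - (48 << 3));
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        return Float.intBitsToFloat(((b & 0xff) << 21) + (48 << 24));
    }

    public static void main(String[] args) {
        float boost = 60.0f;
        int numTerms = 2;                               // hypothetical title length
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        float norm = boost * lengthNorm;                // ~42.43
        System.out.println(decode(encode(norm)));       // prints 40.0
    }
}
```

The truncating encoder maps everything in [40, 44) down to 40, so a two-term title with boost 60 would surface in Luke as exactly 40.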
Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
On Mon, Apr 23, 2012 at 10:31 AM, Jong Kim wrote:
> Is there any good way to solve this design problem? Obviously, an
> alternative design would be to split the index into two, and maintain
> static (and large) data in one index and the other dynamic part in the
> other index. However, this approach is not acceptable due to our data
> pattern where the match on the first index yields a very large result set,
> and filtering them against the second index is very inefficient due to a
> high ratio of disjoint data. In other words, while the alternate approach
> significantly reduces the indexing-time overhead, the resulting search is
> unacceptably expensive.

Have you tested to verify it is expensive? If the meta document is identified with a unique ID (that can be stored with the main document so you know which meta document to retrieve), accessing the meta document should be fairly efficient.

In the project I'm on (we are using Lucene 3.0.3), we just use IndexReader.termDocs() to retrieve a document based on a unique ID we store in one of the document's fields.

--ewh
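The lookup Earl describes can be sketched against the Lucene 3.x API roughly as follows. The field name "uid" and the surrounding class are hypothetical, and this assumes the unique ID was indexed (not only stored):

```java
// Sketch: fetch the single document whose stored unique-ID field
// matches, via IndexReader.termDocs (Lucene 3.x API).
// "uid" is a hypothetical field name; error handling trimmed.
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class MetaLookup {
    /** Returns the document whose "uid" field equals id, or null if absent. */
    static Document byUniqueId(IndexReader reader, String id) throws Exception {
        TermDocs td = reader.termDocs(new Term("uid", id));
        try {
            return td.next() ? reader.document(td.doc()) : null;
        } finally {
            td.close();
        }
    }
}
```

Because termDocs seeks straight to the posting list for one term, this is a cheap point lookup rather than a full search.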
Re: delete entries from posting list Lucene 4.0
Hi,

Thanks for the fix. I also wonder if you know any collection (free ones) to test pruning approaches. Almost all the papers use TREC collections which I don't have!! For now, I use the Reuters21578 collection and Carmel's Kendall's tau extension to measure similarity. But I need a collection with relevance judgements.

Thanks in advance,
Best Regards
ZP
PhoneticFilterFactory 's inject parameter
Hi all,

(scroll to bottom for question)

I was setting up a simple web app to play around with phonetic filters. The idea is simple: I just create a document for each word in the English dictionary, each document containing a single search field holding the value after it is preprocessed using the following analyzer def (in our own DSL syntax, which gets transformed to Java):

analyzer soundslike {
  tokenizer = KeywordTokenizer
  tokenfilter = LowerCaseFilter
  tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
}

I can run the web app and I get results that indeed (in some way) sound like the original query term. But what confuses me is the ranking of the results, knowing that I set the inject param to true. If I search for the query term 'compete', the parsed query becomes '(value:KMPT value:compete)', and therefore I expect the word 'compete' to be ranked higher in the list than any other word, but this wasn't the case. Looking further at the explanation of results, I saw that the term 'compete' in the parsed query is totally absent, and only the phonetic encoding seems to affect the ranking:

* COMPETITOR
  o 4.368826 = (MATCH) sum of:
    + 4.368826 = (MATCH) weight(value:KMPT in 3174), product of:
      # 0.52838135 = queryWeight(value:KMPT), product of:
        * 8.26832 = idf(docFreq=150, maxDocs=216555)
        * 0.063904315 = queryNorm
      # 8.26832 = (MATCH) fieldWeight(value:KMPT in 3174), product of:
        * 1.0 = tf(termFreq(value:KMPT)=1)
        * 8.26832 = idf(docFreq=150, maxDocs=216555)
        * 1.0 = fieldNorm(field=value, doc=3174)

The next thing I did was running our friend Luke. In Luke, I opened the documents tab and started iterating over some terms for the field 'value' until I found 'compete'. When I hit 'Show All Docs', the search tab opens and it displays the one and only document holding this value (i.e. the document representing the word 'compete'). It shows the query: 'value:compete'.

Then, when I hit the search button again (query is still 'value:compete'), it says that there are no results!? Probably the 'Show All Docs' button does something different than performing a query using the search tab in Luke.

Q: Can somebody explain why the injected original terms seem to get ignored at query time? Or may it be related to the name of the search field ('value'), or something else?

We use Lucene 3.1 with Solr analyzers (via Hibernate Search 3.4.2).

-Elmer
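What inject="true" is supposed to buy can be made concrete with a toy simulation: the filter emits the phonetic code and keeps the original token, so both value:KMPT and value:compete should end up indexed. The encoder below is a stub (not the real DoubleMetaphone algorithm, which lives in commons-codec); "KMPT" is simply the code the poster's parsed query shows for "compete", and token position/order details are glossed over:

```java
// Toy simulation of PhoneticFilter's inject behaviour.
// encode() is a STUB standing in for DoubleMetaphone.encode();
// it only knows the one example from the thread.
import java.util.ArrayList;
import java.util.List;

public class InjectSimulation {
    static String encode(String token) {
        // Hypothetical mapping; the real encoder is commons-codec's DoubleMetaphone.
        return token.equals("compete") ? "KMPT" : token.toUpperCase();
    }

    static List<String> phoneticFilter(List<String> tokens, boolean inject) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(encode(t));
            if (inject) out.add(t);  // inject=true keeps the original token too
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(phoneticFilter(List.of("compete"), true));
        // With inject=true both terms survive: [KMPT, compete]
        System.out.println(phoneticFilter(List.of("compete"), false));
        // With inject=false only the code survives: [KMPT]
    }
}
```

If the index-time chain had really injected the originals, an exact match on 'compete' would add a second scoring clause and outrank mere sound-alikes, which is exactly what the poster's explain output shows is not happening.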
Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
Thanks for the reply.

Our metadata is not stored in a single field, but is rather a collection of fields. So it requires a boolean search that spans multiple fields. My understanding is that it is not possible to iterate over the matching documents efficiently using termDocs() when the search involves multiple terms and/or multiple fields, right?

/Jong

On Mon, Apr 23, 2012 at 11:58 AM, Earl Hood wrote:
> Have you tested to verify it is expensive? If the meta document is
> identified with a unique ID (that can be stored with the main document
> so you know which meta document to retrieve), accessing the meta
> document should be fairly efficient.
>
> In the project I'm on (we are using Lucene 3.0.3), we just use
> IndexReader.termDocs() to retrieve a document based on a unique ID we
> store in one of the document's fields.
>
> --ewh
Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
On Mon, Apr 23, 2012 at 1:25 PM, Jong Kim wrote:
> Thanks for the reply.
>
> Our metadata is not stored in a single field, but is rather a collection of
> fields. So, it requires a boolean search that spans multiple fields. My
> understanding is that it is not possible to iterate over the matching
> documents efficiently using termDocs() when the search involves multiple
> terms and/or multiple fields, right?
>
> /Jong

You can do this by defining your own hits Collector which simply pulls the matching ID out of each result. Since searching the second index returns fewer results, you could do something like this:

Two indexes:
LightWeight - stores metadata fields and document ID
HeavyWeight - stores static data and document ID

Search query:
1. Metadata portion: query LightWeight and retrieve all matching IDs (NOT Lucene IDs, but your own stored document ID) into a gnu.trove TIntSet. Now some queries won't even hit the second index, and you have your full match. If you need to match against the 2nd index as well:
2. Pass in the TIntSet as an argument to another Collector.
3. For each match in the HeavyWeight index, if it is also in the TIntSet, add it to the final TIntSet result set. Otherwise ignore it.
4. After the collector has been visited by each match, the final result set is your hits.

You now have the set of document IDs for the complete match. Using primitives and lightweight objects, this isn't much worse than letting Lucene do the collection.

Of course, this approach only works if the intersection between metadata and big data is an AND relationship. If you need other logic, step 3 above obviously changes. Another caveat is that if you are relying on Lucene to store and return the full document for each query, this approach isn't the best for fetching information out of Lucene. We use a standard relational database for storing our data, we use Lucene to query for sets of document IDs, and then we fetch the remaining document fields from our DB (or in some cases, some information lives on S3, etc.).
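The merge step in the recipe above (step 3) is plain set intersection, sketched here with java.util.HashSet for brevity; the poster suggests gnu.trove's TIntSet to stay on primitives, and the IDs below are hypothetical:

```java
// Core of the two-index merge: collect IDs from the LightWeight index,
// then keep only HeavyWeight matches that also appear in that set
// (the AND relationship the poster describes in step 3).
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoIndexMerge {
    static Set<Integer> intersect(Set<Integer> lightWeightIds, List<Integer> heavyWeightMatches) {
        Set<Integer> result = new HashSet<>();
        for (int id : heavyWeightMatches) {
            if (lightWeightIds.contains(id)) {
                result.add(id);  // present in both indexes: part of the final hits
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical application-level document IDs, not Lucene doc IDs.
        Set<Integer> light = new HashSet<>(List.of(1, 2, 5, 9));
        List<Integer> heavy = List.of(2, 3, 9, 11);
        System.out.println(intersect(light, heavy));  // the common IDs: 2 and 9
    }
}
```

In the real setup each input would be filled by a Collector over its index; swapping the `contains` test for other logic covers the non-AND cases the poster mentions.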