Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
On Tuesday, 24.04.2012, 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
> A simple technique is to use "Update Index" for the dynamic data column
> rather than re-indexing the whole document.

Just out of interest, how do you do that?
Re: PhoneticFilterFactory 's inject parameter
Problem solved. Long story short: for some reason I had deleted documents in the index, and the non-deleted documents used the phonetic filter with inject set to false.

Works fine now :)

On 04/23/2012 09:27 PM, Elmer van Chastelet wrote:
> Hi all,
>
> (scroll to bottom for question)
>
> I was setting up a simple web app to play around with phonetic filters. The idea is simple: I create a document for each word in the English dictionary, each document containing a single search field holding the value after it is preprocessed using the following analyzer definition (in our own DSL syntax, which gets transformed to Java):
>
> analyzer soundslike{
>   tokenizer = KeywordTokenizer
>   tokenfilter = LowerCaseFilter
>   tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
> }
>
> I can run the web app and I get results that indeed (in some way) sound like the original query term.
>
> But what confuses me is the ranking of the results, knowing that I set the inject param to true. If I search for the query term 'compete', the parsed query becomes '(value:KMPT value:compete)', and therefore I expect the word 'compete' to be ranked higher than any other word, but this wasn't the case.
>
> Looking further at the explanation of results, I saw that the term 'compete' in the parsed query is totally absent, and only the phonetic encoding seems to affect the ranking:
>
> * COMPETITOR
>   o 4.368826 = (MATCH) sum of:
>     + 4.368826 = (MATCH) weight(value:KMPT in 3174), product of:
>       # 0.52838135 = queryWeight(value:KMPT), product of:
>         * 8.26832 = idf(docFreq=150, maxDocs=216555)
>         * 0.063904315 = queryNorm
>       # 8.26832 = (MATCH) fieldWeight(value:KMPT in 3174), product of:
>         * 1.0 = tf(termFreq(value:KMPT)=1)
>         * 8.26832 = idf(docFreq=150, maxDocs=216555)
>         * 1.0 = fieldNorm(field=value, doc=3174)
>
> The next thing I did was run our friend Luke. In Luke, I opened the documents tab and started iterating over some terms for the field 'value' until I found 'compete'. When I hit 'Show All Docs', the search tab opens and displays the one and only document holding this value (i.e. the document representing the word 'compete'). It shows the query: 'value:compete '. Then, when I hit the search button again (query is still 'value:compete '), it says that there are no results!?
>
> Probably, the 'Show All Docs' button does something different than performing a query using the search tab in Luke.
>
> Q: Can somebody explain why the injected original terms seem to get ignored at query time? Or may it be related to the name of the search field ('value'), or something else?
>
> We use Lucene 3.1 with Solr analyzers (by Hibernate Search 3.4.2).
>
> -Elmer
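The figures in the explain output quoted above can be reproduced by hand. The following is a minimal sketch that assumes the default Lucene 3.x similarity formulas (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq)); the class and method names are made up for illustration. One observation is offered as an inference from the numbers, not as something confirmed in the thread: the quoted queryNorm is what a two-term query would produce if the second term had a docFreq of 0, which would be consistent with 'value:compete' not being found at search time.

```java
// Reproduces the arithmetic of the quoted explain() output.
// Assumption: Lucene 3.x DefaultSimilarity formulas; names are illustrative.
class ExplainCheck {
    // idf = 1 + ln(maxDocs / (docFreq + 1))
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // tf = sqrt(termFreq)
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // queryNorm = 1 / sqrt(sum of squared idf values of the query terms)
    static double queryNorm(double... idfs) {
        double sumOfSquares = 0.0;
        for (double v : idfs) sumOfSquares += v * v;
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Score of a single matching term = queryWeight * fieldWeight
    static double score(int termFreq, int docFreq, int maxDocs,
                        double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(termFreq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idfKmpt = idf(150, 216555);   // phonetic term from the explain output
        double idfAbsent = idf(0, 216555);   // hypothetical term with docFreq = 0
        System.out.println("idf(KMPT)  = " + idfKmpt);
        System.out.println("queryNorm  = " + queryNorm(idfKmpt, idfAbsent));
        System.out.println("score      = " + score(1, 150, 216555, 0.063904315, 1.0));
    }
}
```

If the inference holds, the original term was part of the query all along; it simply matched nothing in the index in the form it was searched for.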
Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
There's no update-in-place; currently you _have_ to re-index the entire document.

But to the original question: there is a "limited join" capability you might investigate that would allow you to split up the textual data and metadata into two different documents and join them. I don't know how well it scales, but it may fit your needs.

It turns out that update-in-place is more than a bit difficult given the nature of the inverted index. There are some proposals for addressing this, but nothing has gotten beyond the design stage as far as I know.

Best
Erick

On Wed, Apr 25, 2012 at 3:07 AM, Torsten Krah wrote:
> On Tuesday, 24.04.2012, 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
>> A simple technique is to use "Update Index" for the dynamic data column
>> rather than re-indexing the whole document.
>
> Just out of interest, how do you do that?

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
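The idea of splitting textual data and metadata into two documents that share a key can be pictured with a toy model. This is only a sketch in plain Java: it is not Lucene's join support, and every class, method, and field name in it is hypothetical. It shows why the split helps, namely that changing the small metadata record never requires re-analyzing the large text.

```java
import java.util.*;

// Toy illustration only, NOT Lucene's join API; all names are hypothetical.
// Each record is split into a static text part and a small metadata part
// that share a key.
class SplitDocs {
    // term -> keys of text documents containing it (analyzed once)
    private final Map<String, Set<String>> textIndex = new HashMap<>();
    // key -> current metadata value (cheap to overwrite)
    private final Map<String, String> metadata = new HashMap<>();

    void addText(String key, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            textIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(key);
        }
    }

    void setMetadata(String key, String status) {
        metadata.put(key, status); // the large text is not touched
    }

    // "Join": text hits restricted to keys whose metadata matches.
    List<String> search(String term, String status) {
        List<String> hits = new ArrayList<>();
        for (String key : textIndex.getOrDefault(term, Collections.emptySet())) {
            if (status.equals(metadata.get(key))) hits.add(key);
        }
        return hits;
    }
}
```

How well a real join performs on a large index is a separate question that this sketch does not address.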
Re: PhoneticFilterFactory 's inject parameter
I keep replying to myself; it all gets a bit confusing.
The problem still exists and I don't understand why, or why it worked once.

I have the same behavior again as posted in my first mail:
- Inject parameter is set to true.
- The index has _no deleted documents_ and is optimized.
- The term 'compete' is in there.
- If I ask Luke to show all docs for term 'compete', it shows me the one and only document that represents this word. But...
- If I perform the query 'value:compete' in Luke again, it says there are no results.

Here is the index I'm currently using. It contains various fields for the available phonetic filter encoders:
https://www.box.com/s/34212e82227e102f6734

Can somebody explain this behavior? What's the real use of the inject parameter of the PhoneticFilterFactory?

Thanks in advance.

-Elmer
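On the question of what the inject parameter is for: with inject set to true the filter keeps the original token alongside its phonetic code, and with inject set to false it emits only the code. The sketch below models that in plain Java. The encoder is a crude stand-in that merely drops vowels (it is not Double Metaphone), and all names are invented for the example.

```java
import java.util.*;

// Toy model of a phonetic filter's inject flag.
// toyEncode is a crude stand-in (drops vowels after the first letter),
// NOT Double Metaphone; all names here are hypothetical.
class InjectDemo {
    static String toyEncode(String token) {
        if (token.isEmpty()) return "";
        String upper = token.toUpperCase();
        return upper.charAt(0) + upper.substring(1).replaceAll("[AEIOU]", "");
    }

    // inject=true: emit the code and keep the original token.
    // inject=false: emit only the code.
    static List<String> filter(String token, boolean inject) {
        List<String> out = new ArrayList<>();
        out.add(toyEncode(token));
        if (inject) out.add(token);
        return out;
    }

    // Number of query terms that also occur among a document's indexed terms.
    static int matchingTerms(List<String> queryTerms, List<String> docTerms) {
        int n = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> query = filter("compete", true);
        // Exact word indexed with inject=true: both query terms match.
        System.out.println(matchingTerms(query, filter("compete", true)));
        // Sound-alike word: only the phonetic code matches.
        System.out.println(matchingTerms(query, filter("compote", true)));
        // Exact word indexed with inject=false: only the code matches.
        System.out.println(matchingTerms(query, filter("compete", false)));
    }
}
```

Under this model the exact word can only outrank its sound-alikes when the original token was injected at index time and the query term equals that indexed token exactly.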
Re: PhoneticFilterFactory 's inject parameter
You seem to be quietly going round in circles, by yourself! I suggest a small self-contained program/test case with a RAM index created from scratch. You can then experiment with inject on or off and if you still can't figure it out, post the code and hopefully someone will be able to help you make sense of it. Make sure you tell us what version of Lucene you are using. If not the latest, wouldn't hurt to try with the latest.

--
Ian.
Re: PhoneticFilterFactory 's inject parameter
Thanks for your suggestion, Ian, but I just found out that if I replace the KeywordTokenizer with a WhitespaceTokenizer, everything seems to work fine.

Just to test what happens, I created another field 'orig', using this analyzer:

analyzer KeywordLowered{
  tokenizer = KeywordTokenizer
  tokenfilter = LowerCaseFilter
}

Guess what: exactly the same problem, also in Luke. It finds no documents for the query

orig:strange

while the term 'strange' is in the index for the field 'orig'.

Does anybody have a clue why documents are not matched when using the KeywordTokenizer? Remember that none of the queries or terms contain whitespace.

Thanks again.

-Elmer
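One possible explanation, offered purely as an untested guess and not confirmed anywhere in the thread: if the dictionary words were read with a stray trailing character (a space or a carriage return, say), a keyword-style tokenizer would index that character as part of the term, while a whitespace-style tokenizer would discard it. The trailing space in the query Luke displays ('value:compete ') would fit that picture. The sketch below uses plain-Java stand-ins for the two tokenizers, not the Lucene classes, and the input string is hypothetical.

```java
import java.util.*;

// Toy stand-ins for the two tokenizers, not the Lucene classes.
class TokenizerDemo {
    // Keyword-style: the entire input becomes one token, untouched.
    static List<String> keywordTokens(String input) {
        return Collections.singletonList(input);
    }

    // Whitespace-style: split on runs of whitespace, dropping empty pieces.
    static List<String> whitespaceTokens(String input) {
        List<String> out = new ArrayList<>();
        for (String piece : input.split("\\s+")) {
            if (!piece.isEmpty()) out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input: a word read together with a stray line ending.
        String line = "compete\r";
        System.out.println(keywordTokens(line).contains("compete"));    // false
        System.out.println(whitespaceTokens(line).contains("compete")); // true
    }
}
```

A quick way to check the guess would be to print the length of each indexed term, or to trim the input before indexing and see whether the KeywordTokenizer field then matches.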
Re: lucene algorithm ?
Additionally, does anybody know roughly how Google does fast ranking for multi-term queries with AND? (Of course the details are a secret, but I guess the main ideas should be common enough these days.) If their postings are sorted in PageRank order, then it's understandable that a single-term query would quickly return the top-k. But for a multi-term query, they would have to traverse the entire lists to find the intersection set, because the lists are not sorted by docId, as they are in the Lucene paper.

On Wed, Apr 25, 2012 at 2:13 PM, Yang wrote:
> I read the paper by Doug, "Space optimizations for total ranking".
>
> Since it was written a long time ago, I wonder what algorithms Lucene uses (regarding postings list traversal, score calculation, and ranking).
>
> In particular, the total ranking algorithm described there needs to traverse the entire postings list for every query term. For a query made of very common terms, like "yellow dog", either of the two terms may have a very long postings list in the case of web search. Are they all really traversed in current Lucene/Solr, or are any heuristics to truncate the list actually employed?
>
> In the case of returning top-k results, I can understand that partitioning the postings list across multiple machines and then combining the top-k from each would work. But if we are required to return "the 100th result page", i.e. results ranked 991st to 1000th, then each partition would still have to find its top 1000, so partitioning would not help much.
>
> Overall, are there any up-to-date, detailed docs on the internal algorithms of Lucene?
>
> Thanks a lot
> Yang
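Two points from this message can be made concrete. The sketch below, in plain Java with made-up names, is not Lucene's scorer code. It shows (1) why docId-sorted postings make AND queries cheap, since a two-pointer walk visits each list at most once, and (2) why deep paging limits the benefit of partitioning, since every partition has to supply page × pageSize hits before the merge.

```java
import java.util.*;

// Illustrative sketch only; not Lucene's actual scorer or merge code.
class Postings {
    // Two-pointer intersection of docId-sorted postings lists.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return out;
    }

    // Hits each partition must return so the merged list is guaranteed to
    // contain global ranks (page-1)*pageSize+1 .. page*pageSize.
    static int hitsNeededPerPartition(int page, int pageSize) {
        return page * pageSize;
    }

    // Merge per-partition score lists (each sorted descending) and cut one page.
    static List<Double> page(List<List<Double>> partitions, int page, int pageSize) {
        int need = hitsNeededPerPartition(page, pageSize);
        List<Double> all = new ArrayList<>();
        for (List<Double> p : partitions) {
            all.addAll(p.subList(0, Math.min(need, p.size())));
        }
        all.sort(Collections.reverseOrder());
        int from = Math.min((page - 1) * pageSize, all.size());
        int to = Math.min(page * pageSize, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

For page 100 at 10 results per page, hitsNeededPerPartition gives 1000, which is the figure the quoted message arrives at.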
Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index
Hi,

>> "Update Index" for the dynamic data

I have done this in the past; it worked for me a long time ago. All you need is a piece of code that searches for and finds the specific document within the index (probably using the document's unique name), then deletes it and inserts the fresh document alone. For a large set of documents, this needs to be done in a loop.

With regards,
Karthik

On Wed, Apr 25, 2012 at 12:37 PM, Torsten Krah <tk...@fachschaft.imn.htwk-leipzig.de> wrote:
> On Tuesday, 24.04.2012, 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
> > A simple technique is to use "Update Index" for the dynamic data column
> > rather than re-indexing the whole document.
>
> Just out of interest, how do you do that?

--
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*
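The approach described here (find the document by its unique name, delete it, add the fresh version) still re-indexes the whole document; it only avoids rebuilding the rest of the index. A toy inverted index in plain Java, with hypothetical names throughout, makes the steps visible. In Lucene itself, IndexWriter.updateDocument(Term, Document) performs the same delete-then-add in one call.

```java
import java.util.*;

// Toy inverted index showing "find by unique key, delete, re-add".
// All names are hypothetical; this is not Lucene code.
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> doc keys
    private final Map<String, String> stored = new HashMap<>();        // key -> full text

    void add(String key, String text) {
        stored.put(key, text);
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(key);
        }
    }

    void delete(String key) {
        if (stored.remove(key) == null) return;
        // The old terms can sit anywhere in the postings, which is why
        // changing a single field in place is awkward.
        postings.values().forEach(keys -> keys.remove(key));
    }

    // "Update" = delete the old document, then add the complete new one.
    void update(String key, String newText) {
        delete(key);
        add(key, newText);
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

For a large batch, update would simply be called once per changed document inside a loop.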