Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-25 Thread Torsten Krah
On Tuesday, 24.04.2012, at 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
> A simple technique is to use "Update Index" for the dynamic data
> column rather than re-indexing the whole document.

Just out of interest, how do you do that?




Re: PhoneticFilterFactory 's inject parameter

2012-04-25 Thread Elmer van Chastelet
Problem solved. Long story short: for some reason I had deleted 
documents in the index and the non-deleted documents used the phonetic 
filter with inject set to false.


Works fine now :)

On 04/23/2012 09:27 PM, Elmer van Chastelet wrote:

Hi all,

(scroll to bottom for question)

I was setting up a simple web app to play around with phonetic filters.
The idea is simple, I just create a document for each word in the 
English dictionary, each document containing a single search field 
holding the value after it is preprocessed using the following 
analyzer definition (in our own DSL syntax, which gets transformed to Java):


analyzer soundslike{
   tokenizer = KeywordTokenizer
   tokenfilter = LowerCaseFilter
   tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
}

I can run the web app and I get results that indeed (in some way) 
sound like the original query term.


But what confuses me is the ranking of the results, given that I set 
the inject param to true. If I search for the query term 'compete', 
the parsed query becomes '(value:KMPT value:compete)', so I expect the 
word 'compete' to be ranked higher than any other word, but this 
wasn't the case.
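
To make the expectation concrete, here is a minimal sketch of what inject="true" is meant to do: emit the original token alongside its phonetic code, so an exact match shares two terms with the query while a merely similar-sounding word shares one. The lookup table is a stand-in for a real Double Metaphone encoder (both codes are the KMPT value seen in this thread), and the function names are made up for illustration.

```python
# Toy model of a phonetic token filter with an "inject" switch.
# CODES is a stand-in lookup table, NOT a real Double Metaphone encoder;
# both entries use the KMPT code that appears in this thread.
CODES = {"compete": "KMPT", "competitor": "KMPT"}


def phonetic_filter(token, inject):
    """Return the terms produced for one lower-cased keyword token."""
    code = CODES[token]
    # With inject on, the original token is kept next to its phonetic code.
    return [code, token] if inject else [code]


def matching_terms(query_word, doc_word, inject):
    """Count how many query terms also occur among the document's terms."""
    query_terms = set(phonetic_filter(query_word, inject))
    doc_terms = set(phonetic_filter(doc_word, inject))
    return len(query_terms & doc_terms)


print(matching_terms("compete", "compete", inject=True))     # 2
print(matching_terms("compete", "competitor", inject=True))  # 1
```

With inject off, both documents share only the phonetic code with the query, so nothing distinguishes the exact word from its sound-alikes.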


Looking further at the explanation of the results, I saw that the term 
'compete' from the parsed query is totally absent, and only the phonetic 
encoding seems to affect the ranking:


COMPETITOR
4.368826 = (MATCH) sum of:
  4.368826 = (MATCH) weight(value:KMPT in 3174), product of:
    0.52838135 = queryWeight(value:KMPT), product of:
      8.26832 = idf(docFreq=150, maxDocs=216555)
      0.063904315 = queryNorm
    8.26832 = (MATCH) fieldWeight(value:KMPT in 3174), product of:
      1.0 = tf(termFreq(value:KMPT)=1)
      8.26832 = idf(docFreq=150, maxDocs=216555)
      1.0 = fieldNorm(field=value, doc=3174)
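
The numbers in this explanation can be re-derived by hand, which confirms that only the KMPT clause contributes to the score. The sketch below assumes the idf formula 1 + ln(maxDocs / (docFreq + 1)), which is my understanding of Lucene 3.x's DefaultSimilarity; queryNorm, tf and fieldNorm are copied from the output rather than computed.

```python
import math


def idf(doc_freq, max_docs):
    # Assumed formula (DefaultSimilarity in Lucene 3.x, as far as I know):
    # idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


idf_kmpt = idf(150, 216555)       # explain output reports 8.26832
query_norm = 0.063904315          # copied from the explain output
tf = 1.0                          # termFreq = 1
field_norm = 1.0                  # copied from the explain output

query_weight = idf_kmpt * query_norm        # reported as 0.52838135
field_weight = tf * idf_kmpt * field_norm   # reported as 8.26832
score = query_weight * field_weight         # reported as 4.368826

print(idf_kmpt, query_weight, field_weight, score)
```

Since the total equals the weight of the single KMPT clause, no clause for value:compete matched this document.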

The next thing I did was running our friend Luke. In Luke, I opened 
the documents tab, and started iterating over some terms for the field 
'value' until I found 'compete'. When I hit 'Show All Docs', the 
search tab opens and it displays the one and only document holding 
this value (i.e. the document representing the word 'compete'). It 
shows the query: 'value:compete '. Then, when I hit the search button 
again (query is still 'value:compete '), it says that there are no 
results !?


Probably, the 'Show All Docs' button does something different than 
performing a query using the search tab in Luke.


Q: Can somebody explain why the injected original terms seem to get 
ignored at query time? Or might it be related to the name of the search 
field ('value'), or something else?


We use Lucene 3.1 with Solr analyzers (via Hibernate Search 3.4.2).

-Elmer






Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-25 Thread Erick Erickson
There's no update-in-place; currently you _have_ to re-index the
entire document.

But to the original question:

There is a "limited join" capability you might investigate that would
allow you to split up the textual data and metadata into two different
documents and join them. I don't know how well it scales, but it may
fit your needs.
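
As a rough illustration of the split-and-join idea (this is not the API of the join feature itself), the sketch below keeps the large textual data and the small, frequently changing metadata in two separate collections that share an id, and combines them at query time. All names and sample data are hypothetical.

```python
# Two hypothetical "indexes" keyed by a shared id field.
text_docs = {            # large, rarely changing text
    "doc1": "lucene in action",
    "doc2": "phonetic filters in lucene",
}
meta_docs = {            # small, frequently changing metadata
    "doc1": {"status": "draft"},
    "doc2": {"status": "published"},
}


def search(term, status):
    """Ids of text documents containing `term` whose metadata has `status`."""
    allowed = {doc_id for doc_id, meta in meta_docs.items()
               if meta["status"] == status}
    return sorted(doc_id for doc_id, body in text_docs.items()
                  if doc_id in allowed and term in body.split())


print(search("lucene", "published"))         # ['doc2']
meta_docs["doc1"] = {"status": "published"}  # re-index only the small document
print(search("lucene", "published"))         # ['doc1', 'doc2']
```

The point of the split is that a metadata change rewrites only the small document; the text document is never touched.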

It turns out that update-in-place is more than a bit difficult given the
nature of the inverted index. There are some proposals for addressing
this, but nothing has gotten beyond the design stage as far as I know.

Best
Erick


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: PhoneticFilterFactory 's inject parameter

2012-04-25 Thread Elmer van Chastelet

I keep replying to myself, it all gets a bit confusing.
The problem still exists and I don't understand why, and why it worked once.

I have the same behavior again as posted in my first mail:
- Inject parameter is set to true.
- The index has _no deleted documents_ and is optimized.
- The term 'compete' is in there.
- If I ask Luke to show all docs for term 'compete' it shows me the one 
and only document that represents this word. But...
- If I perform the query 'value:compete' in Luke again, it says there 
are no results.


Here is the index I'm currently using. It contains various fields for 
the available phonetic filter encoders:

https://www.box.com/s/34212e82227e102f6734

Can somebody explain this behavior? What's the real use of the inject 
parameter of the PhoneticFilterFactory?


Thanks in advance.

-Elmer




Re: PhoneticFilterFactory 's inject parameter

2012-04-25 Thread Ian Lea
You seem to be quietly going round in circles, by yourself!  I suggest
a small self-contained program/test case with a RAM index created from
scratch.  You can then experiment with inject on or off and if you
still can't figure it out, post the code and hopefully someone will be
able to help you make sense of it.

Make sure you tell us what version of Lucene you are using.  If it is
not the latest, it wouldn't hurt to try with the latest.


--
Ian.





Re: PhoneticFilterFactory 's inject parameter

2012-04-25 Thread Elmer van Chastelet
Thanks for your suggestion Ian, but I just found out that if I replace 
the KeywordTokenizer with a WhitespaceTokenizer, all seems to work fine.


Just to test what happens, I created another field 'orig', using this 
analyzer:

analyzer KeywordLowered{
   tokenizer = KeywordTokenizer
   tokenfilter = LowerCaseFilter
}

Guess what: exactly the same problem, also in Luke.
It finds no documents for the query:
orig:strange
even though the term 'strange' is in the index for the field 'orig'.

Does anybody have a clue why documents are not matched when using the 
KeywordTokenizer? Remember that none of the queries or terms contain 
whitespace.
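
For reference, the two tokenizers differ in one way that matters here: a keyword tokenizer emits the entire input as a single token, whitespace included, while a whitespace tokenizer splits on whitespace and discards it. The sketch below models that difference with plain functions (they are not the Lucene classes). The stray newline in the input is purely hypothetical; nothing in this thread establishes that this is the actual cause.

```python
def keyword_tokenize(text):
    # Whole input becomes exactly one token; nothing is trimmed.
    return [text]


def whitespace_tokenize(text):
    # Split on runs of whitespace; the whitespace itself is discarded.
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


raw = "Strange\n"  # hypothetical input carrying a stray newline

print(lowercase(keyword_tokenize(raw)))     # ['strange\n']
print(lowercase(whitespace_tokenize(raw)))  # ['strange']
```

Under that assumption, the term stored by the keyword variant would not equal the query term 'strange', whereas the whitespace variant would match.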



Thanks again.
-Elmer



Re: lucene algorithm ?

2012-04-25 Thread Yang
Additionally, does anybody know roughly (of course the details are a secret,
but I guess the main ideas should be common enough these days) how Google
does fast ranking for multi-term queries with AND?
(If their postings are sorted in PageRank order, then it's understandable
that a single-term query would quickly return the top-k, but for a
multi-term query they would have to traverse the entire lists to find the
intersection set, because the lists are not sorted by docId as in the
Lucene paper's case.)
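
For contrast, when postings lists are sorted by docId, an AND query can intersect them with a single linear merge. A minimal sketch, assuming plain Python lists of integer docIds (the function name is illustrative, and this is not Lucene's actual implementation):

```python
def intersect(postings_a, postings_b):
    """Intersect two docId-sorted postings lists in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            result.append(a)   # document contains both terms
            i += 1
            j += 1
        elif a < b:
            i += 1             # advance the list that is behind
        else:
            j += 1
    return result


yellow = [2, 5, 9, 14, 21]
dog = [1, 5, 6, 14, 30]
print(intersect(yellow, dog))  # [5, 14]
```

This merge only works because both lists share the same sort key. With lists ordered by a per-term or static rank instead, there is no common order to advance along, which is exactly the difficulty raised above.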



On Wed, Apr 25, 2012 at 2:13 PM, Yang  wrote:

> I read the paper by Doug, "Space optimizations for total ranking".
>
> Since it was written a long time ago, I wonder what algorithms Lucene uses
> (regarding postings list traversal, score calculation, and ranking).
>
>
> In particular, the total ranking algorithm described there needs to traverse
> the entire postings list for every query term,
> so for very common query terms like "yellow dog", either of the 2
> terms may have a very long postings list in the case of web search.
> Are they all really traversed in current Lucene/Solr, or are any heuristics
> to truncate the lists actually employed?
>
> In the case of returning top-k results, I can understand that partitioning
> the postings lists across multiple machines and then combining the top-k
> from each would work,
> but if we are required to return "the 100th result page", i.e. results
> ranked 991st to 1000th, then each partition would still have to find
> its top 1000, so
> partitioning would not help much.
>
>
> Overall, are there any up-to-date detailed docs on the internal algorithms
> of Lucene?
>
> Thanks a lot
> Yang
>
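
The deep-paging concern in the quoted message can be sketched as follows: to serve page p with k results per page, every partition has to return its own top p*k hits, because any of them could end up in the global top p*k. The code assumes each partition is a list of (score, doc_id) tuples; all names and sample data are hypothetical.

```python
import heapq


def top_n(partition, n):
    """Best n hits of one partition; hits are (score, doc_id) tuples."""
    return heapq.nlargest(n, partition)


def result_page(partitions, page_number, page_size):
    """Merge per-partition results and return one 1-based result page."""
    needed = page_number * page_size          # e.g. 100 * 10 = 1000
    candidates = []
    for partition in partitions:
        # Each partition must contribute up to `needed` hits, not just page_size.
        candidates.extend(top_n(partition, needed))
    merged = heapq.nlargest(needed, candidates)
    return merged[needed - page_size:]


partitions = [
    [(0.9, 1), (0.5, 2), (0.1, 3)],
    [(0.8, 4), (0.7, 5), (0.2, 6)],
]
print(result_page(partitions, page_number=2, page_size=2))  # [(0.7, 5), (0.5, 2)]
```

Partitioning still divides the scoring work, but the amount of data each partition ships to the merger grows with the page depth.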


Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-25 Thread KARTHIK SHIVAKUMAR
Hi,

>> "Update Index" for the dynamic data

I have done this in the past; it worked for me a long time ago.

All you need is a piece of code that searches for and finds the specific
document within the index (probably using a unique name for the document),
then deletes it and inserts the fresh version of that document alone.

For a large set of documents, all of this needs to be done in a loop.
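
The delete-then-reinsert approach described above can be sketched with a toy inverted index. This is a model for illustration only, not Lucene code: the class and method names are invented, and whitespace splitting stands in for an analyzer. In Lucene itself, IndexWriter.updateDocument(Term, Document) performs the same delete-by-term followed by an add. Either way, the whole document is re-indexed, not just the changed field.

```python
class TinyIndex:
    """Toy inverted index; a stand-in for a real Lucene index."""

    def __init__(self):
        self.docs = {}       # internal doc number -> stored fields
        self.postings = {}   # (field, term) -> set of doc numbers
        self.next_doc = 0

    def add(self, fields):
        doc = self.next_doc
        self.next_doc += 1
        self.docs[doc] = fields
        for field, value in fields.items():
            for term in value.split():
                self.postings.setdefault((field, term), set()).add(doc)
        return doc

    def delete_by_term(self, field, term):
        # Copy to a list first: the postings sets are modified while deleting.
        for doc in list(self.postings.get((field, term), ())):
            for f, value in self.docs.pop(doc).items():
                for t in value.split():
                    self.postings[(f, t)].discard(doc)

    def update(self, key_field, key, fields):
        """Delete every document with the given unique key, then add the new one."""
        self.delete_by_term(key_field, key)
        return self.add(fields)

    def search(self, field, term):
        return [self.docs[d] for d in sorted(self.postings.get((field, term), ()))]


index = TinyIndex()
index.add({"id": "42", "body": "old text", "views": "10"})
# Only "views" changed, yet every field has to be supplied again.
index.update("id", "42", {"id": "42", "body": "old text", "views": "11"})
print(index.search("id", "42"))
```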




with regards
karthik





-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*