MultiSearcher query with Sort option

2009-04-10 Thread Preetham Kajekar

Hi,
I am using a MultiSearcher to search two indexes. As part of my query, I 
am sorting the results based on a field (which is NOT_ANALYZED). 
However, I seem to be getting hits from only one of the indexes. If I 
change to Sort.INDEX_ORDER, I seem to get results from both. Is 
this a known problem?


Thanks,
~preetham

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: MultiSearcher query with Sort option

2009-04-10 Thread Uwe Schindler
Hello Preetham,

never heard of this. What Lucene version do you use?
To check, try the search in a different way:
instead of combining the two indexes in a MultiSearcher, open an
IndexReader for each index and combine both readers into a MultiReader. This
MultiReader can be used like a conventional single index and searched with
an IndexSearcher. If the error then disappears, there may be a bug. If not,
something is wrong with your indexes.

I always recommend using MultiSearcher only in distributed or parallel
search scenarios, never just for combining two indexes.
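As a rough sketch (Lucene 2.4-era API; the index paths and the sort field name are hypothetical, and `query` is assumed to be built elsewhere), the MultiReader approach looks like this:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Open one IndexReader per index and combine them into a MultiReader,
// which behaves like a single index for searching.
IndexReader r1 = IndexReader.open("/path/to/index1");
IndexReader r2 = IndexReader.open("/path/to/index2");
IndexReader multi = new MultiReader(new IndexReader[] { r1, r2 });
IndexSearcher searcher = new IndexSearcher(multi);

// Same sorted search as before, but against the combined reader.
TopDocs hits = searcher.search(query, null, 10,
    new Sort(new SortField("myField", SortField.STRING)));
```

If the sorted results now come from both indexes, the problem is in MultiSearcher rather than in the indexes themselves.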

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de







Re: SpellChecker AlreadyClosedException issue

2009-04-10 Thread John Cherouvim

dir is a local variable inside a method, so it's not getting reused.
Should I synchronise the whole method? I think that would slow things 
down in a concurrent environment.
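One common alternative to synchronizing the whole method (a hedged sketch, not necessarily the fix for this particular exception: the class name, path, and wiring are hypothetical) is to open the Directory and SpellChecker once and share that instance across threads, so no request ever closes a Directory another request is using:

```java
import java.io.IOException;

import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellingService {
    // Opened once at startup; suggestSimilar() can then be called from
    // many threads without each request opening/closing the Directory.
    private final SpellChecker spell;

    public SpellingService(String dictionaryPath) throws IOException {
        Directory dir = FSDirectory.getDirectory(dictionaryPath);
        this.spell = new SpellChecker(dir);
    }

    public String[] suggest(String word, int max) throws IOException {
        return spell.suggestSimilar(word, max);
    }
}
```

With this layout the AlreadyClosedException cannot arise from the spell-checking path, because nothing closes the Directory while the service is alive.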


Thanks for your response.


Chris Hostetter wrote:

: My code looks like this:
: 
: Directory dir = null;

: try {
:dir = FSDirectory.getDirectory("/path/to/dictionary");
:SpellChecker spell = new SpellChecker(dir); // exception thrown here
:// ...
:dir.close();

: This code works, but in a highly concurrent situation AlreadyClosedException
: is being thrown when I try to instantiate the SpellChecker:
: org.apache.lucene.store.AlreadyClosedException: this Directory is closed

if an error only happens under high concurrent load, it suggests that 
perhaps you have multiple threads attempting to close the directory.  you 
haven't clarified whether "dir" is a local variable inside a method, or an 
instance variable in an object which is getting reused by multiple 
threads -- so it's hard to guess.


: I use lucene-core-2.4.1.jar and lucene-spellchecker-2.4.1.jar and I can
: reproduce the error in both windows and linux.

if you have a fully executable test case (instead of just an incomplete 
partial snippet) that you can share, people may be able to spot the 
problem, or at the very least run the test themselves to reproduce.



-Hoss




--
Ioannis Cherouvim
Software Engineer
mail: j...@eworx.gr
web: www.eworx.gr





RE: MultiSearcher query with Sort option

2009-04-10 Thread Uwe Schindler
Reverse index order should be possible. Do not use Sort.INDEX_ORDER;
instead, create a SortField in index order with the reverse parameter set,
and wrap that SortField inside a Sort instance. I am not sure it works,
but it should. The same applies to score.
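In Lucene 2.4 that might look like the following (an untested sketch; the search call is illustrative and assumes `searcher` and `query` exist elsewhere):

```java
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Reverse index order: a DOC-type SortField with reverse = true.
// The field name is null because document order is not tied to a field.
Sort reverseIndexOrder = new Sort(new SortField(null, SortField.DOC, true));

// Usage sketch:
// TopDocs hits = searcher.search(query, null, 10, reverseIndexOrder);
```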

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




Re: Exceptions in merge thread (while optimizing) causing problems with subsequent reopens

2009-04-10 Thread Michael McCandless
Actually it's perfectly fine for two threads to enter that code
fragment (you obtain a write lock to protect the code so that "there
can be only one").

Second, even if you didn't have your write lock, the code should
still be safe in that no index corruption is possible.  Multiple
threads may call optimize(), commit() etc. on an IndexWriter without
harm.

The reopen code is also "safe" (it will not cause corruption), but you
may accidentally have readers that you fail to close, or close readers
that are in use by in-flight searches.  I'd recommend using the
"lia.admin.SearcherManager" class from the upcoming Lucene in Action
revision (it's in the book's source code, which you can download from
http://www.manning.com/hatcher3/LIAsourcecode.zip) to manage/reopen
the searcher.
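The reopen-and-swap pattern that such a manager class encapsulates looks roughly like this (a hedged sketch: `searcher` is assumed to be an instance field, and a production version must also defer closing the old reader until in-flight searches have released it):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Reopen the reader; only if it actually changed, swap the searcher
// reference and then close the old searcher and reader.
synchronized void maybeReopen() throws IOException {
    IndexReader oldReader = searcher.getIndexReader();
    IndexReader newReader = oldReader.reopen();
    if (newReader != oldReader) {
        IndexSearcher oldSearcher = searcher;
        searcher = new IndexSearcher(newReader);
        oldSearcher.close();
        oldReader.close();
    }
}
```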

But one surefire way to cause index corruption is if two separate
IndexWriters are open on the same index.  This is normally not easy to
do, since Lucene protects itself with the write lock in the index
directory.  So if you 1) turn off this locking (e.g. use NoLockFactory),
and 2) accidentally allow two writers at once on the same index,
you'll get corruption.

So I'm not sure that we've actually explained your corruption?

Mike

On Fri, Apr 10, 2009 at 12:42 AM, Khawaja Shams  wrote:
> Mike,
>  I am sorry for wasting your time :). There were indeed two threads that
> were performing this operation. Out of curiosity, which part of this is not
> thread safe? An indexreader reopening while a commit is going on? Thanks
> again for your help.
>
> Regards,
> Khawaja
>
>
>
> On Thu, Apr 9, 2009 at 5:44 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> That code looks right.  Are there multiple threads that may enter it?
>>
>> Can you show the code where you create the IndexWriter, add docs, etc?
>>
>> Can you call IndexWriter.setInfoStream for the entire life of the
>> index, up until when the optimize error happens, and post back?
>>
>> Mike
>>
>> On Thu, Apr 9, 2009 at 8:33 PM, Khawaja Shams  wrote:
>> > Hi Michael,
>> >  Thanks for the quick response. I only have one IndexWriter, and there
>> are
>> > no other processes accessing this particular index. I have tried deleting
>> > the entire index and reconstructing it, but the index corruption is
>> > repeatable. Incidentally, there are no new writes since the last commit
>> when
>> > the merge happens. I have over-padded my code with ReadWrite locks to
>> make
>> > sure that no writes/read are happening between the commits,
>> optimizations,
>> > and reopening of the index.
>> >
>> >
>> > Here is a snippet of the thread I use to maintain the Index ( I hope that
>> I
>> > am not doing something terribly wrong):
>> >            while (true) {
>> >                try {
>> >                    getWriteLock();
>> >                    indexWriter.commit();
>> >                   if (shouldOptimize()) {
>> >                       indexWriter.optimize();
>> >                    }
>> >
>> >                    IndexReader oldIR = indexSearcher.getIndexReader();
>> >                    IndexReader ir = oldIR.reopen();
>> >                    if (ir != oldIR) {
>> >                        IndexSearcher oldIS = indexSearcher;
>> >                        indexSearcher = new IndexSearcher(ir);
>> >                        oldIS.close();
>> >                        oldIR.close();
>> >                    }
>> >                } catch (Throwable t) {
>> >                    trace.error(t, t);
>> >                } finally {
>> >                    releaseWriteLock();
>> >                }
>> >            }
>> >
>> >
>> > Regards,
>> > Khawaja
>> >
>> > On Thu, Apr 9, 2009 at 5:05 PM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> These are serious corruption exceptions.
>> >>
>> >> Is it at all possible two writers are accessing the index at the same
>> time?
>> >>
>> >> Can you describe more about how you're using Lucene?
>> >>
>> >> Mike
>> >>
>> >> On Thu, Apr 9, 2009 at 7:59 PM, Khawaja Shams 
>> wrote:
>> >> > Hello,
>> >> >  I am having a problem with reopening the IndexReader with Lucene 2.4
>> ( I
>> >> > updated to 2.4.1, but still no luck). The exception is preceded by an
>> >> > exception in optimizing the index. I am not reopening the reader while
>> >> the
>> >> > commit or optimization is going on in the writer (optimizing happens
>> in
>> >> the
>> >> > same thread, but much less often). The issues go away once I turn off
>> >> > optimizations. I was also getting this problem before I turned off the
>> >> use
>> >> > of compound files. I would appreciate any guidance.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > Regards,
>> >> > Khawaja
>> >> >
>> >> >
>> >> > 2009-04-09 15:57:47,033 (941820) [Index Maint Thread] ERROR
>> >> > gov.nasa.ensemble.core.indexer.Indexer  - java.io.IOException:
>> background
>> >> > merge hit exception: _8:C41258 _9:C11382 into _a [optimize]
>> >> > java.io.IOException: background merge hit exception: _8:C41258
>> _9:C11382
>> >> > into _a [optimize]
>> >> >    at
>> 

Re: MultiSearcher query with Sort option

2009-04-10 Thread Michael McCandless
This (reversing a SortField.FIELD_DOC) should work... if it doesn't it's a bug.

SortField.FIELD_DOC and SortField.FIELD_SCORE are "first class"
SortField objects.

Mike




Re: MultiSearcher query with Sort option

2009-04-10 Thread Preetham Kajekar

Hi Uwe,
Thanks for your response. However, I could not find an API in 
SortField or Sort to achieve this. A SortField can be wrapped inside a 
Sort, but there seems to be no way to specify that the order be reversed.


Thx,
~preetham





Re: MultiSearcher query with Sort option

2009-04-10 Thread Preetham Kajekar

Hi,
I just realized it was a bug in my code.
On a related note, is it possible to sort based on reverse index order?

Thanks,
~preetham




Re: MultiSearcher query with Sort option

2009-04-10 Thread Preetham Kajekar

Hi,
I found the API in another post on the net.

new Sort(new SortField(null, SortField.DOC, true))

The trick is to set the field to null.

Thanks for the help.




Re: Query any data

2009-04-10 Thread Matthew Hall
I think I would tackle this in a slightly different manner.

When you are creating the index, make sure that the field has a
default value, and make sure this value is something that could never appear
in the index otherwise. Then, when you go to place this field into the
index, write out either your actual value or the default one.

Then, when you get the document back, you can look at that field and
answer your question. You can also craft queries that specifically avoid
entries that don't have a value in this field, using a NOT clause.
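A sketch of this sentinel approach (Lucene 2.4-era API; the field name `optional` and the sentinel `__empty__` are hypothetical choices, and the sentinel must never occur as a real value):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;

// Index time: store the real value, or the sentinel when it is missing.
Document doc = new Document();
String actualValue = null;  // stand-in for the (possibly absent) value
String stored = (actualValue != null) ? actualValue : "__empty__";
doc.add(new Field("optional", stored,
        Field.Store.YES, Field.Index.NOT_ANALYZED));

// Search time: "has any value" = all documents minus the sentinel.
BooleanQuery hasValue = new BooleanQuery();
hasValue.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
hasValue.add(new TermQuery(new Term("optional", "__empty__")),
        BooleanClause.Occur.MUST_NOT);
```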

Hope this helps,

Matt

Erick Erickson wrote:
> searching for fieldname:* will be *extremely* expensive as it will, by
> default,
> build a giant OR clause consisting of every term in the field. You'll throw
> MaxClauses exceptions right and left. I'd follow Tim's thread lead first
>
> Best
> Erick
>
> 2009/4/8 王巍巍 
>
>> first you should change your QueryParser to accept wildcard queries by
>> calling the QueryParser method setAllowLeadingWildcard,
>> then you can query like this:  fieldname:*
>>
>> 2009/4/9 Tim Williams 
>>
>>> On Wed, Apr 8, 2009 at 11:45 AM, addman  wrote:
>>>> Hi,
>>>>   Is it possible to create a query to search a field for any value?  I
>>>> just need to know if the optional field contains any data at all.
>>> google for:  lucene field existence
>>>
>>> There's no way built in, one strategy[1] is to have a 'meta field'
>>> that contains the names of the fields the document contains.
>>>
>>> --tim
>>>
>>> [1] -
>>> http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg07703.html
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>   
>> --
>> 王巍巍(Weiwei Wang)
>> Department of Computer Science
>> Gulou Campus of Nanjing University
>> Nanjing, P.R.China, 210093
>>
>> Mobile: 86-13913310569
>> MSN: ww.wang...@gmail.com
>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>
>> 
>
>   


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012






SpellChecker in use with composite query

2009-04-10 Thread Amin Mohammed-Coleman
Hi
I have been playing around with the SpellChecker class and so far it looks
really good.  While developing a testcase to show it working I came across a
couple of issues which I have resolved but I'm not certain if this is the
correct approach.  I would therefore be grateful if anyone could tell me
whether it is correct or I should try something else.

1) Multiple indexes:
I have multiple indexes which store different documents based on certain
subject matter.  So in order to perform the spellchecking against all indexes
I did something like this:

IndexReader spellReader = IndexReader.open(fsDirectory1);

IndexReader spellReader2 = IndexReader.open(fsDirectory2);

MultiReader multiReader = new MultiReader(new IndexReader[]
{spellReader,spellReader2});

LuceneDictionary luceneDictionary = new LuceneDictionary(multiReader,
"content");

Directory spellDirectory = FSDirectory.getDirectory(

Re: Query any data

2009-04-10 Thread Tim Williams
2009/4/10 Matthew Hall :
> I think I would tackle this in a slightly different manner.
>
> When you are creating this index, make sure that that field has a
> default value. Make sure this value is something that could never appear
> in the index otherwise. Then, when you goto place this field into the
> index, either write out your actual value, or the default one.
>
> Then when you get the document back, you can look at that field, and
> solve your question. You can also craft queries that specifically avoid
> entries that don't have a value in this field with a not clause.

I think this is limited by...

... not being able to [easily] add new fields over time... you'd have
to reindex all documents (to insert the new magic token) just to add a
new field.

... requiring additional manipulation for appendable, updateable
fields... when you append new data to a field, you'd have to go in and
remove the special token.
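The "meta field" strategy mentioned above can be sketched as follows (Lucene 2.4-era API; the meta field name `fields` and the queried field `myField` are hypothetical):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Before adding the document, record every field name it carries in a
// single "fields" meta field; "field existence" then becomes a cheap
// TermQuery instead of a huge wildcard expansion.
void addFieldNames(Document doc) {
    List names = new ArrayList();
    for (Iterator it = doc.getFields().iterator(); it.hasNext();) {
        names.add(((Fieldable) it.next()).name());
    }
    for (Iterator it = names.iterator(); it.hasNext();) {
        doc.add(new Field("fields", (String) it.next(),
                Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
}

// "does this document have a value in myField?"
// TermQuery hasMyField = new TermQuery(new Term("fields", "myField"));
```

As Tim notes, this still requires reindexing a document whenever its set of fields changes.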

--tim




RE: Wordnet indexing error

2009-04-10 Thread Sudarsan, Sithu D.
 
Thanks Otis,

Yes, we figured that out! Since we do not intend to migrate to 2.4 yet,
we used the syns2index source code from svn. The problem is now taken
care of.

This part is for all: 

This brings us to the next question: 

1. Is there some contrib code available for using hypernyms and such, in
addition to synonyms from Wordnet?
2. Is there some code to add user defined dictionary/ontology, as an
additional layer to Wordnet (some sort of multi-level)?

Thank you all in advance,

Sincerely,
Sithu 

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Thursday, April 09, 2009 1:06 AM
To: java-user@lucene.apache.org
Subject: Re: Wordnet indexing error


Hi,

The simplest thing to do is to grab the latest Lucene and the latest jar
for that Wordnet (syns2index) code.  That should work for you (that
UnIndexed method is an old method that doesn't exist any more).


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: "Sudarsan, Sithu D." 
> To: java-user@lucene.apache.org
> Sent: Wednesday, April 8, 2009 7:01:16 PM
> Subject: Wordnet indexing error
> 
> Hi All,
> 
> We're using Lucene 2.3.2 on Windows. When we try to generate index for
> WordNet2.0 using Syns2Index class, while indexing, the following error
> is thrown:
> 
> Java.lang.NoSuchMethodError:
>
org.apache.lucene.document.Field.UnIndexed(Ljava/lang/String;Ljava/lang/
> String;)Lorg/apache/lucene/document/Field;
> 
> Our code looks like this:
> 
> String[] filelocations = {"path/to/prolog/file", "path/to/index"};
> try {
>     Syns2Index.main(filelocations);
> } catch (Throwable t) {
>     t.printStackTrace();
> }
> 
> 
> The error typically happens at about line number 13 in the wn_s.pl
> file.
> 
> No luck with WordNet3.0 as well. We get the same error.
> 
> Any fix or solutions? 
> 
> Thanks in advance,
> Sithu D Sudarsan
> 
> sithu.sudar...@fda.hhs.gov
> sdsudar...@ualr.edu





RangeFilter performance problem using MultiReader

2009-04-10 Thread Raf
Hi,
we are experiencing some problems using RangeFilters and we think there are
some performance issues caused by MultiReader.

We have more or less 3M documents in 24 indexes and we read all of them
using a MultiReader.
If we do a search using only terms, there are no problems, but if we add
to the same search a RangeFilter that extracts a large subset of the
documents (e.g. 500K), it takes a lot of time to execute (about 15s).

In order to identify the problem, we have tried to consolidate the index: so
now we have the same 3M docs in a single 10GB index.
If we repeat the same search using this index, it takes only a small
fraction of the previous time (about 2s).

Is there something we can do to improve search performance using
RangeFilters with MultiReader or the only solution is to have only a single
big index?

Thanks,
Raf


Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms
(Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers.
 I think the only workaround is to merge your indexes down to a single
index.
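Merging the shards down to one index can be done offline with IndexWriter (a hedged sketch against the Lucene 2.4 API; the paths and shard count are hypothetical):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Create the target index, then merge all shard directories into it.
// addIndexesNoOptimize merges without forcing a full optimize.
Directory target = FSDirectory.getDirectory("/path/to/merged");
IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
Directory[] shards = new Directory[24];
for (int i = 0; i < shards.length; i++) {
    shards[i] = FSDirectory.getDirectory("/path/to/shard" + i);
}
writer.addIndexesNoOptimize(shards);
writer.close();
```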

But, Lucene trunk (not yet released) has fixed this, so that searching
through your MultiReader should give you the same performance as
searching on a single consolidated index -- if you test this (which
would be awesome!) please report back and let us know how it went.

Mike




Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Yonik Seeley
On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless
 wrote:
> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms
> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers.

Do we know why this is, and if it's fixable (the MultiTermEnum, not
the higher level query objects)?  Is it simply the maintenance of the
priority queue, or something else?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help to determine why an optimized index is proportionaly too big.

2009-04-10 Thread Andrzej Bialecki

Chris Hostetter wrote:
: The second stage index failed an optimization with a disk full exception 
: (I had to move it to another lucene machine with a larger disk partition 
: to complete the optimization. Is there a reason why a 22 day index would 
: be 10x the size of an 8 day index when the document indexing rate is 
: fairly constant? Also, is there a way to shrink the index without 
: regenerating it?


did you run CheckIndex after it failed to optimize the first time?  the 
failure may have left old temp files around that aren't actually part of 
the index but are taking up space. 

(Actually: does CheckIndex warn about unused files in the index directory 
so people can clean them up? i'm not sure)


It doesn't. But Luke has a function to do this.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Raf
Hi Mike,
thank you for your answer.

I have downloaded lucene-core-2.9-dev and I have executed my tests (both on
the multireader and on the consolidated index) using this new version, but the
performance is very similar to the previous one.
The big index is 7/8 times faster than the multireader version.

Raf

On Fri, Apr 10, 2009 at 4:48 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms
> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers.
>  I think the only workaround is to merge your indexes down to a single
> index.
>
> But, Lucene trunk (not yet released) has fixed this, so that searching
> through your MultiReader should give you the same performance as
> searching on a single consolidated index -- if you test this (which
> would be awesome!) please report back and let us know how it went.
>
> Mike
>
> On Fri, Apr 10, 2009 at 10:38 AM, Raf  wrote:
> > Hi,
> > we are experiencing some problems using RangeFilters and we think there
> are
> > some performance issues caused by MultiReader.
> >
> > We have more or less 3M documents in 24 indexes and we read all of them
> > using a MultiReader.
> > If we do a search using only terms, there are no problems, but if we
> add
> > to the same search terms a RangeFilter that extracts a large subset of
> the
> > documents (e.g. 500K), it takes a lot of time to execute (about 15s).
> >
> > In order to identify the problem, we have tried to consolidate the index:
> so
> > now we have the same 3M docs in a single 10GB index.
> > If we repeat the same search using this index, it takes only a small
> > fraction of the previous time (about 2s).
> >
> > Is there something we can do to improve search performance using
> > RangeFilters with MultiReader or the only solution is to have only a
> single
> > big index?
> >
> > Thanks,
> > Raf
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Lucene SnowBall unexpected behavior for some terms

2009-04-10 Thread AlexElba

Hello,
I was working with Lucene Snowball 2.3.2 and I switched to 2.4.0.
After the switch I came across a case where Lucene doesn't do lemmatization
correctly. So far I have found only one case: spa - spas. "spas" is not
getting lemmatized at all...
BTW, I saw the same behavior on Solr 1.3.


Anybody have any idea why?
-- 
View this message in context: 
http://www.nabble.com/Lucene-SnowBall-unexpected-behavior-for-some-terms-tp22991689p22991689.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 11:03 AM, Yonik Seeley
 wrote:
> On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless
>  wrote:
>> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms
>> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers.
>
> Do we know why this is, and if it's fixable (the MultiTermEnum, not
> the higher level query objects)?  Is it simply the maintenance of the
> priority queue, or something else?

We never fully explained it, but we have some ideas...

It's only if you iterate each term, and do a TermDocs.seek for each,
that Multi*Reader seems to show the problem.  Just iterating the terms
seems OK (I have a 51 segment index, and I can iterate ~ 10M unique
terms in ~8 seconds).

But loading FieldCache, or doing eg RangeQuery, also does a
MultiTermDocs.seek on each term, which in turn calls
SegmentTermDocs.seek for each of the sub-readers in sequence.  I
*think* maybe for highly unique terms, where typically all segments
but one actually have the term, the cost of invoking seek on those
segments without the term is high.  Really, somehow, we want to only
call seek on those segments that have the term, which we know from the
pqueue...

Mike
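
The enumerate-then-seek pattern Mike describes looks roughly like this (a sketch against the 2.4-era API, not the actual FieldCache code; the field name is an example):

```java
// Sketch (Lucene 2.4-era API) of enumerating every term of a field and
// seeking a sister TermDocs for each -- the pattern that over-seeks on
// Multi*Readers, because TermDocs.seek() probes every sub-reader in turn.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class EnumerateAndSeek {
  static void visitField(IndexReader reader, String field) throws Exception {
    TermEnum termEnum = reader.terms(new Term(field, ""));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term t = termEnum.term();
        if (t == null || !t.field().equals(field)) break;
        termDocs.seek(termEnum);        // costly on a MultiReader
        while (termDocs.next()) {
          int doc = termDocs.doc();     // visit each posting
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
  }
}
```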

> -Yonik
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 1:20 PM, Raf  wrote:
> Hi Mike,
> thank you for your answer.
>
> I have downloaded lucene-core-2.9-dev and I have executed my tests (both on
> the multireader and on the consolidated index) using this new version, but the
> performance is very similar to the previous one.
> The big index is 7/8 times faster than the multireader version.

Hmmm, interesting!

Can you provide more details about your tests?  EG the code fragment
showing your query, the creation of the MultiReader, how you run the
search, etc.?

Is the field that you're applying the RangeFilter on highly unique or
rather redundant?

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
When I did some profiling I saw that the slowdown came from tons of 
extra seeks (single segment vs multi-segment). What was happening was: 
the first couple of segments would have thousands of terms for the field, 
but as the segments logarithmically shrank in size, the number of terms 
per segment would drop dramatically - you basically end up with a 
long tail, e.g. 5000 4000 200 200 5 5 2. Because loading the field cache 
would enumerate every term, it would end up calling seek 5000 times 
against each segment - that appeared to be the slowdown for me.


We fixed this with LUCENE-1483, because we now load the FieldCache per 
segment; so instead of calling seek 5000 times for each segment, you call 
seek 5000 times for the first, 4000 for the next, then 200, 200, 5 and 5. 
That can add up to huge savings due to the long tail of low-term segments.
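
Those numbers can be sketched arithmetically (plain Java; the per-segment term counts are just the example figures above):

```java
// Seek-count arithmetic for the example term tail: 5000 4000 200 200 5 5 2.
// Multi-reader FieldCache loading seeks each enumerated term against every
// segment; per-segment loading (LUCENE-1483) only seeks a segment for its
// own terms. The counts are illustrative, not measured.
public class SeekCounts {
  static final int[] TERMS_PER_SEGMENT = { 5000, 4000, 200, 200, 5, 5, 2 };

  // ~5000 seeks (one per enumerated term) against each of the 7 segments.
  static int multiReaderSeeks() {
    return 5000 * TERMS_PER_SEGMENT.length;
  }

  // Per-segment loading: each segment is only seeked for its own terms.
  static int perSegmentSeeks() {
    int total = 0;
    for (int terms : TERMS_PER_SEGMENT) total += terms;
    return total;
  }

  public static void main(String[] args) {
    System.out.println(multiReaderSeeks() + " seeks vs " + perSegmentSeeks());
  }
}
```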


I had thought we would also see the advantage with multi-term queries - 
you rewrite against each segment and avoid extra seeks (though not 
nearly as many as when enumerating every term). As Mike pointed out to 
me back then, though: we still rewrite against the multi-reader and so 
see no real savings here. Unfortunately.


- Mark



Do we know why this is, and if it's fixable (the MultiTermEnum, not
the higher level query objects)?  Is it simply the maintenance of the
priority queue, or something else?



We never fully explained it, but we have some ideas...

It's only if you iterate each term, and do a TermDocs.seek for each,
that Multi*Reader seems to show the problem.  Just iterating the terms
seems OK (I have a 51 segment index, and I can iterate ~ 10M unique
terms in ~8 seconds).

But loading FieldCache, or doing eg RangeQuery, also does a
MultiTermDocs.seek on each term, which in turn calls
SegmentTermDocs.seek for each of the sub-readers in sequence.  I
*think* maybe for highly unique terms, where typically all segments
but one actually have the term, the cost of invoking seek on those
segments without the term is high.  Really, somehow, we want to only
call seek on those segments that have the term, which we know from the
pqueue...

Mike
  


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller  wrote:

> I had thought we would also see the advantage with multi-term queries - you
> rewrite against each segment and avoid extra seeks (though not nearly as
> many as when enumerating every term). As Mike pointed out to me back when
> though : we still rewrite against the multi-reader and so see no real
> savings here. Unfortunately.

But, RangeQuery.rewrite is simply enumerating terms, which I think is
working "OK".

It's enumerating terms, then seeking a sister TermDocs to each term,
that tickles the over-seeking problem.  FieldCache does that, and
RangeFilter on 2.4 does that, but RangeFilter (or RangeQuery with
constant score mode) on 2.9 should not (they should do it per
segment), which is why I'm baffled that Raf didn't see a speedup on
upgrading.

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller

Michael McCandless wrote:

On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller  wrote:

  

I had thought we would also see the advantage with multi-term queries - you
rewrite against each segment and avoid extra seeks (though not nearly as
many as when enumerating every term). As Mike pointed out to me back when
though : we still rewrite against the multi-reader and so see no real
savings here. Unfortunately.



But, RangeQuery.rewrite is simply enumerating terms, which I think is
working "OK".

It's enumerating terms, then seeking a sister TermDocs to each term,
that tickles the over-seeking problem.  FieldCache does that, and
RangeFilter on 2.4 does that, but RangeFilter (or RangeQuery with
constant score mode) on 2.9 should not (they should do it per
segment), which is why I'm baffled that Raf didn't see a speedup on
upgrading.

Mike
Ah, right - anything utilizing a filter will see the gain. It wouldn't 
be such a big gain unless there were a *lot* of matching terms though, 
right? FieldCache is so bad because it touches every term. A smaller 
percentage of terms for a field won't be nearly the problem.


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller

Michael McCandless wrote:


which is why I'm baffled that Raf didn't see a speedup on
upgrading.

Mike
  
Another point is that he may not have such a nasty set of segments - Raf 
says he has 24 indexes, which sounds like he may not have the 
logarithmic sizing you normally see. If you have a somewhat normal term 
distribution across all 24 segments, the problem is not exacerbated nearly 
as much (along with not being as bad, since it's not using all of the terms 
for the field).


24 segments are bound to be quite a bit slower than an optimized index 
for most things - and 24 segments of similar size may be worse 
than the usual 24 segments with logarithmically decreasing sizes.


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller

Mark Miller wrote:

Michael McCandless wrote:


which is why I'm baffled that Raf didn't see a speedup on
upgrading.

Mike
  
Another point is that he may not have such a nasty set of segments - 
Raf says he has 24 indexes, which sounds like he may not have the 
logarithmic sizing you normally see. If you have a somewhat normal term 
distribution across all 24 segments, the problem is not exacerbated 
nearly as much (along with not being as bad, since it's not using all of 
the terms for the field).

Better clarify this: it will still be a problem - you still have all the 
extra seeks - but there are not as many avoidable wasted seeks as with 
the long-tailed logarithmic segments.


24 segments are bound to be quite a bit slower than an optimized index 
for most things - and 24 segments of similar size may be worse 
than the usual 24 segments with logarithmically decreasing sizes.





--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller

Raf wrote:


We have more or less 3M documents in 24 indexes and we read all of them
using a MultiReader.
  


Is this a multireader containing multireaders?

--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:14 PM, Mark Miller  wrote:
> Raf wrote:
>>
>> We have more or less 3M documents in 24 indexes and we read all of them
>> using a MultiReader.
>>
>
> Is this a multireader containing multireaders?

Let's hear Raf's answer, but I think likely "yes".  But this shouldn't
be a problem because we recursively expand down to the segment readers
in IndexSearcher.gatherSubReaders.

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:11 PM, Mark Miller  wrote:
> Mark Miller wrote:
>>
>> Michael McCandless wrote:
>>>
>>> which is why I'm baffled that Raf didn't see a speedup on
>>> upgrading.
>>>
>>> Mike
>>>
>>
>> Another point is that he may not have such a nasty set of segments - Raf
>> says he has 24 indexes, which sounds like he may not have the logarithmic
>> sizing you normally see. If you have somewhat normal term distribution for
>> all 24 segments, the problem is not exacerbated nearly as much (along with
>> not being as bad, since it's not using all of the terms for the field).
>
> Better clarify this: it will still be a problem - you still have all the
> extra seeks - but there are not as many avoidable wasted seeks as with
> the long-tailed logarithmic segments.

Right, I think "uniqueness" of terms may be the driving factor.  So,
if segment sizes are all the same (no logarithmic tail), but terms are
very unique, you'll still have N-1 SegmentTermEnums trying to seek to
a term that they don't have.

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:06 PM, Mark Miller  wrote:

> 24 segments is bound to be quite a bit slower than an optimized index for
> most things

I'd be curious just how true this really is (in general)... my guess
is the "long tail of tiny segments" gets into the OS's IO cache (as
long as the system stays hot) and doesn't actually hurt things much.

Has anyone tested this (performance of unoptimized vs optimized
indexes, in general) recently?  To be a fair comparison, there should
be no deletions in the index.

Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



exponential boosts

2009-04-10 Thread Steven Bethard
I need to have a scoring model of the form:

s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN

where "d" is a document, "q" is a query, "sK" is a scoring function, and
"aK" is the exponential boost factor for that scoring function. As a
simple example, I might have:

s1 = TF-IDF score matching "text" field (e.g. a TermQuery)
a1 = 1.0

s2 = TF-IDF score matching "author" field (e.g. a TermQuery)
a2 = 0.1

s3 = PageRank score (e.g. a FieldScoreQuery)
a3 = 0.5

It's important that the "aK" parameters are exponents in the scoring
function and not just multipliers because it allows me to do a
particular kind of optimized search for the best parameter values.

How can I achieve this? My first thought was just that I should set the
boost factor for each query, but the boost factor is just a multiplier,
right?

My second thought was to subclass CustomScoreQuery and override
customScore, but as far as I can tell, CustomScoreQuery can only combine
a Query with a ValueSourceQuery, while I need to combine a Query with
another Query (e.g. the example above with two TermQuery scores).

How should I go about this?

Thanks in advance,

Steve

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: exponential boosts

2009-04-10 Thread Jack Stahl
Perhaps you'd find it easier to implement the equivalent:

log(s1(d, q))*a1 + ... + log(sN(d, q))*aN
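
Since exp is monotonically increasing, ranking by the additive log form is equivalent to ranking by the product form; a quick numeric check (the sub-scores and exponents below are invented, loosely following Steve's example):

```java
// Demonstrates that s1^a1 * s2^a2 * ... equals exp(a1*log(s1) + a2*log(s2) + ...),
// so the additive log form can be optimized in place of the product form.
public class LogSpaceBoost {
  // Product form: s1^a1 * s2^a2 * ... * sN^aN
  static double productForm(double[] s, double[] a) {
    double score = 1.0;
    for (int i = 0; i < s.length; i++) score *= Math.pow(s[i], a[i]);
    return score;
  }

  // Equivalent additive form: exp(sum of aK * log(sK)).
  // Note: requires every sub-score to be strictly positive.
  static double logForm(double[] s, double[] a) {
    double sum = 0.0;
    for (int i = 0; i < s.length; i++) sum += a[i] * Math.log(s[i]);
    return Math.exp(sum);
  }

  public static void main(String[] args) {
    double[] s = { 2.5, 0.8, 1.3 };   // invented sub-scores
    double[] a = { 1.0, 0.1, 0.5 };   // the example boost exponents
    System.out.println(productForm(s, a) + " ~= " + logForm(s, a));
  }
}
```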

On Fri, Apr 10, 2009 at 12:56 PM, Steven Bethard wrote:

> I need to have a scoring model of the form:
>
>s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN
>
> where "d" is a document, "q" is a query, "sK" is a scoring function, and
> "aK" is the exponential boost factor for that scoring function. As a
> simple example, I might have:
>
>s1 = TF-IDF score matching "text" field (e.g. a TermQuery)
>a1 = 1.0
>
>s2 = TF-IDF score matching "author" field (e.g. a TermQuery)
>a2 = 0.1
>
>s3 = PageRank score (e.g. a FieldScoreQuery)
>a3 = 0.5
>
> It's important that the "aK" parameters are exponents in the scoring
> function and not just multipliers because it allows me to do a
> particular kind of optimized search for the best parameter values.
>
> How can I achieve this? My first thought was just that I should set the
> boost factor for each query, but the boost factor is just a multiplier,
> right?
>
> My second thought was to subclass CustomScoreQuery and override
> customScore, but as far as I can tell, CustomScoreQuery can only combine
> a Query with a ValueSourceQuery, while I need to combine a Query with
> another Query (e.g. the example above with two TermQuery scores).
>
> How should I go about this?
>
> Thanks in advance,
>
> Steve
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


RE: RangeFilter performance problem using MultiReader

2009-04-10 Thread Uwe Schindler
You got a lot of answers and questions about your index structure. Now
another idea, maybe this helps you to speed up your RangeFilter:

What type of range do you want to query? From your index statistics, it
looks like a numeric/date field on which you filter very large ranges. If
the values are very fine-grained and you therefore hit a lot of terms in the
range, you might consider using TrieRangeFilter, which is a new contrib
module in the yet-unreleased Lucene 2.9:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/trie/package-summary.html

The name and API may change before release (if it moves to core), but you
can try it out; it is stable and currently runs on production
websites! It works for int, long, double, float and Date values (if encoded
using Date.getTime() as long).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Raf [mailto:r.ventag...@gmail.com]
> Sent: Friday, April 10, 2009 4:38 PM
> To: java-user@lucene.apache.org
> Subject: RangeFilter performance problem using MultiReader
> 
> Hi,
> we are experiencing some problems using RangeFilters and we think there
> are
> some performance issues caused by MultiReader.
> 
> We have more or less 3M documents in 24 indexes and we read all of them
> using a MultiReader.
> If we do a search using only terms, there are no problems, but if we
> add
> to the same search terms a RangeFilter that extracts a large subset of the
> documents (e.g. 500K), it takes a lot of time to execute (about 15s).
> 
> In order to identify the problem, we have tried to consolidate the index:
> so
> now we have the same 3M docs in a single 10GB index.
> If we repeat the same search using this index, it takes only a small
> fraction of the previous time (about 2s).
> 
> Is there something we can do to improve search performance using
> RangeFilters with MultiReader or the only solution is to have only a
> single
> big index?
> 
> Thanks,
> Raf


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: exponential boosts

2009-04-10 Thread Steven Bethard
On 4/10/2009 1:08 PM, Jack Stahl wrote:
> Perhaps you'd find it easier to implement the equivalent:
> 
> log(s1(d, q))*a1 + ... + log(sN(d, q))*aN

Yes, that's fine too - that's actually what I'd be optimizing anyway.

But how would I do that? If I took the query boost route, how do I get a
TermQuery to produce a score in log-space, while keeping the boost in
regular space? Or if I took the CustomScoreQuery route, how do I combine
two Query scores (not a Query score and a ValueSourceQuery score)?

Steve

> 
> On Fri, Apr 10, 2009 at 12:56 PM, Steven Bethard wrote:
> 
>> I need to have a scoring model of the form:
>>
>>s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN
>>
>> where "d" is a document, "q" is a query, "sK" is a scoring function, and
>> "aK" is the exponential boost factor for that scoring function. As a
>> simple example, I might have:
>>
>>s1 = TF-IDF score matching "text" field (e.g. a TermQuery)
>>a1 = 1.0
>>
>>s2 = TF-IDF score matching "author" field (e.g. a TermQuery)
>>a2 = 0.1
>>
>>s3 = PageRank score (e.g. a FieldScoreQuery)
>>a3 = 0.5
>>
>> It's important that the "aK" parameters are exponents in the scoring
>> function and not just multipliers because it allows me to do a
>> particular kind of optimized search for the best parameter values.
>>
>> How can I achieve this? My first thought was just that I should set the
>> boost factor for each query, but the boost factor is just a multiplier,
>> right?
>>
>> My second thought was to subclass CustomScoreQuery and override
>> customScore, but as far as I can tell, CustomScoreQuery can only combine
>> a Query with a ValueSourceQuery, while I need to combine a Query with
>> another Query (e.g. the example above with two TermQuery scores).
>>
>> How should I go about this?
>>
>> Thanks in advance,
>>
>> Steve
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Sequential match query

2009-04-10 Thread John Seer

Hello,
I have 3 terms and I want to match them in order. I tried to use a wildcard
query, but I am not getting any results back.

Terms: A C F

Doc: name:A B C D E F

query: name:A*C*F

I am not getting any results back.
Any suggestions? 

Thanks for help in advance 
-- 
View this message in context: 
http://www.nabble.com/Sequential-match-query-tp22995240p22995240.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Different Analyzer for different fields in the same document

2009-04-10 Thread John Seer

Hello,
Is there any way that a single document's fields can have different
analyzers for different fields?

I think one way of doing it is to create a custom analyzer which does
field-specific analysis.

Any other suggestions?
-- 
View this message in context: 
http://www.nabble.com/Different-Analyzer-for-different-fields-in-the-same-document-tp22995442p22995442.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Different Analyzer for different fields in the same document

2009-04-10 Thread Koji Sekiguchi

John Seer wrote:

Hello,
Is there any way that a single document's fields can have different
analyzers for different fields?

I think one way of doing it is to create a custom analyzer which does
field-specific analysis.

Any other suggestions?
  


There is PerFieldAnalyzerWrapper
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

Koji
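
For reference, a minimal usage sketch of PerFieldAnalyzerWrapper (2.4-era API; the field names and analyzer choices are just examples):

```java
// Hedged sketch (Lucene 2.4-era API): one analyzer per field via
// PerFieldAnalyzerWrapper, with a default for unregistered fields.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldExample {
  static Analyzer buildAnalyzer() {
    // Default analyzer for any field not registered below.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    // Keep "id" as a single untokenized token for exact matching.
    wrapper.addAnalyzer("id", new KeywordAnalyzer());
    // Pass the same wrapper to both IndexWriter and QueryParser so
    // indexing and query analysis agree.
    return wrapper;
  }
}
```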


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: exponential boosts

2009-04-10 Thread Steven Bethard
On 4/10/2009 12:56 PM, Steven Bethard wrote:
> I need to have a scoring model of the form:
> 
> s1(d, q)^a1 * s2(d, q)^a2 * ... * sN(d, q)^aN
> 
> where "d" is a document, "q" is a query, "sK" is a scoring function, and
> "aK" is the exponential boost factor for that scoring function. As a
> simple example, I might have:
> 
> s1 = TF-IDF score matching "text" field (e.g. a TermQuery)
> a1 = 1.0
> 
> s2 = TF-IDF score matching "author" field (e.g. a TermQuery)
> a2 = 0.1
> 
> s3 = PageRank score (e.g. a FieldScoreQuery)
> a3 = 0.5
> 
> It's important that the "aK" parameters are exponents in the scoring
> function and not just multipliers because it allows me to do a
> particular kind of optimized search for the best parameter values.
> 
> How can I achieve this? My first thought was just that I should set the
> boost factor for each query, but the boost factor is just a multiplier,
> right?
> 
> My second thought was to subclass CustomScoreQuery and override
> customScore, but as far as I can tell, CustomScoreQuery can only combine
> a Query with a ValueSourceQuery, while I need to combine a Query with
> another Query (e.g. the example above with two TermQuery scores).

My third thought was to create a wrapper class that takes a Query and an
exponential boost factor. The wrapper class would delegate to the Query
for all methods except .weight(). For .weight(), it would return a
Weight wrapper that delegated to the Weight for all methods except
.getValue(). For .getValue(), it would return the original value, raised
to the appropriate exponent. But will that really work, or am I going to
mess up the normalization or something else?

Steve

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Different Analyzer for different fields in the same document

2009-04-10 Thread John Seer

Thanks, this is a useful class for the future...

Koji Sekiguchi-2 wrote:
> 
> John Seer wrote:
>> Hello,
>> Is there any way that a single document's fields can have different
>> analyzers for different fields?
>>
>> I think one way of doing it is to create a custom analyzer which does
>> field-specific analysis.
>>   
>>
>> Any other suggestions?
>>   
> 
> There is PerFieldAnalyzerWrapper
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
> 
> Koji
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Different-Analyzer-for-different-fields-in-the-same-document-tp22995442p22996572.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org