Lucene and numerical fields search

2005-07-12 Thread Rifflard Mickaël
Hi all,

I have been using Lucene as a full-text search engine for a year now, and it 
works well for this.
Now I want to add search capabilities to my application such as: aField greater 
than 10, aField between 10 and 20.
For this, I used RangeQuery (aField:[10 TO 20], for example) and I received all 
documents whose field lies between these two values.
All was perfect until I ran benchmarks. These were poor: results came back after 
dozens of seconds.

Here are the reasons for this: 

First, I don't know the min or max values for my field, so a request can be 
aField:[0 TO 100]. 
If I understand correctly, this request is transformed into a BooleanQuery with 
one million TermQuery clauses joined by OR.

Second, to perform greater-than or lower-than requests, I wanted to use 
RangeQuery with Integer.MAX_VALUE (for greater than) 
or Integer.MIN_VALUE (for lower than). If I still understand correctly, this 
RangeQuery would be transformed into a BooleanQuery with many 
millions of TermQuery clauses. 
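
For reference, a minimal sketch of the expansion described above. It does not use Lucene classes: a TreeSet stands in for the term dictionary, and the 5000 indexed values are an assumption made for illustration. As far as I know, the range expands to one clause per distinct term actually present in the index, not one per possible integer, and 1024 is the default BooleanQuery clause limit.

```java
import java.util.Locale;
import java.util.TreeSet;

final class RangeExpansionSketch {
    // Default BooleanQuery clause limit in Lucene; exceeding it raises TooManyClauses.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    /** One OR'ed clause per distinct indexed term inside [lower, upper]. */
    static int expandedClauses(TreeSet<String> indexedTerms, String lower, String upper) {
        return indexedTerms.subSet(lower, true, upper, true).size();
    }

    public static void main(String[] args) {
        // Hypothetical index: 5000 distinct values, zero-padded so they sort numerically.
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 5000; i++) {
            terms.add(String.format(Locale.ROOT, "%010d", i));
        }
        // An open-ended "greater than or equal to 0" range up to Integer.MAX_VALUE.
        String lower = String.format(Locale.ROOT, "%010d", 0);
        String upper = String.format(Locale.ROOT, "%010d", Integer.MAX_VALUE);
        int clauses = expandedClauses(terms, lower, upper);
        System.out.println(clauses + " clauses; over the default limit: "
                + (clauses > DEFAULT_MAX_CLAUSES));
    }
}
```

So the cost depends on how many distinct values the field holds inside the range, which is why a wide range over a high-cardinality field is slow or fails outright.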

How can I perform this kind of search (with another way of using Lucene, with 
another solution, ...)?

Many thanks for your comments.


Mickaël Rifflard



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and numerical fields search

2005-07-12 Thread Paul . Illingworth





I have similar requirements. To get around the "Too many clauses" problem I
am creating a Filter (this takes one or two seconds to create on an index
of around 25 documents) instead of using the RangeQuery. It's not ideal
but it does sidestep the problem. If you are using the same range in your
queries then you can cache the filters and these can be reused on the same
IndexReader/IndexSearcher instance.
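
A rough sketch of the filter-plus-cache idea, written without Lucene classes so it stands alone. The TreeMap of term to document ids is a stand-in for the index, and the class and method names are hypothetical; a real Filter would walk the reader's terms instead.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class CachedRangeFilterSketch {
    private final TreeMap<String, int[]> postings; // term -> doc ids (stand-in for the index)
    private final int maxDoc;
    private final Map<String, BitSet> cache = new HashMap<>();
    int builds = 0; // counts how often a bit set was actually computed

    CachedRangeFilterSketch(TreeMap<String, int[]> postings, int maxDoc) {
        this.postings = postings;
        this.maxDoc = maxDoc;
    }

    /** Returns one bit per document, set when the field value lies in [lower, upper]. */
    BitSet bits(String lower, String upper) {
        String key = lower + " TO " + upper;
        BitSet cached = cache.get(key);
        if (cached != null) {
            return cached; // same range on the same index: reuse
        }
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        builds++;
        cache.put(key, bits);
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("0005", new int[] {0});
        postings.put("0010", new int[] {1, 3});
        postings.put("0015", new int[] {2});
        postings.put("0025", new int[] {4});
        CachedRangeFilterSketch filter = new CachedRangeFilterSketch(postings, 5);
        System.out.println(filter.bits("0010", "0020")); // {1, 2, 3}
        filter.bits("0010", "0020");                     // served from the cache
        System.out.println("builds: " + filter.builds);  // builds: 1
    }
}
```

The cache is only valid for the index snapshot it was built from, which matches the remark about reusing filters on the same IndexReader/IndexSearcher instance.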

Paul I.


   



RE: Lucene and numerical fields search

2005-07-12 Thread Rifflard Mickaël
Hi Paul,

I have seen the Filter feature and looked into how to use it to solve my problem.
But users can search over arbitrary ranges, so I don't see how to use filters 
in that case.

Mickaël




RE: Lucene and numerical fields search

2005-07-12 Thread Paul . Illingworth





Hi Mickaël,

Take a look at the org.apache.lucene.search.DateFilter class that comes
with Lucene. This does date range filtering (I am using a modified version
of this class for filtering my date format). It should be relatively
straightforward to modify this for filtering numeric ranges. If your numbers
are stored as zero-padded Strings then you should be able to leave the
bits() method as is; otherwise you might have to put some String-to-number
conversion in there somewhere.
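
To illustrate why zero padding matters: terms are compared as strings, so "9" sorts after "10" unless both are padded to the same width. A small sketch, assuming non-negative int values and a width of 10 digits (the helper name pad is made up for this example):

```java
import java.util.Locale;

final class ZeroPadSketch {
    /** Pads a non-negative value so that string order matches numeric order. */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10", so a string range gives wrong results.
        System.out.println("9".compareTo("10") > 0);                // true
        // Padded to 10 digits: order agrees with the numeric order.
        System.out.println(pad(9, 10).compareTo(pad(10, 10)) < 0);  // true
        // An open-ended "greater than 10" can use the largest int as its upper bound.
        System.out.println(pad(10, 10) + " TO " + pad(Integer.MAX_VALUE, 10));
    }
}
```

Negative numbers need an extra step (for example an offset) before padding, which this sketch leaves out.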

Regards

Paul I


   



Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-12 Thread Chris Hostetter

: Since this isn't in production yet, I'd rather be proven wrong now
: rather than later! :)

it sounds like what you're doing makes a lot of sense given your
situation, and the nature of your data.

the one thing you might not have considered yet, which doesn't have to
make a big difference in your overall architecture, but might influence
the specifics of your design, is the idea that eventually you might want
to separate Projects onto different physical servers, letting you put
"important" projects on their own server, so they are always available
(even if they are the LRU).



-Hoss





Re: How to get the un-stemed word

2005-07-12 Thread Andrew Boyd
That's why (or at least one of the reasons) I wish the token type was stored in the 
index.

-Original Message-
From: markharw00d <[EMAIL PROTECTED]>
Sent: Jul 11, 2005 4:08 PM
To: java-user@lucene.apache.org
Subject: Re: How to get the un-stemed word

>>Would that show up in the TermVectors?

Yes, but you would need a scheme for identifying "original, unstemmed" terms vs 
stems. For example, you could use another field and analyzer for the unstemmed 
forms.


Andrew Boyd wrote:

>What about storing the unstemmed word at the same position as the stemmed 
>word?  Would that show up in the TermVectors?
>
>-Original Message-
>From: mark harwood <[EMAIL PROTECTED]>
>Sent: Jul 8, 2005 10:44 AM
>To: java-user@lucene.apache.org, Andrew Boyd <[EMAIL PROTECTED]>
>Subject: Re: How to get the un-stemed word
>
>You can get the unstemmed word by re-analysing the
>(hopefully stored somewhere) text.
>Look at the tokens emitted from the TokenStream and
>when you get to the one that matches the stemmed form
>you can use the token offset info to retrieve the
>unstemmed form from the original text. 
>
>Another option which avoids re-analysis is to store
>the TermVector with TermPositionVector info enabled.
>All the offsets are then stored in the index, rather
>than computed on-the-fly by an Analyzer.
>
>The highlighter in the sandbox can use both of these
>approaches to get the original forms.
>
>Cheers
>Mark
>
>Andrew Boyd
>Software Architect
>Sun Certified J2EE Architect
>B&B Technical Services Inc.
>205.422.2557
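
The re-analysis approach quoted above (walk the tokens again and use their offsets to cut the original word out of the stored text) can be sketched without Lucene as follows. The regex tokenizer and the toy stem() method are assumptions standing in for a real Analyzer; only the offset bookkeeping is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnstemSketch {
    // Toy stemmer (assumption): stands in for whatever the real analyzer does.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("s") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Re-tokenizes the stored text and returns the original forms behind one stem. */
    static List<String> originalForms(String text, String stemmed) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        while (m.find()) {
            if (stem(m.group()).equals(stemmed)) {
                // start/end offsets point back into the untouched original text
                found.add(text.substring(m.start(), m.end()));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Searching the index while the search runs";
        System.out.println(originalForms(text, "search")); // [Searching, search]
    }
}
```

Storing the offsets in the index, as the second quoted option suggests, avoids the re-tokenizing loop at the cost of a larger index.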



Re: Lucene and numerical fields search

2005-07-12 Thread Yonik Seeley
I use ConstantScoreRangeQuery for this purpose:

http://issues.apache.org/bugzilla/show_bug.cgi?id=34673

-Yonik




Re: Lucene and numerical fields search

2005-07-12 Thread Ray Tsang
It seems TooManyClauses is a potential problem for any query that
expands to a series of OR'ed boolean queries (PrefixQuery,
WildcardQuery, RangeQuery...).  If the max was set too high, the
inefficiency would make the search unusable.

I kind of worked around this by creating a BitSetQuery, and extended
PrefixQuery and WildcardQuery so that they rewrite to BitSetQuery. 
This at least made both queries usable on a large index (~40mil
documents) with acceptable speed, without the use of filters.  I also
extended QueryParser so it creates the new PrefixQuery and the new
WildcardQuery instead of the ones provided by the distribution.
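
A sketch of the idea behind rewriting a prefix query to a bit set rather than to OR'ed clauses. This is not the BitSetQuery code itself, which was not posted; the TreeMap is a stand-in for the sorted term dictionary and all names are hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

final class PrefixBitSetSketch {
    /** Marks every document containing a term that starts with the prefix. */
    static BitSet matchPrefix(TreeMap<String, int[]> postings, String prefix, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // Terms are sorted, so all terms sharing the prefix are contiguous.
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // walked past the last matching term
            }
            for (int doc : e.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[] {0, 2});
        postings.put("lucid", new int[] {1});
        postings.put("luck", new int[] {3});
        postings.put("search", new int[] {4});
        System.out.println(matchPrefix(postings, "luc", 5));  // {0, 1, 2, 3}
        System.out.println(matchPrefix(postings, "luce", 5)); // {0, 2}
    }
}
```

Memory stays at one bit per document however many terms match, so no clause limit applies; the trade-off is that every hit gets the same score.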

Ray




RE: Search deadlocking under load

2005-07-12 Thread Nathan Brackett
Otis,

After further testing it turns out that the 'deadlock' we're encountering is
not a deadlock at all, but a result of Resin hitting its maximum number of
allowed threads.  We bumped up the max-threads in the config and it fixed
the problem for a certain amount of load, but we'd much prefer to go after
the source of the problem, namely:

As the number of threads hitting Lucene increases, contention for locks
increases, meaning the average response time increases.  This places us in a
downward spiral of performance: while the incoming number of hits per
second stays constant, the response time grows, meaning that the total
number of threads inside Resin doing work will increase.  This problem
compounds itself, escalating the number of threads in Resin until we crash.


Admittedly this is a pretty harsh test (~~20 hits per second triggering
complex searches, which starts fine but then escalates to > 150 threads as
processing slows down but number of incoming hits per second does not)
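
The spiral follows from Little's law: the average number of requests in flight equals the arrival rate times the average response time. A small worked sketch; the 20 hits per second and 150 threads are the figures from this message, while the 0.5 s and 7.5 s response times are illustrative assumptions.

```java
final class LittlesLawSketch {
    /** Little's law: requests in flight = arrival rate * average response time. */
    static double threadsInFlight(double hitsPerSecond, double responseSeconds) {
        return hitsPerSecond * responseSeconds;
    }

    public static void main(String[] args) {
        // At 0.5 s per search, 20 hits/s keeps about 10 threads busy.
        System.out.println(threadsInFlight(20.0, 0.5));  // 10.0
        // Once lock contention pushes the response time to 7.5 s,
        // the same 20 hits/s occupies 150 threads.
        System.out.println(threadsInFlight(20.0, 7.5));  // 150.0
    }
}
```

Raising max-threads therefore only moves the ceiling; the thread count keeps climbing for as long as the response time does.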

Our ultimate goal, however, is to have each search be completely and 100%
parallel.

The point of contention seems to be the method below:

FSDirectory.java:486 (class FSInputStream)



  protected final void readInternal(byte[] b, int offset, int len)
      throws IOException {
    synchronized (file) {
      long position = getFilePointer();
      if (position != file.position) {
        file.seek(position);
        file.position = position;
      }
      int total = 0;
      do {
        int i = file.read(b, offset + total, len - total);
        if (i == -1)
          throw new IOException("read past EOF");
        file.position += i;
        total += i;
      } while (total < len);
    }
  }




The threads are usually all lined up to reach this.  Why are so many threads
backed up behind the same instance of FSInputStream.readInternal?  Shouldn't
each search have a different input stream?  What would you suggest as the
best path to achieve 100% parallel searching?  Here's a sample of our thread
dump, you can see 2 threads waiting for the same FSInputStream$Descriptor
(which is the synchronized(file) above):

"tcpConnection-8080-11" daemon prio=5 tid=0x08304600 nid=0x8304800 waiting
for monitor entry [bf494000..bf494d08]
        at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412)
        - waiting to lock <0x2f2b7a38> (a org.apache.lucene.store.FSInputStream$Descriptor)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
        at org.apache.lucene.search.TermScorer.next(TermScorer.java:55)
        at org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112)
        at org.apache.lucene.search.Scorer.score(Scorer.java:37)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)
        at com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.java:335)
        at com.nettemps.search.backend.IndexAccessControl.doSearch(IndexAccessControl.java:100)

"tcpConnection-8080-10" daemon prio=5 tid=0x08336800 nid=0x8336a00 waiting
for monitor entry [bf4d5000..bf4d5d08]
        at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412)
        - waiting to lock <0x2f2b7a38> (a org.apache.lucene.store.FSInputStream$Descriptor)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
        at org.apache.lucene.search.TermScorer.next(TermScorer.java:55)
        at org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112)
        at org.apache.lucene.search.Scorer.score(Scorer.java:37)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)
        at com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.java:335)


Re: Index Partitioning ( was Re: Search deadlocking under load)

2005-07-12 Thread Paul Smith


On 13/07/2005, at 1:34 AM, Chris Hostetter wrote:

> : Since this isn't in production yet, I'd rather be proven wrong now
> : rather than later! :)
>
> it sounds like what you're doing makes a lot of sense given your
> situation, and the nature of your data.
>
> the one thing you might not have considered yet, which doesn't have to
> make a big difference in your overall architecture, but might influence
> the specifics of your design, is the idea that eventually you might want
> to separate Projects onto different physical servers, letting you put
> "important" projects on their own server, so they are always available
> (even if they are the LRU).



Yes, thanks. Initially we won't do this until we understand more
about the usage profile and how the IndexSearchers are being aged
out of the cache.  We have a mirror index server kept in sync, and
plan to put Apache in front of both (as long as we can prove the 2
parts of the mirror stay in sync; initially we'll just set Apache to
favor 1 server, with manual failover until we're completely sure).
We eventually plan to implement index partitioning such that not all
projects sit on each server, with the servers broadcasting which
projects they contain to clients.


Paul








SIMPLE Lucene / MySQL Indexer

2005-07-12 Thread Klaus Hubert
Hi,

I played with several search engines to replace the MySQL
FULLTEXT index and hope that Lucene is the best
solution for that.

I am reading Manning's book Lucene in Action, and Lucene
seems to be the most powerful search engine I have found so
far.

I'm stuck on a problem and need help from you
experts. I managed to create an index as described in
the examples. I also managed to read a MySQL database
in Java.

My question is whether anybody here has a SIMPLE
example that does this in one step. I am good at PHP
and Visual Basic, but very new to Java. Maybe I'm
using the wrong tools (NetBeans IDE and JCreator), but
I can't manage to create a Lucene index on 3
database fields.

I appreciate any help.

Thank you so much,

  Klaus




Re: SIMPLE Lucene / MySQL Indexer

2005-07-12 Thread Chris Lu

Please allow me to introduce DBSight.
It's based on Lucene and designed for searching any database.

Most things are done through a web UI. No coding is needed to create 
your search.

Check out this demo: http://search.dbsight.com

It's free to download and test. The developer edition and non-profit 
usage are free.


Chris Lu
---
Full-Text Search on Any Database
http://www.dbsight.net
