Lucene and numerical fields search
Hi all,

I have been using Lucene as a full-text search engine for a year now, and it works well for that. Now I want to add numeric search to my application, for example: aField greater than 10, or aField between 10 and 20. For this I used RangeQuery (aField:[10 TO 20], for example), and I received all documents whose field value lies between these two values.

Everything looked perfect until I ran benchmarks. The results were poor: queries took dozens of seconds to return. Here are the reasons, as I understand them.

First, I don't know the minimum or maximum values of my field, so a request can be aField:[0 TO 100]. If I understand correctly, this request is transformed into a BooleanQuery with one million TermQuery clauses joined by OR.

Second, to perform "greater than" or "lower than" requests, I wanted to use RangeQuery with Integer.MAX_VALUE (for greater than) or Integer.MIN_VALUE (for lower than). If I still understand correctly, this RangeQuery would be transformed into a BooleanQuery with many millions of TermQuery clauses.

How can I perform this kind of search (by using Lucene differently, with another solution, ...)?

Many thanks for your comments.

Mickaël Rifflard

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene and numerical fields search
I have similar requirements. To get around the "Too many clauses" problem I create a Filter instead of using the RangeQuery (this takes one or two seconds to create on an index of around 25 documents). It's not ideal, but it does sidestep the problem. If you use the same range in several queries, you can cache the filters and reuse them on the same IndexReader/IndexSearcher instance.

Paul I.
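The caching idea described above can be sketched roughly as follows. This is a minimal illustration in plain, current Java, not Lucene code: the class name RangeFilterCache, the use of a plain Object as a stand-in for the IndexReader, and the Supplier that builds the bit set are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/** Hypothetical cache: one BitSet per (reader, range) pair. */
final class RangeFilterCache {
    // The outer key stands in for an IndexReader instance. WeakHashMap lets
    // the cached bit sets disappear once the reader itself is unreachable.
    private final Map<Object, Map<String, BitSet>> cache = new WeakHashMap<>();

    synchronized BitSet get(Object reader, String lower, String upper,
                            Supplier<BitSet> builder) {
        Map<String, BitSet> perReader =
                cache.computeIfAbsent(reader, r -> new HashMap<>());
        String key = lower + " TO " + upper;
        // The (expensive) builder runs only on the first request for a range.
        return perReader.computeIfAbsent(key, k -> builder.get());
    }
}
```

The bit sets are keyed by reader because document numbers are only meaningful for the reader that produced them; a reopened index needs fresh bit sets.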
RE: Lucene and numerical fields search
Hi Paul,

I have seen the Filter feature and looked into how to use it to solve my problem. But users can search over arbitrary ranges, so I don't see how to use filters in that case.

Mickaël
RE: Lucene and numerical fields search
Hi Mickaël,

Take a look at the org.apache.lucene.search.DateFilter class that comes with Lucene. It does date range filtering (I am using a modified version of this class to filter my own date format). It should be relatively straightforward to modify it to filter numeric ranges. If your numbers are stored as zero-padded Strings, you should be able to leave the bits() method as it is; otherwise you might have to add some String-to-number conversion in there.

Regards

Paul I
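The point about zero-padded Strings can be illustrated with a small sketch. Terms are compared as strings, so "9" sorts after "10" unless every number is padded to the same width. The class and method names below are hypothetical, and the scheme shown only covers non-negative int values.

```java
import java.util.Locale;

/** Hypothetical helper: fixed-width encoding so string order equals numeric order. */
final class PaddedNumber {
    static final int WIDTH = 10; // Integer.MAX_VALUE has 10 decimal digits

    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different scheme");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    static int decode(String term) {
        return Integer.parseInt(term);
    }

    /** True if the encoded term lies within [lower, upper], compared as strings. */
    static boolean inRange(String term, int lower, int upper) {
        return term.compareTo(encode(lower)) >= 0
            && term.compareTo(encode(upper)) <= 0;
    }
}
```

With this encoding an open-ended "greater than 10" becomes the range from encode(10) to encode(Integer.MAX_VALUE), and the number of terms actually visited depends on the distinct values in the index, not on the width of the range.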
Re: Index Partitioning ( was Re: Search deadlocking under load)
: Since this isn't in production yet, I'd rather be proven wrong now
: rather than later! :)

It sounds like what you're doing makes a lot of sense given your situation and the nature of your data.

The one thing you might not have considered yet, which doesn't have to make a big difference to your overall architecture but might influence the specifics of your design, is the idea that eventually you might want to separate Projects onto different physical servers, letting you put "important" projects on their own server so they are always available (even if they are the LRU).

-Hoss
Re: How to get the un-stemed word
That's why (or at least one of the reasons) I wish the token type was stored in the index.

-Original Message-
From: markharw00d <[EMAIL PROTECTED]>
Sent: Jul 11, 2005 4:08 PM
To: java-user@lucene.apache.org
Subject: Re: How to get the un-stemed word

>> Would that show up in the TermVectors?

Yes, but you would need a scheme for identifying "original, unstemmed" terms vs stems. For example, you could use another field and analyzer for the unstemmed forms.

Andrew Boyd wrote:

> What about storing the unstemmed word at the same position as the stemmed
> word? Would that show up in the TermVectors?
>
> -Original Message-
> From: mark harwood <[EMAIL PROTECTED]>
> Sent: Jul 8, 2005 10:44 AM
> To: java-user@lucene.apache.org, Andrew Boyd <[EMAIL PROTECTED]>
> Subject: Re: How to get the un-stemed word
>
> You can get the unstemmed word by re-analysing the
> (hopefully stored somewhere) text.
> Look at the tokens emitted from the TokenStream and
> when you get to the one that matches the stemmed form
> you can use the token offset info to retrieve the
> unstemmed form from the original text.
>
> Another option which avoids re-analysis is to store
> the TermVector with TermPositionVector info enabled.
> All the offsets are then stored in the index, rather
> than computed on-the-fly by an Analyzer.
>
> The highlighter in the sandbox can use both of these
> approaches to get the original forms.
>
> Cheers
> Mark
>
> Andrew Boyd
> Software Architect
> Sun Certified J2EE Architect
> B&B Technical Services Inc.
> 205.422.2557
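The re-analysis approach quoted above (match the stemmed token, then use its offsets to cut the original word out of the stored text) might look roughly like this. The sketch is plain Java, not the Lucene TokenStream API: the Token class, the toy stem() function and the letter-based tokenizer are all assumptions standing in for a real Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class UnstemmedLookup {
    /** A token as an analyzer might emit it: indexed term plus source offsets. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy stemmer (an assumption for this sketch): strips a few English suffixes. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Splits on non-letters and records where each word started and ended. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) { i++; continue; }
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            tokens.add(new Token(stem(text.substring(start, i)), start, i));
        }
        return tokens;
    }

    /** Returns the original surface forms whose stem equals the indexed term. */
    static List<String> originals(String text, String stemmedTerm) {
        List<String> found = new ArrayList<>();
        for (Token t : analyze(text)) {
            if (t.term.equals(stemmedTerm)) {
                found.add(text.substring(t.start, t.end));
            }
        }
        return found;
    }
}
```

The same lookup works without re-analysis if the offsets were stored at index time, since only the (term, start, end) triples are needed.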
Re: Lucene and numerical fields search
I use ConstantScoreRangeQuery for this purpose:

http://issues.apache.org/bugzilla/show_bug.cgi?id=34673

-Yonik
Re: Lucene and numerical fields search
It seems TooManyClauses is a potential problem for any query that expands to a series of OR'ed boolean clauses (PrefixQuery, WildcardQuery, RangeQuery...). If the maximum were set too high, the inefficiency would make the search unusable.

I worked around this by creating a BitSetQuery, and extended PrefixQuery and WildcardQuery so that they rewrite to BitSetQuery. This at least made both queries usable on a large index (~40 million documents) with acceptable speed, without the use of filters. I also extended QueryParser so that it creates the new PrefixQuery and the new WildcardQuery instead of the ones provided by the distribution.

Ray
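The general idea behind a bit-set based rewrite can be sketched on a toy inverted index. This is not Ray's BitSetQuery or any Lucene class; the NavigableMap of term to document ids and the names RangeBits and matchingDocs are assumptions for illustration. Instead of one boolean clause per matching term, the sorted terms in the range are walked once and their documents are OR'ed into a single bit set.

```java
import java.util.BitSet;
import java.util.NavigableMap;

final class RangeBits {
    /**
     * Collects every document that contains a term in [lower, upper].
     * Memory is bounded by maxDoc bits, however many terms fall in the range.
     * Throws IllegalArgumentException if lower sorts after upper.
     */
    static BitSet matchingDocs(NavigableMap<String, int[]> postings,
                               String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // subMap visits only the terms that exist in the index within the range.
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

The trade-off is that all matches score the same: the bit set records only whether a document matched, not which term or how often.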
RE: Search deadlocking under load
Otis,

After further testing it turns out that the 'deadlock' we're encountering is not a deadlock at all, but a result of resin hitting its maximum number of allowed threads. We bumped up max-threads in the config and that fixed the problem for a certain amount of load, but we'd much prefer to go after the source of the problem, namely:

As the number of threads hitting Lucene increases, contention for locks increases, meaning the average response time increases. This places us in a downward spiral of performance: the incoming number of hits per second stays constant while each request takes longer, so the total number of threads inside resin doing work increases. The problem compounds itself, escalating the number of threads in resin until we crash. Admittedly this is a pretty harsh test (about 20 hits per second triggering complex searches, which starts fine but then escalates to > 150 threads as processing slows down while the number of incoming hits per second does not).

Our ultimate goal, however, is to have each search be completely and 100% parallel.

The point of contention seems to be the method below:

FSDirectory.java:486 (class FSInputStream)

    protected final void readInternal(byte[] b, int offset, int len)
        throws IOException {
      synchronized (file) {
        long position = getFilePointer();
        if (position != file.position) {
          file.seek(position);
          file.position = position;
        }
        int total = 0;
        do {
          int i = file.read(b, offset+total, len-total);
          if (i == -1)
            throw new IOException("read past EOF");
          file.position += i;
          total += i;
        } while (total < len);
      }
    }

The threads are usually all lined up to reach this. Why are so many threads backed up behind the same instance of FSInputStream.readInternal? Shouldn't each search have a different input stream? What would you suggest as the best path to achieve 100% parallel searching?
Here's a sample of our thread dump. You can see 2 threads waiting for the same FSInputStream$Descriptor (which is the synchronized(file) above):

"tcpConnection-8080-11" daemon prio=5 tid=0x08304600 nid=0x8304800 waiting for monitor entry [bf494000..bf494d08]
    at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412)
    - waiting to lock <0x2f2b7a38> (a org.apache.lucene.store.FSInputStream$Descriptor)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
    at org.apache.lucene.search.TermScorer.next(TermScorer.java:55)
    at org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)
    at com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.java:335)
    at com.nettemps.search.backend.IndexAccessControl.doSearch(IndexAccessControl.java:100)

"tcpConnection-8080-10" daemon prio=5 tid=0x08336800 nid=0x8336a00 waiting for monitor entry [bf4d5000..bf4d5d08]
    at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412)
    - waiting to lock <0x2f2b7a38> (a org.apache.lucene.store.FSInputStream$Descriptor)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
    at org.apache.lucene.search.TermScorer.next(TermScorer.java:55)
    at org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)
    at com.nettemps.search.backend.SingleIndexManager.search(SingleIndexManager.java:335)
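One possible direction for the question raised in this message, offered as a sketch rather than as what Lucene does in this version: the lock is needed because seek() followed by read() on a shared file handle is a two-step operation. java.nio's FileChannel has a positional read that takes the file offset as an argument and leaves the channel's own position untouched, so callers that each track their own position can share one channel without a synchronized block. The class name PositionalReader is hypothetical, and whether the lock contention really disappears depends on the JVM and operating system.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader: many threads may call read() on one shared instance. */
final class PositionalReader implements AutoCloseable {
    private final FileChannel channel;

    PositionalReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Fills b[offset .. offset+len) starting at the given file position. */
    void read(long position, byte[] b, int offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(b, offset, len);
        while (buf.hasRemaining()) {
            // Positional read: does not move the channel's shared position.
            int n = channel.read(buf, position);
            if (n == -1) {
                throw new IOException("read past EOF");
            }
            position += n;
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

The other route implied by the question, giving each searching thread its own file handle, avoids the shared lock as well, at the cost of more open file descriptors.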
Re: Index Partitioning ( was Re: Search deadlocking under load)
On 13/07/2005, at 1:34 AM, Chris Hostetter wrote:

> The one thing you might not have considered yet, which doesn't have to
> make a big difference to your overall architecture but might influence
> the specifics of your design, is the idea that eventually you might want
> to separate Projects onto different physical servers, letting you put
> "important" projects on their own server so they are always available
> (even if they are the LRU).
>
> -Hoss

Yes, thanks. Initially we won't do this, until we understand more about the usage profile and how the IndexSearchers are being aged out of the cache. We have a mirror index server kept in sync, and plan to put Apache in front of both (as long as we can prove the two halves of the mirror stay in sync; initially we'll just set Apache to favor one server, with manual failover, until we're completely sure).

Eventually we plan to implement index partitioning, so that not all projects sit on every server and each server broadcasts to clients which projects it contains.

Paul
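A cache that ages out the least recently used searcher, as discussed in this thread, could be sketched as below. This is a generic illustration in plain Java, not the poster's implementation: the class name LruResourceCache is hypothetical, and the values are any AutoCloseable standing in for an IndexSearcher.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps at most maxOpen resources; the least recently used is closed on overflow. */
final class LruResourceCache<K, V extends AutoCloseable> {
    private final LinkedHashMap<K, V> map;

    LruResourceCache(final int maxOpen) {
        // accessOrder = true: iteration order runs from least to most recently used.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                try {
                    eldest.getValue().close();
                } catch (Exception e) {
                    // A real system would log this; the sketch ignores it.
                }
                return true;
            }
        };
    }

    synchronized V get(K key) { return map.get(key); }

    synchronized void put(K key, V value) { map.put(key, value); }

    synchronized int size() { return map.size(); }
}
```

Two things the sketch leaves out: a searcher may still be in use by another thread when it is evicted, so production code would need reference counting or a grace period before closing, and put() on an existing key does not close the value it replaces.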
SIMPLE Lucene / MySQL Indexer
Hi,

I have played with several search engines as a replacement for the MySQL FULLTEXT index, and I hope that Lucene is the best solution for that. I am reading Manning's book Lucene in Action, and Lucene seems to be the most powerful search engine I have found so far.

I'm stuck on a problem and need help from you experts. I managed to create an index as described in the examples. I also managed to read a MySQL database in Java. My question is whether anybody here has a SIMPLE example that does both in one step. I am good at PHP and Visual Basic, but very new to Java. Maybe I'm using the wrong tools (NetBeans IDE and JCreator), but I haven't managed to create a Lucene index on 3 database fields.

I appreciate any help.

Thank you so much,
Klaus
Re: SIMPLE Lucene / MySQL Indexer
Please allow me to introduce DBSight. It is based on Lucene and oriented towards searching any database. Most things are done through a web UI, and no coding is needed to create your search.

Check out this demo: http://search.dbsight.com

It's free to download and test. The developer edition is free, as is non-profit usage.

Chris Lu
---
Full-Text Search on Any Database
http://www.dbsight.net