Re: Analyzer on query question
On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky wrote:
> Hi,
>
> I understand that generally speaking you should use the same analyzer on
> querying as was used on indexing. In my code I am using the SnowballAnalyzer
> on index creation. However, on the query side I am building up a complex
> BooleanQuery from other BooleanQuerys and/or PhraseQuerys on several fields.
> None of these require specifying an analyzer anywhere. This is causing some
> odd results, I think, because a different analyzer (or no analyzer?) is being
> used for the query.
>
> Question: how do I build my boolean and phrase queries using the
> SnowballAnalyzer?
>
> One thing I did that seemed to kind of work was to build my complex query
> normally, then build a snowball-analyzed query using a QueryParser
> instantiated with a SnowballAnalyzer. To do this, I simply pass the string
> value of the complex query to the QueryParser.parse() method to get the new
> query. Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title",
>     new SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

You can just use the analyzer directly, like this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title",
    new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

You also have access to the token positions if you want to create phrase queries etc.
Just add a PositionIncrementAttribute like this:

PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code, it's straight from the top of my head.

simon

> Like I said, this seems to kind of work but it doesn't feel right. Does this
> make sense? Is there a better way?
>
> thanks in advance,
>
> Bill

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
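Simon's token-stream outline can be fleshed out into a small phrase-query builder. The following is a hedged sketch against the Lucene 3.x API (3.1+ for CharTermAttribute); the field name "title" and the input text are illustrative, not from the thread:

```java
// Build a PhraseQuery from analyzed tokens, honoring position increments
// so gaps left by removed tokens (e.g. stopwords) are preserved.
// Sketch only -- double-check against your Lucene release.
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
TokenStream stream = analyzer.tokenStream("title", new StringReader("cells combine"));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posAttr =
    stream.addAttribute(PositionIncrementAttribute.class);

PhraseQuery phrase = new PhraseQuery();
int position = -1;
stream.reset();
while (stream.incrementToken()) {
  position += posAttr.getPositionIncrement();
  // add the analyzed (stemmed) term at its analyzed position
  phrase.add(new Term("title", termAttr.toString()), position);
}
stream.end();
stream.close();
```

With SnowballAnalyzer this would produce roughly title:"cell combin", so the query terms match the stemmed terms that were written to the index.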
Re: ToParentBlockJoinQuery - Faceting on Parent and Child Documents
Hi Jayendra,

This isn't supported yet. You could implement it by creating a custom Lucene collector. This collector could count the unique hits inside a block of docs per unique facet field value. The unique facet values could be retrieved from Lucene's FieldCache or doc values (if you can use Lucene 4.0 in your project). In general I think this would be a cool addition!

Martijn

On 25 July 2012 13:37, Jayendra Patil wrote:
> Thanks Mike for the wonderful work on ToParentBlockJoinQuery.
>
> We had a use case for relational data search and are working with
> ToParentBlockJoinQuery, which works perfectly as described at
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>
> However, I couldn't find any examples on the net, or even in the JUnit
> test cases, of using faceting on the parent or the child results.
>
> Is it supported yet? Can you provide us with any examples?
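Martijn's custom-collector idea might look roughly like this on Lucene 3.x, using the FieldCache to look up each hit's facet value. This is a hypothetical sketch: the class name and facet field are invented, and a real block-join version would additionally have to count at most one hit per parent block:

```java
// Hypothetical collector: counts hits per unique value of a single facet field.
public class FacetCountCollector extends Collector {
  private final String facetField;
  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private String[] values; // per-segment FieldCache values

  public FacetCountCollector(String facetField) {
    this.facetField = facetField;
  }

  @Override
  public void setScorer(Scorer scorer) { /* scores not needed for counting */ }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    // one value per doc in this segment
    values = FieldCache.DEFAULT.getStrings(reader, facetField);
  }

  @Override
  public void collect(int doc) {
    String v = values[doc]; // doc is segment-relative, matching the cache array
    if (v != null) {
      Integer c = counts.get(v);
      counts.put(v, c == null ? 1 : c + 1);
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() { return true; }

  public Map<String, Integer> getCounts() { return counts; }
}
```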
Re: ToParentBlockJoinQuery - Faceting on Parent and Child Documents
Hi Jayendra,

We use faceting and block-join queries on Lucene 3.6 like this:

- Create the FacetsCollector.
- For faceting on parent documents, use ToParentBlockJoinQuery; for faceting on children, use ToChildBlockJoinQuery (if needed, add additional query clauses using a BooleanQuery).
- Use searcher.search(query, null, facetsCollector).

This seems to work fine.

Best regards,
Christoph Kaser

Am 03.08.2012 13:50, schrieb Martijn v Groningen:
> Hi Jayendra,
> This isn't supported yet. You could implement this by creating a custom
> Lucene collector. This collector could count the unique hits inside a block
> of docs per unique facet field value. The unique facet values could be
> retrieved from Lucene's FieldCache or doc values (if you can use Lucene 4.0
> in your project). In general I think this would be a cool addition!
> Martijn
>
> On 25 July 2012 13:37, Jayendra Patil wrote:
>> Thanks Mike for the wonderful work on ToParentBlockJoinQuery.
>> We had a use case for relational data search and are working with
>> ToParentBlockJoinQuery, which works perfectly as described at
>> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>> However, I couldn't find any examples on the net, or even in the JUnit
>> test cases, of using faceting on the parent or the child results.
>> Is it supported yet? Can you provide us with any examples?
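Christoph's three steps, spelled out as a hedged sketch against the Lucene 3.6 facet and join modules. The field names, the taxonomy reader, and the exact constructor signatures here are assumptions; check your release's javadoc:

```java
// 1) A facets collector for a "brand" dimension (top 10 values); the
//    dimension name and count are illustrative.
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("brand"), 10));
FacetsCollector facetsCollector = new FacetsCollector(fsp, indexReader, taxoReader);

// 2) A block-join query: match child docs, but collect (and facet on) parents.
Filter parentsFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("docType", "parent"))));
Query childQuery = new TermQuery(new Term("skill", "java"));
Query parentQuery = new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.Avg);

// 3) Run the search with the facets collector, as Christoph describes.
searcher.search(parentQuery, null, facetsCollector);
List<FacetResult> results = facetsCollector.getFacetResults();
```

For faceting on child documents, the same pattern would use ToChildBlockJoinQuery in step 2 instead.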
RE: Analyzer on query question
Thanks Simon, Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to have been introduced until 3.1.0. Similarly my version of Lucene does not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant BooleanQuery.add(BooleanClause). In any case, most of what you're doing there, I'm just not familiar with. Seems very low level. I've never had to use TokenStreams to build a query before and I'm not really sure what is going on there. Also, I don't know what PositionIncrementAttribute is or how it would be used to create a PhraseQuery. The way I'm currently creating PhraseQuerys is very straightforward and intuitive. E.g. to search for the term "foo bar" I'd build the query like this: PhraseQuery phraseQuery = new PhraseQuery(); phraseQuery.add(new Term("title", "foo")); phraseQuery.add(new Term("title", "bar")); Is there really no easier way to associate the correct analyzer with these types of queries? Bill -Original Message- From: Simon Willnauer [mailto:simon.willna...@gmail.com] Sent: Friday, August 03, 2012 3:43 AM To: java-user@lucene.apache.org; Bill Chesky Subject: Re: Analyzer on query question On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky wrote: > Hi, > > I understand that generally speaking you should use the same analyzer on > querying as was used on indexing. In my code I am using the SnowballAnalyzer > on index creation. However, on the query side I am building up a complex > BooleanQuery from other BooleanQuerys and/or PhraseQuerys on several fields. > None of these require specifying an analyzer anywhere. This is causing some > odd results, I think, because a different analyzer (or no analyzer?) is being > used for the query. > > Question: how do I build my boolean and phrase queries using the > SnowballAnalyzer? > > One thing I did that seemed to kind of work was to build my complex query > normally then build a snowball-analyzed query using a QueryParser > instantiated with a SnowballAnalyzer. 
To do this, I simply pass the string
> value of the complex query to the QueryParser.parse() method to get the new
> query. Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new
> SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title",
    new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create phrase queries etc. just add a PositionIncrementAttribute like this:

PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code, it's straight from the top of my head.

simon

> Like I said, this seems to kind of work but it doesn't feel right. Does this
> make sense? Is there a better way?
>
> thanks in advance,
>
> Bill
Re: Analyzer on query question
You can add parsed queries to a BooleanQuery. Would that help in this case? SnowballAnalyzer sba = whatever(); QueryParser qp = new QueryParser(..., sba); Query q1 = qp.parse("some snowball string"); Query q2 = qp.parse("some other snowball string"); BooleanQuery bq = new BooleanQuery(); bq.add(q1, ...); bq.add(q2, ...); bq.add(loads of other stuff); -- ian. On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky wrote: > Thanks Simon, > > Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to > have been introduced until 3.1.0. Similarly my version of Lucene does not > have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant > BooleanQuery.add(BooleanClause). > > In any case, most of what you're doing there, I'm just not familiar with. > Seems very low level. I've never had to use TokenStreams to build a query > before and I'm not really sure what is going on there. Also, I don't know > what PositionIncrementAttribute is or how it would be used to create a > PhraseQuery. The way I'm currently creating PhraseQuerys is very > straightforward and intuitive. E.g. to search for the term "foo bar" I'd > build the query like this: > > PhraseQuery phraseQuery = new > PhraseQuery(); > phraseQuery.add(new > Term("title", "foo")); > phraseQuery.add(new > Term("title", "bar")); > > Is there really no easier way to associate the correct analyzer with these > types of queries? > > Bill > > -Original Message- > From: Simon Willnauer [mailto:simon.willna...@gmail.com] > Sent: Friday, August 03, 2012 3:43 AM > To: java-user@lucene.apache.org; Bill Chesky > Subject: Re: Analyzer on query question > > On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky > wrote: >> Hi, >> >> I understand that generally speaking you should use the same analyzer on >> querying as was used on indexing. In my code I am using the >> SnowballAnalyzer on index creation. 
However, on the query side I am
>> building up a complex BooleanQuery from other BooleanQuerys and/or
>> PhraseQuerys on several fields. None of these require specifying an
>> analyzer anywhere. This is causing some odd results, I think, because a
>> different analyzer (or no analyzer?) is being used for the query.
>>
>> Question: how do I build my boolean and phrase queries using the
>> SnowballAnalyzer?
>>
>> One thing I did that seemed to kind of work was to build my complex query
>> normally, then build a snowball-analyzed query using a QueryParser
>> instantiated with a SnowballAnalyzer. To do this, I simply pass the string
>> value of the complex query to the QueryParser.parse() method to get the new
>> query. Something like this:
>>
>> // build a complex query from other BooleanQuerys and PhraseQuerys
>> BooleanQuery fullQuery = buildComplexQuery();
>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new
>> SnowballAnalyzer(Version.LUCENE_30, "English"));
>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>
>> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
>> indexSearcher.search(snowballAnalyzedQuery, collector);
>
> you can just use the analyzer directly like this:
> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>
> TokenStream stream = analyzer.tokenStream("title", new
> StringReader(fullQuery.toString()));
> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
> stream.reset();
> BooleanQuery q = new BooleanQuery();
> while(stream.incrementToken()) {
>   q.addClause(new BooleanClause(Occur.MUST, new Term("title",
>     termAttr.toString())));
> }
>
> you also have access to the token positions if you want to create
> phrase queries etc. just add a PositionIncrementAttribute like this:
> PositionIncrementAttribute posAttr =
>   stream.addAttribute(PositionIncrementAttribute.class);
>
> pls. doublecheck the code, it's straight from the top of my head.
> simon
>
>> Like I said, this seems to kind of work but it doesn't feel right. Does
>> this make sense? Is there a better way?
>>
>> thanks in advance,
>>
>> Bill
Re: Analyzer on query question
Bill, the simple answer to your original question is that in general you should apply the same or similar analysis for your query terms as you do with your indexed data. In your specific case the Query.toString is generating your unanalyzed terms and then the query parser is performing the needed analysis. The real point is that you should be doing the term analysis before invoking "new Term".

Alas, term analysis has changed dramatically over the past couple of years, so the solution to doing analysis before generating a Term/TermQuery will vary from Lucene release to release. We really do need a wiki page for Lucene term analysis.

-- Jack Krupansky

-Original Message-
From: Bill Chesky
Sent: Friday, August 03, 2012 9:19 AM
To: simon.willna...@gmail.com ; java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to have been introduced until 3.1.0. Similarly, my version of Lucene does not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant BooleanQuery.add(BooleanClause).

In any case, most of what you're doing there I'm just not familiar with. It seems very low level. I've never had to use TokenStreams to build a query before and I'm not really sure what is going on there. Also, I don't know what PositionIncrementAttribute is or how it would be used to create a PhraseQuery. The way I'm currently creating PhraseQuerys is very straightforward and intuitive. E.g. to search for the term "foo bar" I'd build the query like this:

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("title", "foo"));
phraseQuery.add(new Term("title", "bar"));

Is there really no easier way to associate the correct analyzer with these types of queries?
Bill -Original Message- From: Simon Willnauer [mailto:simon.willna...@gmail.com] Sent: Friday, August 03, 2012 3:43 AM To: java-user@lucene.apache.org; Bill Chesky Subject: Re: Analyzer on query question On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky wrote: Hi, I understand that generally speaking you should use the same analyzer on querying as was used on indexing. In my code I am using the SnowballAnalyzer on index creation. However, on the query side I am building up a complex BooleanQuery from other BooleanQuerys and/or PhraseQuerys on several fields. None of these require specifying an analyzer anywhere. This is causing some odd results, I think, because a different analyzer (or no analyzer?) is being used for the query. Question: how do I build my boolean and phrase queries using the SnowballAnalyzer? One thing I did that seemed to kind of work was to build my complex query normally then build a snowball-analyzed query using a QueryParser instantiated with a SnowballAnalyzer. To do this, I simply pass the string value of the complex query to the QueryParser.parse() method to get the new query. 
Something like this:

// build a complex query from other BooleanQuerys and PhraseQuerys
BooleanQuery fullQuery = buildComplexQuery();
QueryParser parser = new QueryParser(Version.LUCENE_30, "title",
    new SnowballAnalyzer(Version.LUCENE_30, "English"));
Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());

TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title",
    new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create phrase queries etc. just add a PositionIncrementAttribute like this:

PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code, it's straight from the top of my head.

simon

Like I said, this seems to kind of work but it doesn't feel right. Does this make sense? Is there a better way?

thanks in advance,

Bill
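For Bill's Lucene 3.0.1, Jack's "analyze before new Term" advice could be sketched as follows. In 3.0.x the term text lives in TermAttribute (CharTermAttribute replaced it in 3.1), so this is a hedged sketch with an illustrative field name and phrase:

```java
// Analyze the phrase with the same SnowballAnalyzer used at index time,
// then build the PhraseQuery from the analyzed (stemmed) terms, so the
// query terms match what is actually stored in the index.
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
TokenStream stream = analyzer.tokenStream("title", new StringReader("cells combine"));
TermAttribute termAttr = stream.addAttribute(TermAttribute.class);

PhraseQuery phraseQuery = new PhraseQuery();
while (stream.incrementToken()) {
  // termAttr.term() is the 3.0.x accessor for the analyzed token text
  phraseQuery.add(new Term("title", termAttr.term()));
}
stream.close();
```

This keeps Bill's intuitive PhraseQuery style but feeds it analyzed terms instead of raw text.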
Problem with near realtime search
I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a persistent map. I am entering 38000 documents at a rate of 1000/s into the index. Because each item add may actually be an update, I have a sequence of read/change/write for each of the documents.

All goes well until, just after writing the last item, I run a query that retrieves about 16000 documents. All docids are collected in a Collector, and, yes, I make sure to rebase the docIds. Then I iterate over all docIds found and retrieve the documents basically like this:

for (int docId : docIds) {
  Document d = getSearcher().doc(docId);
  ..
}

where getSearcher() uses IndexReader.openIfChanged() to always get the most current searcher and makes sure to eventually close the old searcher.

At document 15940 I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=1)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the exception in one of two ways, both of which I don't like:

1) Put a Thread.sleep(1000) just before running the query + document retrieval part.

2) Use the same IndexSearcher to retrieve all documents instead of calling getSearcher() for each document retrieval.

This is just a single-threaded test program. I only see Lucene merge threads in jvisualvm besides the main thread. A breakpoint on the exception shows that org.apache.lucene.index.DirectoryReader.document does seem to have wrong segments, which triggers the exception.

Since Lucene 3.6.1 has been in productive use for some time, I doubt it is a bug in Lucene, but I don't see what I am doing wrong. It might be connected to trying to get the freshest IndexReader for retrieving documents.

Any better ideas or explanations?

Harald.
--
Harald Kirsch
Re: Problem with near realtime search
hey harald,

if you use a possibly different searcher (reader) than the one you used for the search, you will run into problems with the doc IDs, since they may change between requests. I suggest you use SearcherManager or NRTManager and carry the searcher reference along when you collect the stored values. Just keep around the searcher you used, and NRTManager / SearcherManager will do the job for you.

simon

On Fri, Aug 3, 2012 at 3:41 PM, Harald Kirsch wrote:
> I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a
> persistent map. I am entering 38000 documents at a rate of 1000/s to the
> index. Because each item add may be actually an update, I have a sequence of
> read/change/write for each of the documents.
>
> All goes well until when just after writing the last item, I run a query
> that retrieves about 16000 documents. All docids are collected in a
> Collector, and, yes, I make sure to rebase the docIds. Then I iterate over
> all docIds found and retrieve the documents basically like this:
>
> for(int docId : docIds) {
>   Document d = getSearcher().doc(docId);
>   ..
> }
>
> where getSearcher() uses IndexReader.openIfChanged() to always get the most
> current searcher and makes sure to eventually close the old searcher.
>
> At document 15940 I get an exception like this:
>
> Exception in thread "main" java.lang.IllegalArgumentException: docID must be
> >= 0 and < maxDoc=1 (got docID=1)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
>   at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
>   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
>
> I can get rid of the Exception by one of two ways that I both don't like:
>
> 1) Put a Thread.sleep(1000) just before running the query+document retrieval
> part.
>
> 2) Use the same IndexSearcher to retrieve all documents instead of calling
> getSearcher for each document retrieval.
> This is just a test single threaded test program. I only see Lucene Merge
> threads in jvisualvm besides the main thread. A breakpoint on the exception
> shows that org.apache.lucene.index.DirectoryReader.document does seem to
> have wrong segments, which triggers the Exception.
>
> Since Lucene 3.6.1 is in productive use for some time I doubt it is a bug in
> Lucene, but I don't see what I am doing wrong. It might be connected to
> trying to get the freshest IndexReader for retrieving documents.
>
> Any better ideas or explanations?
>
> Harald.
>
> --
> Harald Kirsch
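Simon's suggestion, sketched for the Lucene 3.6-era SearcherManager (constructor arguments changed between 3.5 and 3.6, so treat the setup as an assumption and check the javadoc for your release). The crucial point is to acquire one searcher and use that same searcher for both the search and every doc() call:

```java
// Acquire ONE searcher and keep it for the whole search + retrieval pass;
// docIDs from a search are only valid against the reader that produced them.
searcherManager.maybeReopen(); // pick up recent index changes (renamed maybeRefresh() in 4.x)
IndexSearcher searcher = searcherManager.acquire();
try {
  TopDocs hits = searcher.search(query, 20000);
  for (ScoreDoc sd : hits.scoreDocs) {
    Document d = searcher.doc(sd.doc); // same searcher => docIDs stay consistent
    // ... read stored fields ...
  }
} finally {
  searcherManager.release(searcher); // never close it yourself; the manager refcounts
}
```

This removes the need for both the Thread.sleep() workaround and the ad-hoc openIfChanged() logic, since the manager handles reopening and reference counting.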
RE: Analyzer on query question
Jack,

Thanks. Yeah, I don't know what you mean by term analysis. I googled it but didn't come up with much. So if that is the preferred way of doing this, a wiki document would be greatly appreciated.

I notice you did say I should be doing the term analysis first. But is it wrong to do it the way I described in my original email? Will it give me incorrect results?

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 9:33 AM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the simple answer to your original question is that in general you should apply the same or similar analysis for your query terms as you do with your indexed data. In your specific case the Query.toString is generating your unanalyzed terms and then the query parser is performing the needed analysis. The real point is that you should be doing the term analysis before invoking "new Term".

Alas, term analysis has changed dramatically over the past couple of years, so the solution to doing analysis before generating a Term/TermQuery will vary from Lucene release to release. We really do need a wiki page for Lucene term analysis.

-- Jack Krupansky

-Original Message-
From: Bill Chesky
Sent: Friday, August 03, 2012 9:19 AM
To: simon.willna...@gmail.com ; java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to have been introduced until 3.1.0. Similarly my version of Lucene does not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant BooleanQuery.add(BooleanClause).

In any case, most of what you're doing there, I'm just not familiar with. Seems very low level. I've never had to use TokenStreams to build a query before and I'm not really sure what is going on there. Also, I don't know what PositionIncrementAttribute is or how it would be used to create a PhraseQuery.
The way I'm currently creating PhraseQuerys is very straightforward and intuitive. E.g. to search for the term "foo bar" I'd build the query like this: PhraseQuery phraseQuery = new PhraseQuery(); phraseQuery.add(new Term("title", "foo")); phraseQuery.add(new Term("title", "bar")); Is there really no easier way to associate the correct analyzer with these types of queries? Bill -Original Message- From: Simon Willnauer [mailto:simon.willna...@gmail.com] Sent: Friday, August 03, 2012 3:43 AM To: java-user@lucene.apache.org; Bill Chesky Subject: Re: Analyzer on query question On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky wrote: > Hi, > > I understand that generally speaking you should use the same analyzer on > querying as was used on indexing. In my code I am using the > SnowballAnalyzer on index creation. However, on the query side I am > building up a complex BooleanQuery from other BooleanQuerys and/or > PhraseQuerys on several fields. None of these require specifying an > analyzer anywhere. This is causing some odd results, I think, because a > different analyzer (or no analyzer?) is being used for the query. > > Question: how do I build my boolean and phrase queries using the > SnowballAnalyzer? > > One thing I did that seemed to kind of work was to build my complex query > normally then build a snowball-analyzed query using a QueryParser > instantiated with a SnowballAnalyzer. To do this, I simply pass the > string value of the complex query to the QueryParser.parse() method to get > the new query. 
Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new
> SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title",
    new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create phrase queries etc. just add a PositionIncrementAttribute like this:

PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code, it's straight from the top of my head.

simon

> Like I said, this seems to kind of work but it doesn't feel right. Does
> this make sense? Is there a better way?
>
> thanks in advance,
>
> Bill
RE: Analyzer on query question
Ian, I gave this method a try, at least the way I understood your suggestion. E.g. to search for the phrase "cells combine" I built up a string like: title:"cells combine" description:"cells combine" text:"cells combine" then I passed that to the queryParser.parse() method (where queryParser is an instance of QueryParser constructed using SnowballAnalyzer) and added the result as a MUST clause in my final BooleanQuery. When I print the resulting query out as a string I get: +(title:"cell combin" description:"cell combin" keywords:"cell combin") So it looks like the SnowballAnalyzer is doing some stemming for me. But this is the exact same result I'd get doing it the way I described in my original email. I just built the unanalyzed string on my own rather than using the various query classes like PhraseQuery, etc. So I don't see the advantage to doing it this way over the original method. I just don't know if the original way I described is wrong or will give me bad results. thanks for the help, Bill -Original Message- From: Ian Lea [mailto:ian@gmail.com] Sent: Friday, August 03, 2012 9:32 AM To: java-user@lucene.apache.org Subject: Re: Analyzer on query question You can add parsed queries to a BooleanQuery. Would that help in this case? SnowballAnalyzer sba = whatever(); QueryParser qp = new QueryParser(..., sba); Query q1 = qp.parse("some snowball string"); Query q2 = qp.parse("some other snowball string"); BooleanQuery bq = new BooleanQuery(); bq.add(q1, ...); bq.add(q2, ...); bq.add(loads of other stuff); -- ian. On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky wrote: > Thanks Simon, > > Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to > have been introduced until 3.1.0. Similarly my version of Lucene does not > have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant > BooleanQuery.add(BooleanClause). > > In any case, most of what you're doing there, I'm just not familiar with. > Seems very low level. 
I've never had to use TokenStreams to build a query > before and I'm not really sure what is going on there. Also, I don't know > what PositionIncrementAttribute is or how it would be used to create a > PhraseQuery. The way I'm currently creating PhraseQuerys is very > straightforward and intuitive. E.g. to search for the term "foo bar" I'd > build the query like this: > > PhraseQuery phraseQuery = new > PhraseQuery(); > phraseQuery.add(new > Term("title", "foo")); > phraseQuery.add(new > Term("title", "bar")); > > Is there really no easier way to associate the correct analyzer with these > types of queries? > > Bill > > -Original Message- > From: Simon Willnauer [mailto:simon.willna...@gmail.com] > Sent: Friday, August 03, 2012 3:43 AM > To: java-user@lucene.apache.org; Bill Chesky > Subject: Re: Analyzer on query question > > On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky > wrote: >> Hi, >> >> I understand that generally speaking you should use the same analyzer on >> querying as was used on indexing. In my code I am using the >> SnowballAnalyzer on index creation. However, on the query side I am >> building up a complex BooleanQuery from other BooleanQuerys and/or >> PhraseQuerys on several fields. None of these require specifying an >> analyzer anywhere. This is causing some odd results, I think, because a >> different analyzer (or no analyzer?) is being used for the query. >> >> Question: how do I build my boolean and phrase queries using the >> SnowballAnalyzer? >> >> One thing I did that seemed to kind of work was to build my complex query >> normally then build a snowball-analyzed query using a QueryParser >> instantiated with a SnowballAnalyzer. To do this, I simply pass the string >> value of the complex query to the QueryParser.parse() method to get the new >> query. 
Something like this:
>>
>> // build a complex query from other BooleanQuerys and PhraseQuerys
>> BooleanQuery fullQuery = buildComplexQuery();
>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new
>> SnowballAnalyzer(Version.LUCENE_30, "English"));
>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>
>> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
>> indexSearcher.search(snowballAnalyzedQuery, collector);
>
> you can just use the analyzer directly like this:
> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>
> TokenStream stream = analyzer.tokenStream("title", new
> StringReader(fullQuery.toString()));
> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
> stream.reset();
> BooleanQuery q = new BooleanQuery();
> while(stream.incrementToken()) {
> q.addClause(new BooleanClause(Occur.MUST, new Term("title",
> termAttr.toString
Re: Analyzer on query question
Bill

You're getting the snowball stemming either way which I guess is good, and if you get same results either way maybe it doesn't matter which technique you use. I'd be a bit worried about parsing the result of query.toString() because you aren't guaranteed to get back, in text, what you put in.

My way seems better to me, but then it would. If you prefer your way I won't argue with you.

--
Ian.

On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion. E.g. to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"
>
> then I passed that to the queryParser.parse() method (where queryParser is an instance of QueryParser constructed using SnowballAnalyzer) and added the result as a MUST clause in my final BooleanQuery.
>
> When I print the resulting query out as a string I get:
>
> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>
> So it looks like the SnowballAnalyzer is doing some stemming for me. But this is the exact same result I'd get doing it the way I described in my original email. I just built the unanalyzed string on my own rather than using the various query classes like PhraseQuery, etc.
>
> So I don't see the advantage to doing it this way over the original method. I just don't know if the original way I described is wrong or will give me bad results.
>
> thanks for the help,
>
> Bill
>
> -Original Message-
> From: Ian Lea [mailto:ian@gmail.com]
> Sent: Friday, August 03, 2012 9:32 AM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer on query question
>
> You can add parsed queries to a BooleanQuery. Would that help in this case?
>
> SnowballAnalyzer sba = whatever();
> QueryParser qp = new QueryParser(..., sba);
> Query q1 = qp.parse("some snowball string");
> Query q2 = qp.parse("some other snowball string");
>
> BooleanQuery bq = new BooleanQuery();
> bq.add(q1, ...);
> bq.add(q2, ...);
> bq.add(loads of other stuff);
>
> --
> ian.
>
> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky wrote:
>> Thanks Simon,
>>
>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to have been introduced until 3.1.0. Similarly my version of Lucene does not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant BooleanQuery.add(BooleanClause).
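Simon's earlier hint about PositionIncrementAttribute, taken literally, could be used to build an analyzed PhraseQuery along these lines (a sketch against the Lucene 3.x API, not compiled; field name "title" and input "foo bar" are just examples):

```java
// Run the phrase text through the index-time analyzer, preserving positions,
// so stemmed terms land at the right slots in the PhraseQuery.
TokenStream stream = analyzer.tokenStream("title", new StringReader("foo bar"));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
PhraseQuery phrase = new PhraseQuery();
int position = -1;
while (stream.incrementToken()) {
    position += posAttr.getPositionIncrement(); // accounts for gaps left by stopwords
    phrase.add(new Term("title", termAttr.toString()), position);
}
stream.close();
```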
Re: Analyzer on query question
Bill, the re-parse of Query.toString will work provided that your query terms are either un-analyzed or their analyzer is "idempotent" (can be applied repeatedly without changing the output terms). In your case, you are doing the former.

The bottom line: 1) if it works for you, great; 2) for other readers, please do not depend on this approach if your input data is filtered in any way. If your index analyzer "filters" terms (e.g., stemming, case changes, term-splitting), your Term/TermQuery should be analyzed/filtered comparably, in which case the extra parse (to cause term analysis such as stemming) becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-Original Message-
From: Ian Lea
Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question
RE: Analyzer on query question
Ian/Jack,

Ok, thanks for the help. I certainly don't want to take a cheap way out, hence my original question about whether this is the right way to do this. Jack, you say the right way is to do Term analysis before creating the Term. If anybody has any information on how to accomplish this I'd greatly appreciate it.

regards,

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question
Re: Analyzer on query question
Simon gave sample code for analyzing a multi-term string.

Here's some pseudo-code (hasn't been compiled to check it) to analyze a single term with Lucene 3.6:

public Term analyzeTerm(Analyzer analyzer, String field, String termString) {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
    if (stream.incrementToken())
        return new Term(field, stream.getAttribute(CharTermAttribute.class).toString());
    else
        return null;
    // TODO: Close the StringReader
    // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

And here's the corresponding code for Lucene 4.0:

public Term analyzeTerm(Analyzer analyzer, String field, String termString) {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
    if (stream.incrementToken()) {
        TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
        BytesRef bytes = termAtt.getBytesRef();
        return new Term(field, BytesRef.deepCopyOf(bytes));
    } else
        return null;
    // TODO: Close the StringReader
    // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

-- Jack Krupansky

-Original Message-
From: Bill Chesky
Sent: Friday, August 03, 2012 2:55 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question
Re: Analyzer on query question
you must call reset() before consuming any tokenstream.

On Fri, Aug 3, 2012 at 4:03 PM, Jack Krupansky wrote:
> Simon gave sample code for analyzing a multi-term string.
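Folding Robert's reset() correction into Jack's 3.6 sketch gives something like the following (still uncompiled pseudo-code; the `field` parameter is added here so the Term constructor has a field name):

```java
public Term analyzeTerm(Analyzer analyzer, String field, String termString) throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
    try {
        stream.reset(); // per Robert: reset() before consuming the stream
        if (stream.incrementToken()) {
            return new Term(field, stream.getAttribute(CharTermAttribute.class).toString());
        }
        return null; // TODO: handle terms that analyze into multiple terms
    } finally {
        stream.close(); // also releases the underlying StringReader
    }
}
```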
Re: Analyzer on query question
I still don't see what Bill gains by doing the term analysis himself rather than letting QueryParser do the hard work, in a portable non-lucene-version-specific way.

--
Ian.

On Fri, Aug 3, 2012 at 9:39 PM, Robert Muir wrote:
> you must call reset() before consuming any tokenstream.
RE: Analyzer on query question
Thanks for the help everybody. We're using 3.0.1 so I couldn't do exactly what Simon and Jack suggested. But after some searching around I came up with this method: private String analyze(String token) throws Exception { StringBuffer result = new StringBuffer(); Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English"); TokenStream tokenStream = analyzer.tokenStream("title", new StringReader(token)); tokenStream.reset(); TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class); while (tokenStream.incrementToken()) { if (result.length() > 0) { result.append(" "); } result.append(termAttribute.term()); } return result.toString(); } Now I just run my search term strings thru this method first like so: searchTerms = analyze(searchTerms); // now do what I was doing before to build queries... It's still not totally clear what this buys me since ultimately the query looks the same as what was being generated with my original method (perhaps this is Ian's point in his last reply). But I will defer to the gurus. It works. Thanks for all the help. Bill -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, August 03, 2012 4:03 PM To: java-user@lucene.apache.org Subject: Re: Analyzer on query question Simon gave sample code for analyzing a multi-term string. 
Here's some pseudo-code (hasn't been compiled to check it) to analyze a single term with Lucene 3.6: public Term analyzeTerm(Analyzer analyzer, String termString){ TokenStream stream = analyzer.tokenStream(field, new StringReader(termString)); if (stream.incrementToken()) return new Term(stream.getAttribute(CharacterTermAttribute.class).toString()); else return null; // TODO: Close the StringReader // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation) } And here's the corresponding code for Lucene 4.0: public Term analyzeTerm(Analyzer analyzer, String termString){ TokenStream stream = analyzer.tokenStream(field, new StringReader(termString)); if (stream.incrementToken()){ TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class); BytesRef bytes = termAtt.getBytesRef(); return new Term(BytesRef.deepCopyOf(bytes)); } else return null; // TODO: Close the StringReader // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation) } -- Jack Krupansky -Original Message- From: Bill Chesky Sent: Friday, August 03, 2012 2:55 PM To: java-user@lucene.apache.org Subject: RE: Analyzer on query question Ian/Jack, Ok, thanks for the help. I certainly don't want to take a cheap way out, hence my original question about whether this is the right way to do this. Jack, you say the right way is to do Term analysis before creating the Term. If anybody has any information on how to accomplish this I'd greatly appreciate it. regards, Bill -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, August 03, 2012 1:22 PM To: java-user@lucene.apache.org Subject: Re: Analyzer on query question Bill, the re-parse of Query.toString will work provided that your query terms are either un-analyzed or their analyzer is "idempotent" (can be applied repeatedly without changing the output terms.) In your case, you are doing the former. 
The bottom line: 1) if it works for you, great; 2) for other readers, please do not depend on this approach if your input data is filtered in any way. If your index analyzer "filters" terms (e.g., stemming, case changes, term-splitting), your Term/TermQuery should be analyzed/filtered comparably, in which case the extra parse (to cause term analysis such as stemming) becomes unnecessary, and risky if you are not very careful or very lucky.

-- Jack Krupansky

-----Original Message-----
From: Ian Lea
Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill

You're getting the snowball stemming either way, which I guess is good, and if you get the same results either way maybe it doesn't matter which technique you use. I'd be a bit worried about parsing the result of query.toString() because you aren't guaranteed to get back, in text, what you put in. My way seems better to me, but then it would. If you prefer your way I won't argue with you.

--
Ian.

On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion.
> E.g. to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"
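The pre-analysis step discussed in this thread can be sketched without Lucene at all. Below is a plain-Java toy (hypothetical `ToyAnalyze` class; lowercasing plus naive plural-stripping stands in for the real SnowballAnalyzer) that shows the shape of the idea: run query-side terms through the same normalization the index side used, before building terms or field-prefixed phrase strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Toy stand-in for "analyze the query terms first". NOT Lucene: a
 * lowercase + strip-trailing-'s' rule plays the role of SnowballAnalyzer,
 * only to illustrate that query-side terms must pass through the same
 * normalization as index-side terms before they can match.
 */
public class ToyAnalyze {
    // Naive normalizer: lowercase, then drop a trailing 's' (toy "stemmer").
    static String normalizeToken(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return t.endsWith("s") && t.length() > 3 ? t.substring(0, t.length() - 1) : t;
    }

    // Analyze a whitespace-separated phrase, as the analyze() method in this
    // thread does, re-joining the normalized tokens with single spaces.
    static String analyze(String phrase) {
        List<String> out = new ArrayList<>();
        for (String token : phrase.trim().split("\\s+")) {
            out.add(normalizeToken(token));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // "Cells Combine" normalized the same way at index and query time.
        System.out.println(analyze("Cells Combine")); // cell combine
    }
}
```

With a real SnowballAnalyzer the mechanics differ (TokenStream, attributes), but the contract is the same: normalize once, on both sides, before constructing the query.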
Re: Analyzer on query question
What it buys you is not having to convert the whole "complex" query to string form, which is not guaranteed to be reparseable for all queries (e.g., "AND" or "-abc" as raw terms would be treated as operators), and then parsing it, which will turn around and regenerate the same query structure (you hope). In theory, this will guarantee fidelity of the query and improve performance (the toString/parse round-trip is not cheap/free). As I said, the toString/reparse may indeed work for your specific use-case, but isn't quite ideal for general use.

-- Jack Krupansky

-----Original Message-----
From: Bill Chesky
Sent: Friday, August 03, 2012 5:35 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question
-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query terms are either un-analyzed or their analyzer is "idempotent" (can be applied repeatedly without changing the output terms). In your case, you are doing the former.

The bottom line: 1) if it works for you, great; 2) for other readers, please do not depend on this approach if your input data is filtered in any way. If your index analyzer "filters" terms (e.g., stemming, case changes, term-splitting), your Term/TermQuery should be analyzed/filtered comparably, in which case the extra parse (to cause term analysis such as stemming) becomes unnecessary, and risky if you are not very careful or very lucky.

-- Jack Krupansky
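Jack's "idempotent" caveat can be made concrete outside Lucene. Re-parsing Query.toString() effectively runs query-side analysis a second time, which is only safe when analyze(analyze(x)) == analyze(x). The sketch below (hypothetical `Idempotence` class; the filters are contrived, not real Lucene filters) checks that property for a filter that has it and one that doesn't.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/**
 * Illustration of the idempotence condition behind the toString()/reparse
 * discussion: re-parsing applies analysis again, so the analyzer must
 * satisfy f(f(x)) == f(x) for the round-trip to be harmless.
 */
public class Idempotence {
    // Lowercasing is idempotent: a second pass changes nothing.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);

    // Contrived non-idempotent filter: strips one trailing vowel per pass,
    // so each extra application can shorten the term further.
    static final UnaryOperator<String> STRIP_VOWEL = s ->
        !s.isEmpty() && "aeiou".indexOf(s.charAt(s.length() - 1)) >= 0
            ? s.substring(0, s.length() - 1) : s;

    // Does applying f a second time leave the first result unchanged?
    static boolean idempotentOn(UnaryOperator<String> f, String term) {
        String once = f.apply(term);
        return f.apply(once).equals(once);
    }

    public static void main(String[] args) {
        System.out.println(idempotentOn(LOWERCASE, "Cells"));   // true
        System.out.println(idempotentOn(STRIP_VOWEL, "idea"));  // false: idea -> ide -> id
    }
}
```

With a filter like STRIP_VOWEL, the reparsed query would search for terms the index never contains, which is exactly the risk Jack describes.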
Re: Problem with near realtime search
Hello Simon,

thanks for the information. I really thought that once a docId is assigned it is kept until the document is deleted. The only problem I would have expected is docIds that no longer refer to a document because it was deleted in the meantime. But this is clearly not the case in my setup. If docIds change during index rearrangement, then this would of course completely explain the symptoms I saw.

So docIds can definitively change under the hood?

Harald.

On 03.08.2012 17:24, Simon Willnauer wrote:

hey harald,

if you use a possibly different searcher (reader) than you used for the search you will run into problems with the doc IDs since they might change during the request. I suggest you use SearcherManager or NRTManager and carry the searcher reference along when you collect the stored values. Just keep around the searcher you used and NRTManager / SearcherManager will do the job for you.

simon

On Fri, Aug 3, 2012 at 3:41 PM, Harald Kirsch wrote:

I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a persistent map. I am entering 38000 documents at a rate of 1000/s to the index. Because each item add may actually be an update, I have a sequence of read/change/write for each of the documents.

All goes well until, just after writing the last item, I run a query that retrieves about 16000 documents. All docIds are collected in a Collector and, yes, I make sure to rebase the docIds. Then I iterate over all docIds found and retrieve the documents basically like this:

    for (int docId : docIds) {
      Document d = getSearcher().doc(docId);
      ...
    }

where getSearcher() uses IndexReader.openIfChanged() to always get the most current searcher and makes sure to eventually close the old searcher.
At document 15940 I get an exception like this:

    Exception in thread "main" java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=1)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
        at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
        at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the exception in one of two ways, both of which I don't like:

1) Put a Thread.sleep(1000) just before running the query+document retrieval part.

2) Use the same IndexSearcher to retrieve all documents instead of calling getSearcher() for each document retrieval.

This is just a single-threaded test program. I only see Lucene merge threads in jvisualvm besides the main thread. A breakpoint on the exception shows that org.apache.lucene.index.DirectoryReader.document does seem to have wrong segments, which triggers the exception. Since Lucene 3.6.1 has been in productive use for some time I doubt it is a bug in Lucene, but I don't see what I am doing wrong. It might be connected to trying to get the freshest IndexReader for retrieving documents.

Any better ideas or explanations?

Harald.

--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
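Simon's SearcherManager suggestion boils down to a reference-counting discipline: acquire one searcher per request, use that same instance for both collecting docIds and loading stored documents, and release it only afterwards. The sketch below (hypothetical `Manager`/`Handle` classes, not the Lucene 3.x/4.x API) models only that discipline, with an integer "generation" standing in for a point-in-time reader.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Plain-Java model of the acquire/use/release pattern behind
 * SearcherManager: a refresh publishes a new view, but a request that
 * pinned the old view keeps seeing it (and its docID numbering) until
 * the request releases it.
 */
public class Manager {
    static final class Handle {
        final int generation;                      // point-in-time view
        final AtomicInteger refs = new AtomicInteger(1);
        Handle(int generation) { this.generation = generation; }
    }

    private volatile Handle current = new Handle(0);

    Handle acquire() {                             // pin for one request
        Handle h = current;
        h.refs.incrementAndGet();
        return h;
    }

    void release(Handle h) { h.refs.decrementAndGet(); }

    void maybeRefresh() {                          // publish a newer view
        current = new Handle(current.generation + 1);
    }

    int currentGeneration() { return current.generation; }

    public static void main(String[] args) {
        Manager m = new Manager();
        Handle request = m.acquire();              // search phase begins
        m.maybeRefresh();                          // index changed meanwhile
        // fetch phase still sees the generation it searched against:
        System.out.println(request.generation);    // 0
        m.release(request);
        System.out.println(m.currentGeneration()); // 1
    }
}
```

Calling getSearcher() per document, as in the code above, is the opposite of this: every fetch may land on a newer generation whose docIds no longer line up with the ones collected.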
Re: Problem with near realtime search
Hello Simon,

now that I knew what to search for I found
http://wiki.apache.org/lucene-java/LuceneFAQ#When_is_it_possible_for_document_IDs_to_change.3F
So that clearly explains this issue for me.

Many thanks for your help.

Harald

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
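The renumbering the FAQ describes can be simulated in a few lines of plain Java (hypothetical `DocIdShift` class, no Lucene involved): when a merge expunges deleted documents, the survivors are compacted, so an ID collected against the old view can point at a different document, or past maxDoc, in the new one.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simulation of docIDs changing across a merge: a docID is simply a
 * position in a segment's doc list, so expunging deletions renumbers
 * every document after the deleted ones.
 */
public class DocIdShift {
    // "Merge" a segment by dropping deleted slots; survivors are renumbered.
    static List<String> merge(List<String> docs, List<Integer> deleted) {
        List<String> merged = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id)) merged.add(docs.get(id));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "b", "c");
        int docIdOfC = 2;                             // collected before the merge
        List<String> after = merge(docs, List.of(1)); // "b" deleted, then merged away
        System.out.println(after.get(1));             // "c" now lives at docID 1
        // after.get(docIdOfC) would throw IndexOutOfBoundsException: maxDoc shrank,
        // which mirrors the "docID must be >= 0 and < maxDoc" exception above.
    }
}
```

This is why holding docIds across a reader reopen, as in the thread above, only works by luck until the first merge lands in between.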