Re: Analyzer on query question

2012-08-03 Thread Simon Willnauer
On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
 wrote:
> Hi,
>
> I understand that generally speaking you should use the same analyzer on 
> querying as was used on indexing.  In my code I am using the SnowballAnalyzer 
> on index creation.  However, on the query side I am building up a complex 
> BooleanQuery from other BooleanQuerys and/or PhraseQuerys on several fields.  
> None of these require specifying an analyzer anywhere.  This is causing some 
> odd results, I think, because a different analyzer (or no analyzer?) is being 
> used for the query.
>
> Question: how do I build my boolean and phrase queries using the 
> SnowballAnalyzer?
>
> One thing I did that seemed to kind of work was to build my complex query 
> normally then build a snowball-analyzed query using a QueryParser 
> instantiated with a SnowballAnalyzer.  To do this, I simply pass the string 
> value of the complex query to the QueryParser.parse() method to get the new 
> query.  Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
> SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title", new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create
phrase queries etc. just add a PositionIncrementAttribute like this:
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code it's straight from the top of my head.

simon
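
For reference, a minimal sketch of what Simon describes with PositionIncrementAttribute
might look like the following (illustrative, not from the original message; it uses
CharTermAttribute, which requires Lucene 3.1+, while on 3.0.x the TermAttribute/term()
pair plays the same role):

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
TokenStream stream = analyzer.tokenStream("title", new StringReader("cells combine"));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();

PhraseQuery phrase = new PhraseQuery();
int position = -1;
while (stream.incrementToken()) {
  position += posAttr.getPositionIncrement(); // honors gaps left by stopword removal
  phrase.add(new Term("title", termAttr.toString()), position);
}
stream.end();
stream.close();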

>
> Like I said, this seems to kind of work but it doesn't feel right.  Does this 
> make sense?  Is there a better way?
>
> thanks in advance,
>
> Bill

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ToParentBlockJoinQuery - Faceting on Parent and Child Documents

2012-08-03 Thread Martijn v Groningen
Hi Jayendra,

This isn't supported yet. You could implement this by creating a
custom Lucene collector.
This collector could count the unique hits inside a block of docs per
unique facet field value. The
unique facet values could be retrieved from Lucene's FieldCache or doc
values (if you can use Lucene 4.0
in your project).

In general I think this would be a cool addition!

Martijn
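
For readers who want to try this, a very rough sketch of such a collector on the
Lucene 3.x API is shown below. The class and field names are illustrative, and it
only counts plain hits per facet value; counting each parent block only once, as
Martijn describes, would need extra bookkeeping on top of this.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Hypothetical collector: tallies hits per value of a single-valued string
// field, reading the values from the FieldCache.
public class FacetCountCollector extends Collector {

  private final String facetField;
  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private String[] values; // per-segment field values from the FieldCache

  public FacetCountCollector(String facetField) {
    this.facetField = facetField;
  }

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed for counting
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    values = FieldCache.DEFAULT.getStrings(reader, facetField);
  }

  @Override
  public void collect(int doc) {
    String value = values[doc]; // doc is segment-relative, matching the cached array
    if (value == null) return;
    Integer old = counts.get(value);
    counts.put(value, old == null ? 1 : old + 1);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public Map<String, Integer> getCounts() {
    return counts;
  }
}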

On 25 July 2012 13:37, Jayendra Patil  wrote:
> Thanks Mike for the wonderful work on ToParentBlockJoinQuery.
>
> We had a use case for Relational data search and are working with
> ToParentBlockJoinQuery which works perfectly as mentioned @
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
>
> However, I couldn't find any examples on net or even in the JUnit
> testcases to use Faceting on the Parent or the Child results.
>
> Is it supported as yet ??? Can you provide us with any examples ??
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ToParentBlockJoinQuery - Faceting on Parent and Child Documents

2012-08-03 Thread Christoph Kaser

Hi Jayendra,

We use faceting and block join queries on Lucene 3.6 like this:

- Create the FacetsCollector
- For faceting on parent documents, use ToParentBlockJoinQuery; for faceting 
on children, ToChildBlockJoinQuery (if needed, add additional query clauses 
using a BooleanQuery)

- Use searcher.search(query, null, facetsCollector)

This seems to work fine.

Best regards,
Christoph Kaser
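
A rough sketch of these steps on Lucene 3.6 (facet and join modules) might look
like the following; the "docType" parent marker, the field and category names, and
the searcher/indexReader/taxonomyReader variables are illustrative, and the facet
API signatures changed in later releases, so double-check them against your version:

Filter parentsFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("docType", "parent"))));
Query childQuery = new TermQuery(new Term("skill", "java"));
Query blockJoinQuery = new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.Avg);

FacetSearchParams facetParams = new FacetSearchParams();
facetParams.addFacetRequest(new CountFacetRequest(new CategoryPath("country"), 10));
FacetsCollector facetsCollector = new FacetsCollector(facetParams, indexReader, taxonomyReader);

// The facets are computed over the parent hits returned by the block join query.
searcher.search(blockJoinQuery, null, facetsCollector);
List<FacetResult> facetResults = facetsCollector.getFacetResults();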

Am 03.08.2012 13:50, schrieb Martijn v Groningen:

Hi Jayendra,

This isn't supported yet. You could implement this by creating a
custom Lucene collector.
This collector could count the unique hits inside a block of docs per
unique facet field value. The
unique facet values could be retrieved from Lucene's FieldCache or doc
values (if you can use Lucene 4.0
in your project).

In general I think this would be a cool addition!

Martijn

On 25 July 2012 13:37, Jayendra Patil  wrote:

Thanks Mike for the wonderful work on ToParentBlockJoinQuery.

We had a use case for Relational data search and are working with
ToParentBlockJoinQuery which works perfectly as mentioned @
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

However, I couldn't find any examples on net or even in the JUnit
testcases to use Faceting on the Parent or the Child results.

Is it supported as yet ??? Can you provide us with any examples ??

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Analyzer on query question

2012-08-03 Thread Bill Chesky
Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
have been introduced until 3.1.0.  Similarly my version of Lucene does not have 
a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
BooleanQuery.add(BooleanClause).

In any case, most of what you're doing there, I'm just not familiar with.  
Seems very low level.  I've never had to use TokenStreams to build a query 
before and I'm not really sure what is going on there.  Also, I don't know what 
PositionIncrementAttribute is or how it would be used to create a PhraseQuery.  
 The way I'm currently creating PhraseQuerys is very straightforward and 
intuitive.  E.g. to search for the term "foo bar" I'd build the query like this:

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("title", "foo"));
phraseQuery.add(new Term("title", "bar"));

Is there really no easier way to associate the correct analyzer with these 
types of queries?

Bill

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@gmail.com] 
Sent: Friday, August 03, 2012 3:43 AM
To: java-user@lucene.apache.org; Bill Chesky
Subject: Re: Analyzer on query question

On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
 wrote:
> Hi,
>
> I understand that generally speaking you should use the same analyzer on 
> querying as was used on indexing.  In my code I am using the SnowballAnalyzer 
> on index creation.  However, on the query side I am building up a complex 
> BooleanQuery from other BooleanQuerys and/or PhraseQuerys on several fields.  
> None of these require specifying an analyzer anywhere.  This is causing some 
> odd results, I think, because a different analyzer (or no analyzer?) is being 
> used for the query.
>
> Question: how do I build my boolean and phrase queries using the 
> SnowballAnalyzer?
>
> One thing I did that seemed to kind of work was to build my complex query 
> normally then build a snowball-analyzed query using a QueryParser 
> instantiated with a SnowballAnalyzer.  To do this, I simply pass the string 
> value of the complex query to the QueryParser.parse() method to get the new 
> query.  Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
> SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title", new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create
phrase queries etc. just add a PositionIncrementAttribute like this:
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code it's straight from the top of my head.

simon

>
> Like I said, this seems to kind of work but it doesn't feel right.  Does this 
> make sense?  Is there a better way?
>
> thanks in advance,
>
> Bill



Re: Analyzer on query question

2012-08-03 Thread Ian Lea
You can add parsed queries to a BooleanQuery.  Would that help in this case?

SnowballAnalyzer sba = whatever();
QueryParser qp = new QueryParser(..., sba);
Query q1 = qp.parse("some snowball string");
Query q2 = qp.parse("some other snowball string");

BooleanQuery bq = new BooleanQuery();
bq.add(q1, ...);
bq.add(q2, ...);
bq.add(loads of other stuff);


--
ian.


On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky  wrote:
> Thanks Simon,
>
> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
> have been introduced until 3.1.0.  Similarly my version of Lucene does not 
> have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
> BooleanQuery.add(BooleanClause).
>
> In any case, most of what you're doing there, I'm just not familiar with.  
> Seems very low level.  I've never had to use TokenStreams to build a query 
> before and I'm not really sure what is going on there.  Also, I don't know 
> what PositionIncrementAttribute is or how it would be used to create a 
> PhraseQuery.   The way I'm currently creating PhraseQuerys is very 
> straightforward and intuitive.  E.g. to search for the term "foo bar" I'd 
> build the query like this:
>
> PhraseQuery phraseQuery = new 
> PhraseQuery();
> phraseQuery.add(new 
> Term("title", "foo"));
> phraseQuery.add(new 
> Term("title", "bar"));
>
> Is there really no easier way to associate the correct analyzer with these 
> types of queries?
>
> Bill
>
> -Original Message-
> From: Simon Willnauer [mailto:simon.willna...@gmail.com]
> Sent: Friday, August 03, 2012 3:43 AM
> To: java-user@lucene.apache.org; Bill Chesky
> Subject: Re: Analyzer on query question
>
> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>  wrote:
>> Hi,
>>
>> I understand that generally speaking you should use the same analyzer on 
>> querying as was used on indexing.  In my code I am using the 
>> SnowballAnalyzer on index creation.  However, on the query side I am 
>> building up a complex BooleanQuery from other BooleanQuerys and/or 
>> PhraseQuerys on several fields.  None of these require specifying an 
>> analyzer anywhere.  This is causing some odd results, I think, because a 
>> different analyzer (or no analyzer?) is being used for the query.
>>
>> Question: how do I build my boolean and phrase queries using the 
>> SnowballAnalyzer?
>>
>> One thing I did that seemed to kind of work was to build my complex query 
>> normally then build a snowball-analyzed query using a QueryParser 
>> instantiated with a SnowballAnalyzer.  To do this, I simply pass the string 
>> value of the complex query to the QueryParser.parse() method to get the new 
>> query.  Something like this:
>>
>> // build a complex query from other BooleanQuerys and PhraseQuerys
>> BooleanQuery fullQuery = buildComplexQuery();
>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
>> SnowballAnalyzer(Version.LUCENE_30, "English"));
>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>
>> TopScoreDocCollector collector = TopScoreDocCollector.create(1, 
>> true);
>> indexSearcher.search(snowballAnalyzedQuery, collector);
>
> you can just use the analyzer directly like this:
> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>
> TokenStream stream = analyzer.tokenStream("title", new StringReader(fullQuery.toString()));
> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
> stream.reset();
> BooleanQuery q = new BooleanQuery();
> while (stream.incrementToken()) {
>   q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
> }
>
> you also have access to the token positions if you want to create
> phrase queries etc. just add a PositionIncrementAttribute like this:
> PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);
>
> pls. doublecheck the code it's straight from the top of my head.
>
> simon
>
>>
>> Like I said, this seems to kind of work but it doesn't feel right.  Does 
>> this make sense?  Is there a better way?
>>
>> thanks in advance,
>>
>> Bill
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Analyzer on query question

2012-08-03 Thread Jack Krupansky
Bill, the simple answer to your original question is that in general you 
should apply the same or similar analysis for your query terms as you do 
with your indexed data. In your specific case the Query.toString is 
generating your unanalyzed terms and then the query parser is performing the 
needed analysis. The real point is that you should be doing the term analysis 
before invoking "new Term". Alas, term analysis has changed dramatically 
over the past couple of years, so the solution to doing analysis before 
generating a Term/TermQuery will vary from Lucene release to release.


We really do need a wiki page for Lucene term analysis.

-- Jack Krupansky

-Original Message- 
From: Bill Chesky

Sent: Friday, August 03, 2012 9:19 AM
To: simon.willna...@gmail.com ; java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
have been introduced until 3.1.0.  Similarly my version of Lucene does not 
have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
BooleanQuery.add(BooleanClause).


In any case, most of what you're doing there, I'm just not familiar with. 
Seems very low level.  I've never had to use TokenStreams to build a query 
before and I'm not really sure what is going on there.  Also, I don't know 
what PositionIncrementAttribute is or how it would be used to create a 
PhraseQuery.   The way I'm currently creating PhraseQuerys is very 
straightforward and intuitive.  E.g. to search for the term "foo bar" I'd 
build the query like this:


PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("title", "foo"));
phraseQuery.add(new Term("title", "bar"));

Is there really no easier way to associate the correct analyzer with these 
types of queries?


Bill

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@gmail.com]
Sent: Friday, August 03, 2012 3:43 AM
To: java-user@lucene.apache.org; Bill Chesky
Subject: Re: Analyzer on query question

On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
 wrote:

Hi,

I understand that generally speaking you should use the same analyzer on 
querying as was used on indexing.  In my code I am using the 
SnowballAnalyzer on index creation.  However, on the query side I am 
building up a complex BooleanQuery from other BooleanQuerys and/or 
PhraseQuerys on several fields.  None of these require specifying an 
analyzer anywhere.  This is causing some odd results, I think, because a 
different analyzer (or no analyzer?) is being used for the query.


Question: how do I build my boolean and phrase queries using the 
SnowballAnalyzer?


One thing I did that seemed to kind of work was to build my complex query 
normally then build a snowball-analyzed query using a QueryParser 
instantiated with a SnowballAnalyzer.  To do this, I simply pass the 
string value of the complex query to the QueryParser.parse() method to get 
the new query.  Something like this:


// build a complex query from other BooleanQuerys and PhraseQuerys
BooleanQuery fullQuery = buildComplexQuery();
QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
SnowballAnalyzer(Version.LUCENE_30, "English"));

Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());

TopScoreDocCollector collector = TopScoreDocCollector.create(1, 
true);

indexSearcher.search(snowballAnalyzedQuery, collector);


you can just use the analyzer directly like this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title", new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create
phrase queries etc. just add a PositionIncrementAttribute like this:
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code it's straight from the top of my head.

simon



Like I said, this seems to kind of work but it doesn't feel right.  Does 
this make sense?  Is there a better way?


thanks in advance,

Bill






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem with near realtime search

2012-08-03 Thread Harald Kirsch
I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a 
persistent map. I am entering 38000 documents at a rate of 1000/s to the 
index. Because each added item may actually be an update, I have a 
sequence of read/change/write for each of the documents.


All goes well until, just after writing the last item, I run a query 
that retrieves about 16000 documents. All docids are collected in a 
Collector, and, yes, I make sure to rebase the docIds. Then I iterate 
over all docIds found and retrieve the documents basically like this:


  for(int docId : docIds) {
Document d = getSearcher().doc(docId);
..
  }

where getSearcher() uses IndexReader.openIfChanged() to always get the 
most current searcher and makes sure to eventually close the old searcher.



At document 15940 I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: docID 
must be >= 0 and < maxDoc=1 (got docID=1)

at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
	at 
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)

at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the Exception by one of two ways that I both don't like:

1) Put a Thread.sleep(1000) just before running the query+document 
retrieval part.


2) Use the same IndexSearcher to retrieve all documents instead of 
calling getSearcher for each document retrieval.


This is just a single-threaded test program. I only see Lucene 
Merge threads in jvisualvm besides the main thread. A breakpoint on the 
exception shows that org.apache.lucene.index.DirectoryReader.document 
does seem to have wrong segments, which triggers the Exception.


Since Lucene 3.6.1 has been in production use for some time, I doubt it is a 
bug in Lucene, but I don't see what I am doing wrong. It might be 
connected to trying to get the freshest IndexReader for retrieving 
documents.


Any better ideas or explanations?

Harald.

--
Harald Kirsch


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with near realtime search

2012-08-03 Thread Simon Willnauer
hey harald,

if you use a possibly different searcher (reader) than the one you used
for the search, you will run into problems with the doc IDs, since they
might change between requests. I suggest you use SearcherManager
or NRTManager and carry the searcher reference along when you collect
the stored values. Just keep around the searcher you used and
NRTManager / SearcherManager will do the job for you.

simon
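
A minimal sketch of that pattern (illustrative, not from the original message),
assuming the Lucene 3.6 SearcherManager API with acquire/release and SearcherFactory
(on 3.5 the constructor takes a SearcherWarmer instead); "writer" and "query" stand
in for your own objects:

SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

IndexSearcher searcher = manager.acquire();   // pin one point-in-time view of the index
try {
  TopDocs hits = searcher.search(query, 1000);
  for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(sd.doc);      // same searcher, so the docIDs stay valid
    // ... read stored fields ...
  }
} finally {
  manager.release(searcher);                  // release it rather than closing it directly
}

After index changes, the manager is asked to refresh itself (maybeReopen()/maybeRefresh(),
depending on the release) instead of reopening readers by hand.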

On Fri, Aug 3, 2012 at 3:41 PM, Harald Kirsch  wrote:
> I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a
> persistent map. I am entering 38000 documents at a rate of 1000/s to the
> index. Because each item add may be actually an update, I have a sequence of
> read/change/write for each of the documents.
>
> All goes well until when just after writing the last item, I run a query
> that retrieves about 16000 documents. All docids are collected in a
> Collector, and, yes, I make sure to rebase the docIds. Then I iterate over
> all docIds found and retrieve the documents basically like this:
>
>   for(int docId : docIds) {
> Document d = getSearcher().doc(docId);
> ..
>   }
>
> where getSearcher() uses IndexReader.openIfChanged() to always get the most
> current searcher and makes sure to eventually close the old searcher.
>
>
> At document 15940 I get an exception like this:
>
> Exception in thread "main" java.lang.IllegalArgumentException: docID must be
>>= 0 and < maxDoc=1 (got docID=1)
> at
> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
> at
> org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
> at
> org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)
>
> I can get rid of the Exception by one of two ways that I both don't like:
>
> 1) Put a Thread.sleep(1000) just before running the query+document retrieval
> part.
>
> 2) Use the same IndexSearcher to retrieve all documents instead of calling
> getSearcher for each document retrieval.
>
> This is just a test single threaded test program. I only see Lucene Merge
> threads in jvisualvm besides the main thread. A breakpoint on the exception
> shows that org.apache.lucene.index.DirectoryReader.document does seem to
> have wrong segments, which triggers the Exception.
>
> Since Lucene 3.6.1 is in productive use for some time I doubt it is a bug in
> Lucene, but I don't see what I am doing wrong. It might be connected to
> trying to get the freshest IndexReader for retrieving documents.
>
> Any better ideas or explanations?
>
> Harald.
>
> --
> Harald Kirsch
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Analyzer on query question

2012-08-03 Thread Bill Chesky
Jack,

Thanks.  Yeah, I don't know what you mean by term analysis.  I googled it but 
didn't come up with much.  So if that is the preferred way of doing this, a 
wiki document would be greatly appreciated.  

I notice you did say I should be doing the term analysis first.  But is it 
wrong to do it the way I described in my original email?  Will it give me 
incorrect results?

Bill


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, August 03, 2012 9:33 AM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the simple answer to your original question is that in general you 
should apply the same or similar analysis for your query terms as you do 
with your indexed data. In your specific case the Query.toString is 
generating your unanalyzed terms and then the query parser is performing the 
needed analysis. The real point is that you should be doing the term analysis 
before invoking "new Term". Alas, term analysis has changed dramatically 
over the past couple of years, so the solution to doing analysis before 
generating a Term/TermQuery will vary from Lucene release to release.

We really do need a wiki page for Lucene term analysis.

-- Jack Krupansky

-Original Message- 
From: Bill Chesky
Sent: Friday, August 03, 2012 9:19 AM
To: simon.willna...@gmail.com ; java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
have been introduced until 3.1.0.  Similarly my version of Lucene does not 
have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
BooleanQuery.add(BooleanClause).

In any case, most of what you're doing there, I'm just not familiar with. 
Seems very low level.  I've never had to use TokenStreams to build a query 
before and I'm not really sure what is going on there.  Also, I don't know 
what PositionIncrementAttribute is or how it would be used to create a 
PhraseQuery.   The way I'm currently creating PhraseQuerys is very 
straightforward and intuitive.  E.g. to search for the term "foo bar" I'd 
build the query like this:

PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("title", "foo"));
phraseQuery.add(new Term("title", "bar"));

Is there really no easier way to associate the correct analyzer with these 
types of queries?

Bill

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@gmail.com]
Sent: Friday, August 03, 2012 3:43 AM
To: java-user@lucene.apache.org; Bill Chesky
Subject: Re: Analyzer on query question

On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
 wrote:
> Hi,
>
> I understand that generally speaking you should use the same analyzer on 
> querying as was used on indexing.  In my code I am using the 
> SnowballAnalyzer on index creation.  However, on the query side I am 
> building up a complex BooleanQuery from other BooleanQuerys and/or 
> PhraseQuerys on several fields.  None of these require specifying an 
> analyzer anywhere.  This is causing some odd results, I think, because a 
> different analyzer (or no analyzer?) is being used for the query.
>
> Question: how do I build my boolean and phrase queries using the 
> SnowballAnalyzer?
>
> One thing I did that seemed to kind of work was to build my complex query 
> normally then build a snowball-analyzed query using a QueryParser 
> instantiated with a SnowballAnalyzer.  To do this, I simply pass the 
> string value of the complex query to the QueryParser.parse() method to get 
> the new query.  Something like this:
>
> // build a complex query from other BooleanQuerys and PhraseQuerys
> BooleanQuery fullQuery = buildComplexQuery();
> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
> SnowballAnalyzer(Version.LUCENE_30, "English"));
> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>
> TopScoreDocCollector collector = TopScoreDocCollector.create(1, 
> true);
> indexSearcher.search(snowballAnalyzedQuery, collector);

you can just use the analyzer directly like this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

TokenStream stream = analyzer.tokenStream("title", new StringReader(fullQuery.toString()));
CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
BooleanQuery q = new BooleanQuery();
while (stream.incrementToken()) {
  q.addClause(new BooleanClause(Occur.MUST, new Term("title", termAttr.toString())));
}

you also have access to the token positions if you want to create
phrase queries etc. just add a PositionIncrementAttribute like this:
PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);

pls. doublecheck the code it's straight from the top of my head.

simon

>
> Like I said, this seems to kind of work but it doesn't feel right.  Does 
> this make sense?  Is there a better way?
>
> thanks in advance,
>
> Bill



RE: Analyzer on query question

2012-08-03 Thread Bill Chesky
Ian,

I gave this method a try, at least the way I understood your suggestion. E.g. 
to search for the phrase "cells combine" I built up a string like:

title:"cells combine" description:"cells combine" text:"cells combine"

then I passed that to the queryParser.parse() method (where queryParser is an 
instance of QueryParser constructed using SnowballAnalyzer) and added the 
result as a MUST clause in my final BooleanQuery.

When I print the resulting query out as a string I get:

+(title:"cell combin" description:"cell combin" keywords:"cell combin")

So it looks like the SnowballAnalyzer is doing some stemming for me.  But this 
is the exact same result I'd get doing it the way I described in my original 
email.  I just built the unanalyzed string on my own rather than using the 
various query classes like PhraseQuery, etc.  

So I don't see the advantage to doing it this way over the original method.  I 
just don't know if the original way I described is wrong or will give me bad 
results.

thanks for the help,

Bill

-Original Message-
From: Ian Lea [mailto:ian@gmail.com] 
Sent: Friday, August 03, 2012 9:32 AM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

You can add parsed queries to a BooleanQuery.  Would that help in this case?

SnowballAnalyzer sba = whatever();
QueryParser qp = new QueryParser(..., sba);
Query q1 = qp.parse("some snowball string");
Query q2 = qp.parse("some other snowball string");

BooleanQuery bq = new BooleanQuery();
bq.add(q1, ...);
bq.add(q2, ...);
bq.add(loads of other stuff);


--
ian.


On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky  wrote:
> Thanks Simon,
>
> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
> have been introduced until 3.1.0.  Similarly my version of Lucene does not 
> have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
> BooleanQuery.add(BooleanClause).

>
> In any case, most of what you're doing there, I'm just not familiar with.  
> Seems very low level.  I've never had to use TokenStreams to build a query 
> before and I'm not really sure what is going on there.  Also, I don't know 
> what PositionIncrementAttribute is or how it would be used to create a 
> PhraseQuery.   The way I'm currently creating PhraseQuerys is very 
> straightforward and intuitive.  E.g. to search for the term "foo bar" I'd 
> build the query like this:
>
> PhraseQuery phraseQuery = new 
> PhraseQuery();
> phraseQuery.add(new 
> Term("title", "foo"));
> phraseQuery.add(new 
> Term("title", "bar"));
>
> Is there really no easier way to associate the correct analyzer with these 
> types of queries?
>
> Bill
>
> -Original Message-
> From: Simon Willnauer [mailto:simon.willna...@gmail.com]
> Sent: Friday, August 03, 2012 3:43 AM
> To: java-user@lucene.apache.org; Bill Chesky
> Subject: Re: Analyzer on query question
>
> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>  wrote:
>> Hi,
>>
>> I understand that generally speaking you should use the same analyzer on 
>> querying as was used on indexing.  In my code I am using the 
>> SnowballAnalyzer on index creation.  However, on the query side I am 
>> building up a complex BooleanQuery from other BooleanQuerys and/or 
>> PhraseQuerys on several fields.  None of these require specifying an 
>> analyzer anywhere.  This is causing some odd results, I think, because a 
>> different analyzer (or no analyzer?) is being used for the query.
>>
>> Question: how do I build my boolean and phrase queries using the 
>> SnowballAnalyzer?
>>
>> One thing I did that seemed to kind of work was to build my complex query 
>> normally then build a snowball-analyzed query using a QueryParser 
>> instantiated with a SnowballAnalyzer.  To do this, I simply pass the string 
>> value of the complex query to the QueryParser.parse() method to get the new 
>> query.  Something like this:
>>
>> // build a complex query from other BooleanQuerys and PhraseQuerys
>> BooleanQuery fullQuery = buildComplexQuery();
>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
>> SnowballAnalyzer(Version.LUCENE_30, "English"));
>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>
>> TopScoreDocCollector collector = TopScoreDocCollector.create(1, 
>> true);
>> indexSearcher.search(snowballAnalyzedQuery, collector);
>
> you can just use the analyzer directly like this:
> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>
> TokenStream stream = analyzer.tokenStream("title", new
> StringReader(fullQuery.toString()):
> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
> stream.reset();
> BooleanQuery q = new BooleanQuery();
> while(stream.incrementToken()) {
>   q.addClause(new BooleanClause(Occur.MUST, new Term("title",
> termAttr.toString

Re: Analyzer on query question

2012-08-03 Thread Ian Lea
Bill


You're getting the snowball stemming either way which I guess is good,
and if you get the same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.

My way seems better to me, but then it would.  If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky  wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion. E.g. 
> to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"
>
> then I passed that to the queryParser.parse() method (where queryParser is an 
> instance of QueryParser constructed using SnowballAnalyzer) and added the 
> result as a MUST clause in my final BooleanQuery.
>
> When I print the resulting query out as a string I get:
>
> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>
> So it looks like the SnowballAnalyzer is doing some stemming for me.  But 
> this is the exact same result I'd get doing it the way I described in my 
> original email.  I just built the unanalyzed string on my own rather than 
> using the various query classes like PhraseQuery, etc.
>
> So I don't see the advantage to doing it this way over the original method.  
> I just don't know if the original way I described is wrong or will give me 
> bad results.
>
> thanks for the help,
>
> Bill
>
> -Original Message-
> From: Ian Lea [mailto:ian@gmail.com]
> Sent: Friday, August 03, 2012 9:32 AM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer on query question
>
> You can add parsed queries to a BooleanQuery.  Would that help in this case?
>
> SnowballAnalyzer sba = whatever();
> QueryParser qp = new QueryParser(..., sba);
> Query q1 = qp.parse("some snowball string");
> Query q2 = qp.parse("some other snowball string");
>
> BooleanQuery bq = new BooleanQuery();
> bq.add(q1, ...);
> bq.add(q2, ...);
> bq.add(loads of other stuff);
>
>
> --
> ian.
>
>
> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky  
> wrote:
>> Thanks Simon,
>>
>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem to 
>> have been introduced until 3.1.0.  Similarly my version of Lucene does not 
>> have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
>> BooleanQuery.add(BooleanClause).
>
>>
>> In any case, most of what you're doing there, I'm just not familiar with.  
>> Seems very low level.  I've never had to use TokenStreams to build a query 
>> before and I'm not really sure what is going on there.  Also, I don't know 
>> what PositionIncrementAttribute is or how it would be used to create a 
>> PhraseQuery.   The way I'm currently creating PhraseQuerys is very 
>> straightforward and intuitive.  E.g. to search for the term "foo bar" I'd 
>> build the query like this:
>>
>> PhraseQuery phraseQuery = 
>> new PhraseQuery();
>> phraseQuery.add(new 
>> Term("title", "foo"));
>> phraseQuery.add(new 
>> Term("title", "bar"));
>>
>> Is there really no easier way to associate the correct analyzer with these 
>> types of queries?
>>
>> Bill
>>
>> -Original Message-
>> From: Simon Willnauer [mailto:simon.willna...@gmail.com]
>> Sent: Friday, August 03, 2012 3:43 AM
>> To: java-user@lucene.apache.org; Bill Chesky
>> Subject: Re: Analyzer on query question
>>
>> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>>  wrote:
>>> Hi,
>>>
>>> I understand that generally speaking you should use the same analyzer on 
>>> querying as was used on indexing.  In my code I am using the 
>>> SnowballAnalyzer on index creation.  However, on the query side I am 
>>> building up a complex BooleanQuery from other BooleanQuerys and/or 
>>> PhraseQuerys on several fields.  None of these require specifying an 
>>> analyzer anywhere.  This is causing some odd results, I think, because a 
>>> different analyzer (or no analyzer?) is being used for the query.
>>>
>>> Question: how do I build my boolean and phrase queries using the 
>>> SnowballAnalyzer?
>>>
>>> One thing I did that seemed to kind of work was to build my complex query 
>>> normally then build a snowball-analyzed query using a QueryParser 
>>> instantiated with a SnowballAnalyzer.  To do this, I simply pass the string 
>>> value of the complex query to the QueryParser.parse() method to get the new 
>>> query.  Something like this:
>>>
>>> // build a complex query from other BooleanQuerys and PhraseQuerys
>>> BooleanQuery fullQuery = buildComplexQuery();
>>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title", new 
>>> SnowballAnalyzer(Version.LUCENE_30, "English"));
>>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>>
>>> TopScoreD

Re: Analyzer on query question

2012-08-03 Thread Jack Krupansky
Bill, the re-parse of Query.toString will work provided that your query 
terms are either un-analyzed or their analyzer is "idempotent" (can be 
applied repeatedly without changing the output terms.) In your case, you are 
doing the former.


The bottom line: 1) if it works for you, great, 2) for other readers, please 
do not depend on this approach if your input data is filtered in any way - 
if your index analyzer "filters" terms (e.g., stemming, case changes, 
term-splitting), your Term/TermQuery should be analyzed/filtered comparably, 
in which case the extra parse (to cause term analysis such as stemming) 
becomes unnecessary and risky if you are not very careful or very lucky.


-- Jack Krupansky

-Original Message- 
From: Ian Lea

Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.

My way seems better to me, but then it would.  If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky  
wrote:

Ian,

I gave this method a try, at least the way I understood your suggestion. 
E.g. to search for the phrase "cells combine" I built up a string like:


title:"cells combine" description:"cells combine" text:"cells combine"

then I passed that to the queryParser.parse() method (where queryParser is 
an instance of QueryParser constructed using SnowballAnalyzer) and added 
the result as a MUST clause in my final BooleanQuery.


When I print the resulting query out as a string I get:

+(title:"cell combin" description:"cell combin" keywords:"cell combin")

So it looks like the SnowballAnalyzer is doing some stemming for me.  But 
this is the exact same result I'd get doing it the way I described in my 
original email.  I just built the unanalyzed string on my own rather than 
using the various query classes like PhraseQuery, etc.


So I don't see the advantage to doing it this way over the original 
method.  I just don't know if the original way I described is wrong or 
will give me bad results.


thanks for the help,

Bill

-Original Message-
From: Ian Lea [mailto:ian@gmail.com]
Sent: Friday, August 03, 2012 9:32 AM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

You can add parsed queries to a BooleanQuery.  Would that help in this 
case?


SnowballAnalyzer sba = whatever();
QueryParser qp = new QueryParser(..., sba);
Query q1 = qp.parse("some snowball string");
Query q2 = qp.parse("some other snowball string");

BooleanQuery bq = new BooleanQuery();
bq.add(q1, ...);
bq.add(q2, ...);
bq.add(loads of other stuff);


--
ian.


On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky  
wrote:

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem 
to have been introduced until 3.1.0.  Similarly my version of Lucene does 
not have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
BooleanQuery.add(BooleanClause).




In any case, most of what you're doing there, I'm just not familiar with. 
Seems very low level.  I've never had to use TokenStreams to build a 
query before and I'm not really sure what is going on there.  Also, I 
don't know what PositionIncrementAttribute is or how it would be used to 
create a PhraseQuery.   The way I'm currently creating PhraseQuerys is 
very straightforward and intuitive.  E.g. to search for the term "foo 
bar" I'd build the query like this:


PhraseQuery phraseQuery = 
new PhraseQuery();
phraseQuery.add(new 
Term("title", "foo"));
phraseQuery.add(new 
Term("title", "bar"));


Is there really no easier way to associate the correct analyzer with 
these types of queries?


Bill

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@gmail.com]
Sent: Friday, August 03, 2012 3:43 AM
To: java-user@lucene.apache.org; Bill Chesky
Subject: Re: Analyzer on query question

On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
 wrote:

Hi,

I understand that generally speaking you should use the same analyzer on 
querying as was used on indexing.  In my code I am using the 
SnowballAnalyzer on index creation.  However, on the query side I am 
building up a complex BooleanQuery from other BooleanQuerys and/or 
PhraseQuerys on several fields.  None of these require specifying an 
analyzer anywhere.  This is causing some odd results, I think, because a 
different analyzer (or no analyzer?) is being used for the query.


Question: how do I build my boolean and phrase queries using the 
SnowballAnalyzer?


One thing I did that seemed to kind of work was to build my

RE: Analyzer on query question

2012-08-03 Thread Bill Chesky
Ian/Jack,

Ok, thanks for the help.  I certainly don't want to take a cheap way out, hence 
my original question about whether this is the right way to do this.  Jack, you 
say the right way is to do Term analysis before creating the Term.  If anybody 
has any information on how to accomplish this I'd greatly appreciate it.

regards,

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query 
terms are either un-analyzed or their analyzer is "idempotent" (can be 
applied repeatedly without changing the output terms.) In your case, you are 
doing the former.

The bottom line: 1) if it works for you, great, 2) for other readers, please 
do not depend on this approach if your input data is filtered in any way - 
if your index analyzer "filters" terms (e.g, stemming, case changes, 
term-splitting), your Term/TermQuery should be analyzed/filtered comparably, 
in which case the extra parse (to cause term analysis such as stemming) 
becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-Original Message- 
From: Ian Lea
Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.

My way seems better to me, but then it would.  If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky  
wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion. 
> E.g. to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"
>
> then I passed that to the queryParser.parse() method (where queryParser is 
> an instance of QueryParser constructed using SnowballAnalyzer) and added 
> the result as a MUST clause in my final BooleanQuery.
>
> When I print the resulting query out as a string I get:
>
> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>
> So it looks like the SnowballAnalyzer is doing some stemming for me.  But 
> this is the exact same result I'd get doing it the way I described in my 
> original email.  I just built the unanalyzed string on my own rather than 
> using the various query classes like PhraseQuery, etc.
>
> So I don't see the advantage to doing it this way over the original 
> method.  I just don't know if the original way I described is wrong or 
> will give me bad results.
>
> thanks for the help,
>
> Bill
>
> -Original Message-
> From: Ian Lea [mailto:ian@gmail.com]
> Sent: Friday, August 03, 2012 9:32 AM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer on query question
>
> You can add parsed queries to a BooleanQuery.  Would that help in this 
> case?
>
> SnowballAnalyzer sba = whatever();
> QueryParser qp = new QueryParser(..., sba);
> Query q1 = qp.parse("some snowball string");
> Query q2 = qp.parse("some other snowball string");
>
> BooleanQuery bq = new BooleanQuery();
> bq.add(q1, ...);
> bq.add(q2, ...);
> bq.add(loads of other stuff);
>
>
> --
> ian.
>
>
> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky  
> wrote:
>> Thanks Simon,
>>
>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem 
>> to have been introduced until 3.1.0.  Similarly my version of Lucene does 
>> not have a BooleanQuery.addClause(BooleanClause) method.  Maybe you meant 
>> BooleanQuery.add(BooleanClause).
>
>>
>> In any case, most of what you're doing there, I'm just not familiar with. 
>> Seems very low level.  I've never had to use TokenStreams to build a 
>> query before and I'm not really sure what is going on there.  Also, I 
>> don't know what PositionIncrementAttribute is or how it would be used to 
>> create a PhraseQuery.   The way I'm currently creating PhraseQuerys is 
>> very straightforward and intuitive.  E.g. to search for the term "foo 
>> bar" I'd build the query like this:
>>
>> PhraseQuery phraseQuery = 
>> new PhraseQuery();
>> phraseQuery.add(new 
>> Term("title", "foo"));
>> phraseQuery.add(new 
>> Term("title", "bar"));
>>
>> Is there really no easier way to associate the correct analyzer with 
>> these types of queries?
>>
>> Bill
>>
>> -Original Message-
>> From: Simon Willnauer [mailto:simon.willna...@gmail.com]
>> Sent: Friday, August 03, 2012 3:43 AM
>> To: java-user@lucene.apache.org; Bill Chesky
>> Subject: Re: Analyzer on que

Re: Analyzer on query question

2012-08-03 Thread Jack Krupansky

Simon gave sample code for analyzing a multi-term string.

Here's some pseudo-code (hasn't been compiled to check it) to analyze a 
single term with Lucene 3.6:


public Term analyzeTerm(Analyzer analyzer, String termString){
  TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
  if (stream.incrementToken())
    return new Term(stream.getAttribute(CharTermAttribute.class).toString());
  else
    return null;
  // TODO: Close the StringReader
  // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

And here's the corresponding code for Lucene 4.0:

public Term analyzeTerm(Analyzer analyzer, String termString){
  TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
  if (stream.incrementToken()){
    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    BytesRef bytes = termAtt.getBytesRef();
    return new Term(BytesRef.deepCopyOf(bytes));
  } else
    return null;
  // TODO: Close the StringReader
  // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

-- Jack Krupansky

-Original Message- 
From: Bill Chesky

Sent: Friday, August 03, 2012 2:55 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Ian/Jack,

Ok, thanks for the help.  I certainly don't want to take a cheap way out, 
hence my original question about whether this is the right way to do this. 
Jack, you say the right way is to do Term analysis before creating the Term. 
If anybody has any information on how to accomplish this I'd greatly 
appreciate it.


regards,

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query
terms are either un-analyzed or their analyzer is "idempotent" (can be
applied repeatedly without changing the output terms.) In your case, you are
doing the former.

The bottom line: 1) if it works for you, great, 2) for other readers, please
do not depend on this approach if your input data is filtered in any way -
if your index analyzer "filters" terms (e.g, stemming, case changes,
term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
in which case the extra parse (to cause term analysis such as stemming)
becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-Original Message- 
From: Ian Lea

Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.

My way seems better to me, but then it would.  If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky 
wrote:

Ian,

I gave this method a try, at least the way I understood your suggestion.
E.g. to search for the phrase "cells combine" I built up a string like:

title:"cells combine" description:"cells combine" text:"cells combine"

then I passed that to the queryParser.parse() method (where queryParser is
an instance of QueryParser constructed using SnowballAnalyzer) and added
the result as a MUST clause in my final BooleanQuery.

When I print the resulting query out as a string I get:

+(title:"cell combin" description:"cell combin" keywords:"cell combin")

So it looks like the SnowballAnalyzer is doing some stemming for me.  But
this is the exact same result I'd get doing it the way I described in my
original email.  I just built the unanalyzed string on my own rather than
using the various query classes like PhraseQuery, etc.

So I don't see the advantage to doing it this way over the original
method.  I just don't know if the original way I described is wrong or
will give me bad results.

thanks for the help,

Bill

-Original Message-
From: Ian Lea [mailto:ian@gmail.com]
Sent: Friday, August 03, 2012 9:32 AM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

You can add parsed queries to a BooleanQuery.  Would that help in this
case?

SnowballAnalyzer sba = whatever();
QueryParser qp = new QueryParser(..., sba);
Query q1 = qp.parse("some snowball string");
Query q2 = qp.parse("some other snowball string");

BooleanQuery bq = new BooleanQuery();
bq.add(q1, ...);
bq.add(q2, ...);
bq.add(loads of other stuff);


--
ian.


On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky 
wrote:

Thanks Simon,

Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem
to have been introduced until 3.1.0.  Similarly my version of Lucene does
not have a BooleanQuery.addClause(BooleanClause) me

Re: Analyzer on query question

2012-08-03 Thread Robert Muir
you must call reset() before consuming any tokenstream.
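
Putting Robert's point together with Jack's 3.6 sketch, a corrected version might
look like this (still an uncompiled sketch; a field parameter is added because the
original relied on an undeclared "field" variable):

// TODO: handle analyzers that split the input into multiple terms (e.g. embedded punctuation)
public Term analyzeTerm(Analyzer analyzer, String field, String termString) throws IOException {
  TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
  try {
    stream.reset();                     // must be called before consuming the stream
    if (stream.incrementToken()) {
      String text = stream.getAttribute(CharTermAttribute.class).toString();
      return new Term(field, text);
    }
    return null;                        // the analyzer dropped the term (e.g. a stopword)
  } finally {
    stream.end();
    stream.close();                     // also releases the underlying StringReader
  }
}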

On Fri, Aug 3, 2012 at 4:03 PM, Jack Krupansky  wrote:
> Simon gave sample code for analyzing a multi-term string.
>
> Here's some pseudo-code (hasn't been compiled to check it) to analyze a
> single term with Lucene 3.6:
>
> public Term analyzeTerm(Analyzer analyzer, String termString){
>  TokenStream stream  = analyzer.tokenStream(field, new
> StringReader(termString));
>  if (stream.incrementToken())
>return new
> Term(stream.getAttribute(CharacterTermAttribute.class).toString());
>  else
>return null;
>  // TODO: Close the StringReader
>  // TODO: Handle terms that analyze into multiple terms (e.g., embedded
> punctuation)
> }
>
> And here's the corresponding code for Lucene 4.0:
>
> public Term analyzeTerm(Analyzer analyzer, String termString){
>  TokenStream stream  = analyzer.tokenStream(field, new
> StringReader(termString));
>  if (stream.incrementToken()){
>TermToBytesRefAttribute termAtt =
> stream.getAttribute(TermToBytesRefAttribute.class);
>BytesRef bytes = termAtt.getBytesRef();
>return new Term(BytesRef.deepCopyOf(bytes));
>  } else
>return null;
>  // TODO: Close the StringReader
>  // TODO: Handle terms that analyze into multiple terms (e.g., embedded
> punctuation)
> }
>
> -- Jack Krupansky
>
> -Original Message- From: Bill Chesky
> Sent: Friday, August 03, 2012 2:55 PM
> To: java-user@lucene.apache.org
>
> Subject: RE: Analyzer on query question
>
> Ian/Jack,
>
> Ok, thanks for the help.  I certainly don't want to take a cheap way out,
> hence my original question about whether this is the right way to do this.
> Jack, you say the right way is to do Term analysis before creating the Term.
> If anybody has any information on how to accomplish this I'd greatly
> appreciate it.
>
> regards,
>
> Bill
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Friday, August 03, 2012 1:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer on query question
>
> Bill, the re-parse of Query.toString will work provided that your query
> terms are either un-analyzed or their analyzer is "idempotent" (can be
> applied repeatedly without changing the output terms.) In your case, you are
> doing the former.
>
> The bottom line: 1) if it works for you, great, 2) for other readers, please
> do not depend on this approach if your input data is filtered in any way -
> if your index analyzer "filters" terms (e.g, stemming, case changes,
> term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
> in which case the extra parse (to cause term analysis such as stemming)
> becomes unnecessary and risky if you are not very careful or very lucky.
>
> -- Jack Krupansky
>
> -Original Message- From: Ian Lea
> Sent: Friday, August 03, 2012 1:12 PM
> To: java-user@lucene.apache.org
> Subject: Re: Analyzer on query question
>
> Bill
>
>
> You're getting the snowball stemming either way which I guess is good,
> and if you get same results either way maybe it doesn't matter which
> technique you use.  I'd be a bit worried about parsing the result of
> query.toString() because you aren't guaranteed to get back, in text,
> what you put in.
>
> My way seems better to me, but then it would.  If you prefer your way
> I won't argue with you.
>
>
> --
> Ian.
>
>
> On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky 
> wrote:
>>
>> Ian,
>>
>> I gave this method a try, at least the way I understood your suggestion.
>> E.g. to search for the phrase "cells combine" I built up a string like:
>>
>> title:"cells combine" description:"cells combine" text:"cells combine"
>>
>> then I passed that to the queryParser.parse() method (where queryParser is
>> an instance of QueryParser constructed using SnowballAnalyzer) and added
>> the result as a MUST clause in my final BooleanQuery.
>>
>> When I print the resulting query out as a string I get:
>>
>> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>>
>> So it looks like the SnowballAnalyzer is doing some stemming for me.  But
>> this is the exact same result I'd get doing it the way I described in my
>> original email.  I just built the unanalyzed string on my own rather than
>> using the various query classes like PhraseQuery, etc.
>>
>> So I don't see the advantage to doing it this way over the original
>> method.  I just don't know if the original way I described is wrong or
>> will give me bad results.
>>
>> thanks for the help,
>>
>> Bill
>>
>> -Original Message-
>> From: Ian Lea [mailto:ian@gmail.com]
>> Sent: Friday, August 03, 2012 9:32 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Analyzer on query question
>>
>> You can add parsed queries to a BooleanQuery.  Would that help in this
>> case?
>>
>> SnowballAnalyzer sba = whatever();
>> QueryParser qp = new QueryParser(..., sba);
>> Query q1 = qp.parse("some snowball string");
>> Query q2 = qp.parse("some other snowball string");
>>
>>

Re: Analyzer on query question

2012-08-03 Thread Ian Lea
I still don't see what Bill gains by doing the term analysis himself
rather than letting QueryParser do the hard work, in a portable
non-lucene-version-specific way.
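
A rough, untested sketch of that approach, for anyone following along (Lucene 3.0-era API, field names taken from Bill's example, ParseException handling omitted):

Analyzer sba = new SnowballAnalyzer(Version.LUCENE_30, "English");
BooleanQuery full = new BooleanQuery();
for (String field : new String[] {"title", "description", "keywords"}) {
  // QueryParser runs the analyzer, so "cells combine" becomes "cell combin"
  QueryParser qp = new QueryParser(Version.LUCENE_30, field, sba);
  full.add(qp.parse("\"cells combine\""), BooleanClause.Occur.SHOULD);
}
// other clauses (dates, ids, ...) can still be added with the query API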


--
Ian.


On Fri, Aug 3, 2012 at 9:39 PM, Robert Muir  wrote:
> you must call reset() before consuming any tokenstream.
>
> On Fri, Aug 3, 2012 at 4:03 PM, Jack Krupansky  
> wrote:
>> Simon gave sample code for analyzing a multi-term string.
>>
>> Here's some pseudo-code (hasn't been compiled to check it) to analyze a
>> single term with Lucene 3.6:
>>
>> public Term analyzeTerm(Analyzer analyzer, String termString){
>>  TokenStream stream  = analyzer.tokenStream(field, new
>> StringReader(termString));
>>  if (stream.incrementToken())
>>return new
>> Term(field, stream.getAttribute(CharTermAttribute.class).toString());
>>  else
>>return null;
>>  // TODO: Close the StringReader
>>  // TODO: Handle terms that analyze into multiple terms (e.g., embedded
>> punctuation)
>> }
>>
>> And here's the corresponding code for Lucene 4.0:
>>
>> public Term analyzeTerm(Analyzer analyzer, String termString){
>>  TokenStream stream  = analyzer.tokenStream(field, new
>> StringReader(termString));
>>  if (stream.incrementToken()){
>>TermToBytesRefAttribute termAtt =
>> stream.getAttribute(TermToBytesRefAttribute.class);
>>BytesRef bytes = termAtt.getBytesRef();
>>return new Term(field, BytesRef.deepCopyOf(bytes));
>>  } else
>>return null;
>>  // TODO: Close the StringReader
>>  // TODO: Handle terms that analyze into multiple terms (e.g., embedded
>> punctuation)
>> }
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Bill Chesky
>> Sent: Friday, August 03, 2012 2:55 PM
>> To: java-user@lucene.apache.org
>>
>> Subject: RE: Analyzer on query question
>>
>> Ian/Jack,
>>
>> Ok, thanks for the help.  I certainly don't want to take a cheap way out,
>> hence my original question about whether this is the right way to do this.
>> Jack, you say the right way is to do Term analysis before creating the Term.
>> If anybody has any information on how to accomplish this I'd greatly
>> appreciate it.
>>
>> regards,
>>
>> Bill
>>
>> -Original Message-
>> From: Jack Krupansky [mailto:j...@basetechnology.com]
>> Sent: Friday, August 03, 2012 1:22 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Analyzer on query question
>>
>> Bill, the re-parse of Query.toString will work provided that your query
>> terms are either un-analyzed or their analyzer is "idempotent" (can be
>> applied repeatedly without changing the output terms.) In your case, you are
>> doing the former.
>>
>> The bottom line: 1) if it works for you, great, 2) for other readers, please
>> do not depend on this approach if your input data is filtered in any way -
>> if your index analyzer "filters" terms (e.g., stemming, case changes,
>> term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
>> in which case the extra parse (to cause term analysis such as stemming)
>> becomes unnecessary and risky if you are not very careful or very lucky.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Ian Lea
>> Sent: Friday, August 03, 2012 1:12 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Analyzer on query question
>>
>> Bill
>>
>>
>> You're getting the snowball stemming either way which I guess is good,
>> and if you get same results either way maybe it doesn't matter which
>> technique you use.  I'd be a bit worried about parsing the result of
>> query.toString() because you aren't guaranteed to get back, in text,
>> what you put in.
>>
>> My way seems better to me, but then it would.  If you prefer your way
>> I won't argue with you.
>>
>>
>> --
>> Ian.
>>
>>
>> On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky 
>> wrote:
>>>
>>> Ian,
>>>
>>> I gave this method a try, at least the way I understood your suggestion.
>>> E.g. to search for the phrase "cells combine" I built up a string like:
>>>
>>> title:"cells combine" description:"cells combine" text:"cells combine"
>>>
>>> then I passed that to the queryParser.parse() method (where queryParser is
>>> an instance of QueryParser constructed using SnowballAnalyzer) and added
>>> the result as a MUST clause in my final BooleanQuery.
>>>
>>> When I print the resulting query out as a string I get:
>>>
>>> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>>>
>>> So it looks like the SnowballAnalyzer is doing some stemming for me.  But
>>> this is the exact same result I'd get doing it the way I described in my
>>> original email.  I just built the unanalyzed string on my own rather than
>>> using the various query classes like PhraseQuery, etc.
>>>
>>> So I don't see the advantage to doing it this way over the original
>>> method.  I just don't know if the original way I described is wrong or
>>> will give me bad results.
>>>
>>> thanks for the help,
>>>
>>> Bill
>>>
>>> -Original Message-
>>> From: Ian Lea [mailto:ian@gmail.com]
>>> Sent: Friday, August 03, 2012 9:32 AM

RE: Analyzer on query question

2012-08-03 Thread Bill Chesky
Thanks for the help everybody.  We're using 3.0.1 so I couldn't do exactly what 
Simon and Jack suggested.  But after some searching around I came up with this 
method:

private String analyze(String token) throws Exception {
StringBuffer result = new StringBuffer();

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, 
"English");
TokenStream tokenStream = analyzer.tokenStream("title", new 
StringReader(token));
tokenStream.reset();
TermAttribute termAttribute = 
tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
if (result.length() > 0) {
result.append(" ");
}

result.append(termAttribute.term());
}

return result.toString();
}

Now I just run my search term strings thru this method first like so:

searchTerms = analyze(searchTerms);

  // now do what I was doing before to build queries...

It's still not totally clear what this buys me since ultimately the query looks 
the same as what was being generated with my original method (perhaps this is 
Ian's point in his last reply).  But I will defer to the gurus.  It works.

Thanks for all the help.

Bill
-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, August 03, 2012 4:03 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Simon gave sample code for analyzing a multi-term string.

Here's some pseudo-code (hasn't been compiled to check it) to analyze a 
single term with Lucene 3.6:

public Term analyzeTerm(Analyzer analyzer, String termString){
  TokenStream stream  = analyzer.tokenStream(field, new 
StringReader(termString));
  if (stream.incrementToken())
return new 
Term(field, stream.getAttribute(CharTermAttribute.class).toString());
  else
return null;
  // TODO: Close the StringReader
  // TODO: Handle terms that analyze into multiple terms (e.g., embedded 
punctuation)
}

And here's the corresponding code for Lucene 4.0:

public Term analyzeTerm(Analyzer analyzer, String termString){
  TokenStream stream  = analyzer.tokenStream(field, new 
StringReader(termString));
  if (stream.incrementToken()){
TermToBytesRefAttribute termAtt = 
stream.getAttribute(TermToBytesRefAttribute.class);
BytesRef bytes = termAtt.getBytesRef();
return new Term(field, BytesRef.deepCopyOf(bytes));
  } else
return null;
  // TODO: Close the StringReader
  // TODO: Handle terms that analyze into multiple terms (e.g., embedded 
punctuation)
}

-- Jack Krupansky

-Original Message- 
From: Bill Chesky
Sent: Friday, August 03, 2012 2:55 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Ian/Jack,

Ok, thanks for the help.  I certainly don't want to take a cheap way out, 
hence my original question about whether this is the right way to do this. 
Jack, you say the right way is to do Term analysis before creating the Term. 
If anybody has any information on how to accomplish this I'd greatly 
appreciate it.

regards,

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query
terms are either un-analyzed or their analyzer is "idempotent" (can be
applied repeatedly without changing the output terms.) In your case, you are
doing the former.

The bottom line: 1) if it works for you, great, 2) for other readers, please
do not depend on this approach if your input data is filtered in any way -
if your index analyzer "filters" terms (e.g., stemming, case changes,
term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
in which case the extra parse (to cause term analysis such as stemming)
becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-Original Message- 
From: Ian Lea
Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.

My way seems better to me, but then it would.  If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky 
wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion.
> E.g. to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"

Re: Analyzer on query question

2012-08-03 Thread Jack Krupansky
What it buys you is not having to convert the whole "complex" query to 
string form, which is not guaranteed to be reparseable for all queries 
(e.g., "AND" or "-abc" as raw terms would be treated as operators), and then 
parsing it and hoping the parser regenerates the same query structure. 
In theory, building the terms directly guarantees fidelity of the query and 
improves performance (the toString/parse round trip is not cheap or free).


As I said, the toString/reparse may indeed work for your specific use-case, 
but isn't quite ideal for general use.
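
A small, untested illustration of the kind of input that breaks the round trip (Lucene 3.x API; the field name is just an example):

// a field value that legitimately contains the token "AND"
Query raw = new TermQuery(new Term("title", "AND"));
String printed = raw.toString();   // prints as title:AND
// re-parsing sees AND as the boolean operator and fails with a
// ParseException (other inputs may silently become a different query)
QueryParser qp = new QueryParser(Version.LUCENE_30, "title",
    new SnowballAnalyzer(Version.LUCENE_30, "English"));
Query reparsed = qp.parse(printed);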


-- Jack Krupansky

-Original Message- 
From: Bill Chesky

Sent: Friday, August 03, 2012 5:35 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Thanks for the help everybody.  We're using 3.0.1 so I couldn't do exactly 
what Simon and Jack suggested.  But after some searching around I came up 
with this method:


private String analyze(String token) throws Exception {
StringBuffer result = new StringBuffer();

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
TokenStream tokenStream = analyzer.tokenStream("title", new 
StringReader(token));

tokenStream.reset();
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
if (result.length() > 0) {
result.append(" ");
}

   result.append(termAttribute.term());
}

return result.toString();
}

Now I just run my search term strings thru this method first like so:

searchTerms = analyze(searchTerms);

 // now do what I was doing before to build queries...

It's still not totally clear what this buys me since ultimately the query 
looks the same as what was being generated with my original method (perhaps 
this is Ian's point in his last reply).  But I will defer to the gurus.  It 
works.


Thanks for all the help.

Bill
-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 4:03 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Simon gave sample code for analyzing a multi-term string.

Here's some pseudo-code (hasn't been compiled to check it) to analyze a
single term with Lucene 3.6:

public Term analyzeTerm(Analyzer analyzer, String termString){
 TokenStream stream  = analyzer.tokenStream(field, new
StringReader(termString));
 if (stream.incrementToken())
   return new
Term(field, stream.getAttribute(CharTermAttribute.class).toString());
 else
   return null;
 // TODO: Close the StringReader
 // TODO: Handle terms that analyze into multiple terms (e.g., embedded
punctuation)
}

And here's the corresponding code for Lucene 4.0:

public Term analyzeTerm(Analyzer analyzer, String termString){
 TokenStream stream  = analyzer.tokenStream(field, new
StringReader(termString));
 if (stream.incrementToken()){
   TermToBytesRefAttribute termAtt =
stream.getAttribute(TermToBytesRefAttribute.class);
   BytesRef bytes = termAtt.getBytesRef();
   return new Term(field, BytesRef.deepCopyOf(bytes));
 } else
   return null;
 // TODO: Close the StringReader
 // TODO: Handle terms that analyze into multiple terms (e.g., embedded
punctuation)
}

-- Jack Krupansky

-Original Message- 
From: Bill Chesky

Sent: Friday, August 03, 2012 2:55 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzer on query question

Ian/Jack,

Ok, thanks for the help.  I certainly don't want to take a cheap way out,
hence my original question about whether this is the right way to do this.
Jack, you say the right way is to do Term analysis before creating the Term.
If anybody has any information on how to accomplish this I'd greatly
appreciate it.

regards,

Bill

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query
terms are either un-analyzed or their analyzer is "idempotent" (can be
applied repeatedly without changing the output terms.) In your case, you are
doing the former.

The bottom line: 1) if it works for you, great, 2) for other readers, please
do not depend on this approach if your input data is filtered in any way -
if your index analyzer "filters" terms (e.g., stemming, case changes,
term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
in which case the extra parse (to cause term analysis such as stemming)
becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-Original Message- 
From: Ian Lea

Sent: Friday, August 03, 2012 1:12 PM
To: java-user@lucene.apache.org
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get same results either way maybe it doesn't matter which
technique you use.  I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text, what you put in.

Re: Problem with near realtime search

2012-08-03 Thread Harald Kirsch

Hello Simon,

thanks for the information. I really thought that once a docId is 
assigned it is kept until the document is deleted. The only problem I 
would have expected is docIds that no longer refer to a document, 
because it was deleted in the meantime. But this is clearly not the case 
in my setup.


But if docIds change during index rearrangement, then this would of 
course completely explain the symptoms I saw.


So docIds can definitely change under the hood?

Harald.


Am 03.08.2012 17:24, schrieb Simon Willnauer:

hey harald,

if you use a possibly different searcher (reader) than you used for
the search you will run into problems with the doc IDs since they
might change during the request. I suggest you use SearcherManager
or NRTManager and hold on to the searcher reference when you collect
the stored values. Just keep around the searcher you used and
NRTManager / SearcherManager will do the job for you.

simon
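
For reference, a rough, untested sketch of that pattern, the key point being that the search and the later doc() calls use the same IndexSearcher (writer and query stand in for your IndexWriter and Query; SearcherManager exists from Lucene 3.5 on, and constructor and refresh-method names differ slightly between releases, so check the javadocs for your version):

// somewhere near the IndexWriter, roughly (3.6/4.0-style constructor):
SearcherManager mgr = new SearcherManager(writer, true, new SearcherFactory());

IndexSearcher searcher = mgr.acquire();        // take one reference
try {
  TopDocs hits = searcher.search(query, 100);  // collect doc ids ...
  for (ScoreDoc sd : hits.scoreDocs) {
    Document d = searcher.doc(sd.doc);         // ... and resolve them on the SAME searcher
  }
} finally {
  mgr.release(searcher);                       // never close the searcher yourself
}
// after index changes, some other place calls mgr.maybeRefresh()
// (called maybeReopen() in 3.5)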

On Fri, Aug 3, 2012 at 3:41 PM, Harald Kirsch  wrote:

I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a
persistent map. I am entering 38000 documents at a rate of 1000/s to the
index. Because each item add may be actually an update, I have a sequence of
read/change/write for each of the documents.

All goes well until, just after writing the last item, I run a query
that retrieves about 16000 documents. All docids are collected in a
Collector, and, yes, I make sure to rebase the docIds. Then I iterate over
all docIds found and retrieve the documents basically like this:

   for(int docId : docIds) {
 Document d = getSearcher().doc(docId);
 ..
   }

where getSearcher() uses IndexReader.openIfChanged() to always get the most
current searcher and makes sure to eventually close the old searcher.


At document 15940 I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=1)

 at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
 at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
 at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the Exception by one of two ways that I both don't like:

1) Put a Thread.sleep(1000) just before running the query+document retrieval
part.

2) Use the same IndexSearcher to retrieve all documents instead of calling
getSearcher for each document retrieval.

This is just a single-threaded test program. I only see Lucene Merge
threads in jvisualvm besides the main thread. A breakpoint on the exception
shows that org.apache.lucene.index.DirectoryReader.document does seem to
have wrong segments, which triggers the Exception.

Since Lucene 3.6.1 has been in productive use for some time, I doubt it is a bug in
Lucene, but I don't see what I am doing wrong. It might be connected to
trying to get the freshest IndexReader for retrieving documents.

Any better ideas or explanations?

Harald.

--
Harald Kirsch


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with near realtime search

2012-08-03 Thread Harald Kirsch

Hello Simon,

now that I knew what to search for I found

http://wiki.apache.org/lucene-java/LuceneFAQ#When_is_it_possible_for_document_IDs_to_change.3F

So that clearly explains this issue for me.

Many thanks for your help.

Harald



Am 04.08.2012 07:38, schrieb Harald Kirsch:

Hello Simon,

thanks for the information. I really thought that once a docId is
assigned it is kept until the document is deleted. The only problem I
would have expected is docIds that no longer refer to a document,
because it was deleted in the meantime. But this is clearly not the case
in my setup.

But if docIds change during index rearrangement, then this would of
course completely explain the symptoms I saw.

So docIds can definitely change under the hood?

Harald.


Am 03.08.2012 17:24, schrieb Simon Willnauer:

hey harald,

if you use a possibly different searcher (reader) than you used for
the search you will run into problems with the doc IDs since they
might change during the request. I suggest you use SearcherManager
or NRTManager and hold on to the searcher reference when you collect
the stored values. Just keep around the searcher you used and
NRTManager / SearcherManager will do the job for you.

simon

On Fri, Aug 3, 2012 at 3:41 PM, Harald Kirsch
 wrote:

I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a
persistent map. I am entering 38000 documents at a rate of 1000/s to the
index. Because each item add may be actually an update, I have a
sequence of
read/change/write for each of the documents.

All goes well until, just after writing the last item, I run a query
that retrieves about 16000 documents. All docids are collected in a
Collector, and, yes, I make sure to rebase the docIds. Then I iterate
over
all docIds found and retrieve the documents basically like this:

   for(int docId : docIds) {
 Document d = getSearcher().doc(docId);
 ..
   }

where getSearcher() uses IndexReader.openIfChanged() to always get
the most
current searcher and makes sure to eventually close the old searcher.


At document 15940 I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: docID
must be >= 0 and < maxDoc=1 (got docID=1)

 at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
 at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)

 at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the Exception by one of two ways that I both don't
like:

1) Put a Thread.sleep(1000) just before running the query+document
retrieval
part.

2) Use the same IndexSearcher to retrieve all documents instead of
calling
getSearcher for each document retrieval.

This is just a single-threaded test program. I only see Lucene
Merge
threads in jvisualvm besides the main thread. A breakpoint on the
exception
shows that org.apache.lucene.index.DirectoryReader.document does seem to
have wrong segments, which triggers the Exception.

Since Lucene 3.6.1 has been in productive use for some time, I doubt it is a
bug in
Lucene, but I don't see what I am doing wrong. It might be connected to
trying to get the freshest IndexReader for retrieving documents.

Any better ideas or explanations?

Harald.

--
Harald Kirsch


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org