Re: Single searcher vs Multi Searcher

2008-10-07 Thread Anshum
These were all the links I could find:
http://findmeajob.wordpress.com/2007/07/31/lucene-singlesearcher-vs-multisearcher/
http://archives.devshed.com/forums/java-118/how-do-lucene-ids-work-with-multireader-and-multisearcher-993682.html

This is only from personal experimentation.
Maintaining 4 indexes and having a multisearcher over them sounds OK to
me, though it might (most probably would) use up more time (depending on your
term/doc distribution).
I have tried multisearchers over indexes sized at around 18 GB with
over 10 million records in them, and they seem to work fine. Memory
utilization hasn't really been my problem, though maintenance needs taking
care of if you have record updates as well, i.e.:
Let's assume we have 4 indexes {1,2,3,4}, with 4 being the oldest one. Now a
document gets updated, so you would want to move it to the freshest index,
viz. index 1. In this case you would also have to delete this document from
index 4 (after figuring out that it exists in 4), else this doc would show up
as 2 hits using a multisearcher.
This is the document movement that I was talking about.
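A minimal sketch of that movement, assuming the Lucene 2.1+ IndexWriter API (the shard paths and the "docId" unique-key field are hypothetical):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ShardMover {
    // Move an updated doc from the oldest shard to the freshest one so a
    // multisearcher doesn't return it twice.
    public static void move(String oldShard, String freshShard, String id,
                            Document updatedDoc, Analyzer analyzer)
            throws java.io.IOException {
        IndexWriter oldest = new IndexWriter(oldShard, analyzer, false);
        oldest.deleteDocuments(new Term("docId", id)); // drop the stale copy
        oldest.close();

        IndexWriter freshest = new IndexWriter(freshShard, analyzer, false);
        freshest.addDocument(updatedDoc);              // re-add the fresh copy
        freshest.close();
    }
}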
Talking only about online resource utilization (as what I just described
is generally an offline process), you would have to spend extra resources to
merge the hits from the different indexes.
The other overhead in the case of a multisearcher would be creating that
many searchers - one for each index (though it shouldn't be much; I just
thought I would point it out).
I guess you should try it, as search speed is not really all that important
to you compared to running it on a single box within the memory
limitation.
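
Concretely, the four-shard setup could look like this (a sketch against the Lucene 2.x API; the shard paths are made up):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class FourShardSearch {
    public static Hits search(Query query) throws Exception {
        // One searcher per weekly shard.
        Searchable[] shards = {
            new IndexSearcher("index1"), new IndexSearcher("index2"),
            new IndexSearcher("index3"), new IndexSearcher("index4")
        };
        // MultiSearcher merges and re-ranks hits across the shards; swap in
        // ParallelMultiSearcher to hit the shards concurrently.
        MultiSearcher searcher = new MultiSearcher(shards);
        return searcher.search(query);
    }
}

ParallelMultiSearcher has the same interface, so switching between the two is a one-line change.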



--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Tue, Oct 7, 2008 at 3:29 PM, Ganesh <[EMAIL PROTECTED]> wrote:

> Hello Anshum,
>
> My intention is to shard the index after every 7 days (week). After 30
> days (4th week), the first DB may get deleted. At any point of time I will
> be maintaining 3 to 4 DBs.
>
> I want to know the pros and cons of the MultiSearcher or Index sharding
> approach. Any web links would be helpful.
>
> Regards
> Ganesh
>
> - Original Message - From: "Anshum" <[EMAIL PROTECTED]>
> To: 
> Sent: Tuesday, October 07, 2008 12:18 AM
>
> Subject: Re: Single searcher vs Multi Searcher
>
>
>  Hi Ganesh,
>> About the memory consumption while sorting, it would end up using similar
>> amounts, perhaps even more.. like in the case of regular parallel
>> programming algorithms (hoping that you intend to search using a parallel
>> multi searcher). Would you have to query particular indexes only for a
>> particular search or would you be searching over all the indexes and then
>> follow it up with a merge (which the parallel multi searcher would do
>> efficiently)?
>> Also, I guess 30 indexes would be a little too many; I haven't really tried
>> out that many indexes with a multisearcher.
>> As far as maintenance of the DB is concerned, it might be easy as long as you
>> don't have any document updates; if you do, you'd have to shift the
>> documents from one DB/index to another (which includes creating an entry in
>> the latest index/DB and deleting the record from the older DB).
>> I guess you'd have to pilot it; in case memory is an issue in your case and
>> not speed, you could try a regular multisearcher instead of a parallel
>> multisearcher.
>> I guess when you say maintenance of the DB gets easier, you mean that the
>> data in each individual table is controlled (but remember there could be
>> other bigger hassles, like the one mentioned above about moving data between
>> indexes/DB).
>>
>> --
>> Anshum Gupta
>> Naukri Labs!
>> http://ai-cafe.blogspot.com
>>
>> The facts expressed here belong to everybody, the opinions to me. The
>> distinction is yours to draw
>>
>>
>> On Mon, Oct 6, 2008 at 10:06 AM, Ganesh <[EMAIL PROTECTED]> wrote:
>>
>>  Hello Anshum,
>>>
>>> My index is growing by 1 million documents per day. Initially I planned to
>>> have a single database, but sorting on one or more fields consumes more
>>> RAM. Would sharding the index also consume the same?
>>>
>>> My application should co-exist with other applications of my product, and my
>>> app could get 1 GB of RAM. Search speed is fine, but I need to display the
>>> results in sorted order.
>>>
>>> I thought to keep 7 days of documents in one index and create one more
>>> after the 7 days. After 30 days the first index may get deleted. I need to
>>> keep the documents in the index DB for 30 days. My index DB is on HDD.
>>>
>>> I want to know the pros and cons of sharding. I think maintenance of the DB
>>> becomes easier.
>>>
>>> It would be very much helpful, if you share some of your thoughts.
>>>
>>> Regards
>>> Ganesh
>>>
>>>
>>> - Original Message - From: "Anshum" <[EMAIL PROTECTED]>
>>> To: 
>>> Sent: Friday, October 03, 200

Re: Re-tokenized fields disappear

2008-10-07 Thread Erick Erickson
This is going to get really sticky given StandardAnalyzer. Let's say that
you have
codesearch:B05 1
codesearch:B05 2
codesearch:B05 3

When you index these, you'll index tokens B05, 1, 2, 3, along with
positional information. How to say "between 1 and 3" becomes a problem,
although it *might* work for you to search for
+codesearch:B05 +codesearch:[1 TO 3]...
(I've forgotten the between syntax, but you get the idea).
But I think that'll kinda work until you encounter case n + 1...

But all is not lost. You might be well served by indexing these with
something like KeywordAnalyzer (note: you might want to roll your
own analyzer that, for instance, uppercases before passing to
KeywordAnalyzer). Then, in the example above you'd index the following
tokens:
tokens:

B05 1
B05 2
B05 3


Now, you can search for tokens between "B05 1" and "B05 3" using
the normal range syntax (see the sketch after the numbered points below).

There are alternative schemes, but I think you would get some mileage
out of thinking about how to index these creatively.

A few points:
1> You may have to carefully massage the data for searching. For instance,
 assume that one of your tokens was B05 100. You might want to
 index "B05 001" rather than "B05 1" since Lucene normally
 searches lexically rather than numerically.
2> This could be a special search field that's a copy of your original. That is,
 you would have two fields where before you had one.
3> PerFieldAnalyzerWrapper is your friend, both at index and search time.
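
To make 1> and 2> concrete, here's a minimal sketch. One caveat: instead of a custom KeywordAnalyzer variant it uses Field.Index.UN_TOKENIZED, which likewise keeps each field value as a single token; the 3-digit padding width and the Lucene 2.x RangeQuery API are assumptions.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class CodeSearchSketch {
    // Index side: one untokenized, uppercased, zero-padded term per pair,
    // e.g. "B05 001", so lexical order matches numeric order up to 999.
    public static void addCode(Document doc, String code, int value) {
        String token = code.toUpperCase() + " " + pad(value);
        doc.add(new Field("codesearch", token,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    // Query side: a lexical range over the single-token terms, e.g. B05 1..3.
    public static RangeQuery range(String code, int lo, int hi) {
        return new RangeQuery(new Term("codesearch", code + " " + pad(lo)),
                              new Term("codesearch", code + " " + pad(hi)),
                              true); // inclusive at both ends
    }

    private static String pad(int v) {
        String s = String.valueOf(v);
        while (s.length() < 3) s = "0" + s;
        return s;
    }
}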

Best
Erick

On Mon, Oct 6, 2008 at 11:38 PM, John Griffin <[EMAIL PROTECTED]>wrote:

> My previous question may be moot but as is it is still a problem. Here's a
> little more info on my problem. The same named fields contain two pieces of
> information, a code "B05" and a value "1" as follows. The value can be a
> range such as 1 to 5 or 1 to 100.
>
>
>
> "codesearch", "B05 1"
>
>
>
> This field and other identically named but differently valued fields in the
> same document are related to a specific person as identified by another
> field say SSN. So, one person can have multiple code searches. Both of the
> codesearch values are related to one another and must be searchable such as
>
>
>
> Return all persons with a codesearch value of B05 ranging from 1 to 3.
>
>
>
> How can I go about this? Do these codesearch fields need to be in a
> separate
> index related by SSN?
>
>
>
> Thanks in advance.
>
>
>
> John G.
>
>


Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar

2008-10-07 Thread Edwin Smith
Thanks for the tip. I tried your experiment and, sure enough, it works just 
fine, so it's not the contents but obviously some other behavior of my custom 
reader. (Does the analyzer require that "mark" and "reset" be implemented, for 
example? I did not implement them.)

The stack trace is as follows:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:366)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:573)
    at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:139)
    at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:42)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1522)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1412)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1121)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2442)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2424)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1442)
    at com.affinovate.v4.server.search.ServerTest.main(ServerTest.java:36)

The error is occurring in the second line of zzRefill():

System.arraycopy(zzBuffer, zzStartRead, zzBuffer, 0, zzEndRead-zzStartRead);

I set a breakpoint to catch it before it erred, and the value of zzEndRead is 0
and the value of zzStartRead is 1. Thus the error.
 
I was being clever and made a custom reader using competing threads against a
SAX parser. I probably did it more to see if I could than for any valid
reason, so I will probably just simplify my approach and use the parser to pull
a complete string and use a string reader like you suggest.
 
Thanks for the help.
 
Ed


- Original Message 
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, October 7, 2008 5:12:05 AM
Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar


If you capture the exact text produced by the reader, and wrap it in a  
StringReader and pass that to StandardAnalyzer, do you then see the  
same exception?

Can you post the full stack trace on 2.3.2?

Mike

Edwin Smith wrote:

> I upgraded to the latest, 2.3.2 and had the same problem, even
> though it was clearly a different lexer reading the text.
>
> I did find some problems with the reader I was using, and it now  
> reads some files that it didn't before, so it may still be some  
> reader problem I haven't identified, but the text coming in from it  
> looks correct to me, so I don't know.
>
> Very frustrating.
>
> Ed
>
>
>
> - Original Message 
> From: Edwin Smith <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, October 6, 2008 3:20:51 PM
> Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar
>
> No particular reason. It is just what I had loaded last and hadn't  
> upgraded. It sounds like there might be good reason to do that now.
>
> Thanks for the tip.
>
> Ed
>
>
>
> - Original Message 
> From: Steven A Rowe <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, October 6, 2008 3:18:20 PM
> Subject: RE: ArrayIndexOutOfBoundsException in FastCharStream.readChar
>
> Hi Edwin,
>
> I don't know specifically what's causing the exception you're  
> seeing, but note that in Lucene 2.3.0+, the JavaCC-generated version  
> of StandardTokenizer (where your exception originates) has been  
> replaced with a JFlex-generated version.
>
> FYI, indexing speed was much improved in 2.3.0 over previous  
> versions -- up to 10 times faster, according to reports on this list  
> -- is there any particular reason you aren't using 2.3.2 (the most  
> recent release)?
>
> Steve
>
> On 10/06/2008 at 2:32 PM, Edwin Smith wrote:
>> Oh, and in case it matters, I'm using Lucene 2.2.0.
>>
>> Ed
>>
>>
>>
>> - Original Message 
>>
>>
>> I am stumped and have not seen any other reference to this
>> problem. I am getting the following exception on everything I
>> try to index. Does anyone know what my problem might be?
>>
>> Thanks,
>>
>> Ed
>>
>> java.lang.ArrayIndexOutOfBoundsException at
>> org.apache.lucene.analysis.standard.FastCharStream.readChar( at
>> org.apache.lucene.analysis.standard.FastCharStream.BeginToken( at
>> org.apache.lucene.analysis.standard.StandardTokenizerTokenMana
>> 

spellcheck: issues

2008-10-07 Thread Jason Rennie
Hello, I've been exploring usage of the spellcheck feature via solr 1.3.  I
have it working, but there are some issues I'm seeing that make it less
useful than it could be.  Response on the solr-user mailing list has been
limited.  I'm guessing the reason may be that I'm asking about issues which
are most relevant to the lucene codebase.  So, I hope you don't mind this
cross-posting.

I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

   1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a commit
   is going on and files are being rebuilt in the spellcheck data dir,
   spellcheck requests yield bogus answers.  I.e. I can issue identical
   requests and get drastically different answers.  The first time, I get
   suggestions and "correctlySpelled" is false.  The second time (during the
   commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
   spellcheck use the old index until the new one is ready for use, like solr
   does with optimizes?
   2. Inconsistent ordering - The first suggestion changes depending on the
   spellcheck.count that I specify.  If my query is "chanl" and I ask for one
   result, the suggestion is "chant" (freq. 16).  If I ask for 5 results, the
   first suggestion is also "chant"; the other 4 suggestions are less frequent
   (#2 is "chang", freq. 11).  However, if I ask for 10 results, the first
   suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and "chang"; #9
   is "chan" (freq. 174).  Shouldn't spellcheck always return the best
   suggestion first?  In my case, shouldn't "chanel" always top "chant" and
   "chang" since they all have the same edit distance yet "chanel" is two
   orders of magnitude more popular?

Is there anything I could be doing wrong to create these problems?  If not,
are these known issues?  If not, should I create JIRAs for them?

Thanks,

Jason


Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar

2008-10-07 Thread Michael McCandless


If you capture the exact text produced by the reader, and wrap it in a  
StringReader and pass that to StandardAnalyzer, do you then see the  
same exception?


Can you post the full stack trace on 2.3.2?

Mike

Edwin Smith wrote:

I upgraded to the latest, 2.3.2 and had the same problem, even
though it was clearly a different lexer reading the text.


I did find some problems with the reader I was using, and it now  
reads some files that it didn't before, so it may still be some  
reader problem I haven't identified, but the text coming in from it  
looks correct to me, so I don't know.


Very frustrating.

Ed



- Original Message 
From: Edwin Smith <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, October 6, 2008 3:20:51 PM
Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar

No particular reason. It is just what I had loaded last and hadn't  
upgraded. It sounds like there might be good reason to do that now.


Thanks for the tip.

Ed



- Original Message 
From: Steven A Rowe <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, October 6, 2008 3:18:20 PM
Subject: RE: ArrayIndexOutOfBoundsException in FastCharStream.readChar

Hi Edwin,

I don't know specifically what's causing the exception you're  
seeing, but note that in Lucene 2.3.0+, the JavaCC-generated version  
of StandardTokenizer (where your exception originates) has been  
replaced with a JFlex-generated version.


FYI, indexing speed was much improved in 2.3.0 over previous  
versions -- up to 10 times faster, according to reports on this list  
-- is there any particular reason you aren't using 2.3.2 (the most  
recent release)?


Steve

On 10/06/2008 at 2:32 PM, Edwin Smith wrote:

Oh, and in case it matters, I'm using Lucene 2.2.0.

Ed



- Original Message 


I am stumped and have not seen any other reference to this
problem. I am getting the following exception on everything I
try to index. Does anyone know what my problem might be?

Thanks,

Ed

java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.lucene.analysis.standard.FastCharStream.readChar(FastCharStream.java:46)
    at org.apache.lucene.analysis.standard.FastCharStream.BeginToken(FastCharStream.java:79)
    at org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.getNextToken(StandardTokenizerTokenManager.java:1180)
    at org.apache.lucene.analysis.standard.StandardTokenizer.jj_ntk(StandardTokenizer.java:158)
    at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:36)
    at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:41)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:107)
    at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:219)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:95)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:1013)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1001)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
    at com.affinovate.v4.server.search.Indexer.index(Indexer.java:61)
    at com.affinovate.v4.server.search.Indexer.perform(Indexer.java:93)
    at com.affinovate.v4.server.db.TaskQueue.run(TaskQueue.java:115)
    at java.lang.Thread.run(Thread.java:619)









Re: Single searcher vs Multi Searcher

2008-10-07 Thread Ganesh

Hello Anshum,

My intention is to shard the index after every 7 days (week). After 30 days
(4th week), the first DB may get deleted. At any point of time I will be
maintaining 3 to 4 DBs.


I want to know the pros and cons of the MultiSearcher or Index sharding 
approach. Any web links would be helpful.


Regards
Ganesh

- Original Message - 
From: "Anshum" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, October 07, 2008 12:18 AM
Subject: Re: Single searcher vs Multi Searcher



Hi Ganesh,
About the memory consumption while sorting, it would end up using similar
amounts, perhaps even more.. like in the case of regular parallel
programming algorithms (hoping that you intend to search using a parallel
multi searcher). Would you have to query particular indexes only for a
particular search or would you be searching over all the indexes and then
follow it up with a merge (which the parallel multi searcher would do
efficiently)?
Also, I guess 30 indexes would be a little too many; I haven't really tried
out that many indexes with a multisearcher.
As far as maintenance of the DB is concerned, it might be easy as long as you
don't have any document updates; if you do, you'd have to shift the
documents from one DB/index to another (which includes creating an entry in
the latest index/DB and deleting the record from the older DB).
I guess you'd have to pilot it; in case memory is an issue in your case and
not speed, you could try a regular multisearcher instead of a parallel
multisearcher.
I guess when you say maintenance of the DB gets easier, you mean that the
data in each individual table is controlled (but remember there could be
other bigger hassles, like the one mentioned above about moving data between
indexes/DB).

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Mon, Oct 6, 2008 at 10:06 AM, Ganesh <[EMAIL PROTECTED]> wrote:


Hello Anshum,

My index is growing by 1 million documents per day. Initially I planned to
have a single database, but sorting on one or more fields consumes more
RAM. Would sharding the index also consume the same?

My application should co-exist with other applications of my product, and my
app could get 1 GB of RAM. Search speed is fine, but I need to display the
results in sorted order.

I thought to keep 7 days of documents in one index and create one more
after the 7 days. After 30 days the first index may get deleted. I need to
keep the documents in the index DB for 30 days. My index DB is on HDD.

I want to know the pros and cons of sharding. I think maintenance of the DB
becomes easier.

It would be very much helpful, if you share some of your thoughts.

Regards
Ganesh


- Original Message - From: "Anshum" <[EMAIL PROTECTED]>
To: 
Sent: Friday, October 03, 2008 9:48 PM
Subject: Re: Single searcher vs Multi Searcher



 Hi Ganesh,


I have experimented with sharded indexes and they seem to benefit me (at least
in my case). I would like to know a few things before I answer your
question:
1. Do you have a reasonable criterion (a calculated one) to shard the
indexes?
2. How do you plan to split the index? Is it going to be document based
(which I guess it should be, as otherwise you would have to build a complete
distributed system)?
3. Do you plan to put your indexes in RAM or on (physically) separate
HDDs?

Though all said and done, sharded indexes are a good approach, if done the
right way.
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Fri, Oct 3, 2008 at 3:01 PM, Ganesh <[EMAIL PROTECTED]> wrote:

 Hello all,


My indexing is growing by 1 million records per day and the memory
consumption of the searcher object is quite high.

There are different opinions in the group. A few suggest using a single
database and a few suggest sharding. My database has 10 million records now,
and it might go up to 30 million or more. I plan to shard the index, but
will a MultiSearcher give me any benefit?

Regards
Ganesh
















Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar

2008-10-07 Thread Edwin Smith
I found it. My reader was returning 0 at the end of the stream instead of -1. 
Doh.
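
For the archives, a minimal sketch of the contract that was violated (class and field names are hypothetical): Reader.read(char[], int, int) must return -1 at end of stream, never 0, or StandardTokenizerImpl's zzRefill() keeps treating the call as "0 chars read" and its buffer offsets drift out of bounds.

import java.io.IOException;
import java.io.Reader;

class ParsedContentReader extends Reader {
    private final String content; // text pulled from the parser up front
    private int pos = 0;

    ParsedContentReader(String content) { this.content = content; }

    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pos >= content.length()) {
            return -1; // NOT 0: -1 is how a Reader signals end of stream
        }
        int n = Math.min(len, content.length() - pos);
        content.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    public void close() {}
}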
 
Thanks again for the suggestions. They did ultimately lead me to the right 
answer.
 
Ed



- Original Message 
From: Edwin Smith <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, October 7, 2008 10:43:06 AM
Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar

Thanks for the tip. I tried your experiment and, sure enough, it works just 
fine, so it's not the contents but obviously some other behavior of my custom 
reader. (Does the analyzer require that "mark" and "reset" be implemented, for 
example? I did not implement them.)

The stack trace is as follows:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:366)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:573)
    at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:139)
    at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:42)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1522)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1412)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1121)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2442)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2424)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1464)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1442)
    at com.affinovate.v4.server.search.ServerTest.main(ServerTest.java:36)

The error is occurring in the second line of zzRefill():

System.arraycopy(zzBuffer, zzStartRead, zzBuffer, 0, zzEndRead-zzStartRead);

I set a breakpoint to catch it before it erred, and the value of zzEndRead is 0
and the value of zzStartRead is 1. Thus the error.
 
I was being clever and made a custom reader using competing threads against a
SAX parser. I probably did it more to see if I could than for any valid
reason, so I will probably just simplify my approach and use the parser to pull
a complete string and use a string reader like you suggest.
 
Thanks for the help.
 
Ed


- Original Message 
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, October 7, 2008 5:12:05 AM
Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar


If you capture the exact text produced by the reader, and wrap it in a  
StringReader and pass that to StandardAnalyzer, do you then see the  
same exception?

Can you post the full stack trace on 2.3.2?

Mike

Edwin Smith wrote:

> I upgraded to the latest, 2.3.2 and had the same problem, even
> though it was clearly a different lexer reading the text.
>
> I did find some problems with the reader I was using, and it now  
> reads some files that it didn't before, so it may still be some  
> reader problem I haven't identified, but the text coming in from it  
> looks correct to me, so I don't know.
>
> Very frustrating.
>
> Ed
>
>
>
> - Original Message 
> From: Edwin Smith <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, October 6, 2008 3:20:51 PM
> Subject: Re: ArrayIndexOutOfBoundsException in FastCharStream.readChar
>
> No particular reason. It is just what I had loaded last and hadn't  
> upgraded. It sounds like there might be good reason to do that now.
>
> Thanks for the tip.
>
> Ed
>
>
>
> - Original Message 
> From: Steven A Rowe <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, October 6, 2008 3:18:20 PM
> Subject: RE: ArrayIndexOutOfBoundsException in FastCharStream.readChar
>
> Hi Edwin,
>
> I don't know specifically what's causing the exception you're  
> seeing, but note that in Lucene 2.3.0+, the JavaCC-generated version  
> of StandardTokenizer (where your exception originates) has been  
> replaced with a JFlex-generated version.
>
> FYI, indexing speed was much improved in 2.3.0 over previous  
> versions -- up to 10 times faster, according to reports on this list  
> -- is there any particular reason you aren't using 2.3.2 (the most  
> recent release)?
>
> Steve
>
> On 10/06/2008 at 2:32 PM, Edwin Smith wrote:
>> Oh, and in case it matters, I'm using Lucene 2.2.0.
>>
>> Ed
>>
>>
>>
>> - Original Message 
>>
>>
>> I am stumped and have not seen any other reference to this
>> problem. 

Only last field indexed

2008-10-07 Thread John Griffin
Guys,

I'm adding multiple fields with the same name to a document as Store.YES,
Index.TOKENIZED, and it seems that only the last field entered is indexed.
I read about this somewhere here but now I can't find it, naturally. Is there
a workaround? Does someone have a pointer to this discussion? Can someone
help?

Thanks in advance.

John G.


Re: Re-tokenized fields disappear

2008-10-07 Thread John G

Thanks Erick,

Yes PerFieldAnalyzerWrapper is my friend :>).

Another related question, I'm putting these values into a document in fields
with the same name. 'codesearch' e.g.

"codesearch", "B05 1"
"codesearch", "Q070301 4" etc.

I read where only the last field entered is actually indexed but I can't
find that post now. Is this true? How can I get around it?

Thanks again.

John G.


John Griffin-3 wrote:
> 
> My previous question may be moot but as is it is still a problem. Here's a
> little more info on my problem. The same named fields contain two pieces
> of
> information, a code "B05" and a value "1" as follows. The value can be a
> range such as 1 to 5 or 1 to 100.
> 
>  
> 
> "codesearch", "B05 1"
> 
>  
> 
> This field and other identically named but differently valued fields in
> the
> same document are related to a specific person as identified by another
> field say SSN. So, one person can have multiple code searches. Both of the
> codesearch values are related to one another and must be searchable such
> as 
> 
>  
> 
> Return all persons with a codesearch value of B05 ranging from 1 to 3. 
> 
>  
> 
> How can I go about this? Do these codesearch fields need to be in a
> separate
> index related by SSN? 
> 
>  
> 
> Thanks in advance.
> 
>  
> 
> John G.
> 
> 
> 






Re: Only last field indexed

2008-10-07 Thread Erick Erickson
Let's see the indexing code. It is perfectly reasonable to
add data to a field multiple times, so I suspect you're
doing something wrong.

What evidence do you have that it's only the last field that's
indexed?
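
For reference, a minimal sketch (Lucene 2.x field API) of the pattern that should work - both values end up indexed and stored:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MultiValueSketch {
    public static Document twoValues() {
        Document doc = new Document();
        doc.add(new Field("codesearch", "B05 1",
                          Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("codesearch", "Q070301 4",
                          Field.Store.YES, Field.Index.TOKENIZED));
        // After writer.addDocument(doc), tokens from both values are searchable
        // and doc.getValues("codesearch") returns both stored strings.
        return doc;
    }
}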

Best
Erick

On Tue, Oct 7, 2008 at 1:28 PM, John Griffin <[EMAIL PROTECTED]>wrote:

> Guys,
>
> I'm adding multiple fields with the same name to a document as Store.YES,
> Index.TOKENIZED, and it seems that only the last field entered is indexed.
> I read about this somewhere here but now I can't find it, naturally. Is
> there
> a workaround? Does someone have a pointer to this discussion? Can someone
> help?
>
> Thanks in advance.
>
> John G.
>


Re: Re-tokenized fields disappear

2008-10-07 Thread Erick Erickson
See below (and your other mail)

On Tue, Oct 7, 2008 at 1:59 PM, John G <[EMAIL PROTECTED]> wrote:

>
> Thanks Erick,
>
> Yes PerFieldAnalyzerWrapper is my friend :>).
>
> Another related question, I'm putting these values into a document in
> fields
> with the same name. 'codesearch' e.g.
>
> "codesearch", "B05 1"
> "codesearch", "Q070301 4" etc.
>
> I read where only the last field entered is actually indexed but I can't
> find that post now. Is this true? How can I get around it?


No, that's not true. That's what PositionIncrementGap is all about, handling
this very situation.
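
A small sketch of that hook, assuming the Lucene 2.x Analyzer API (WhitespaceAnalyzer is just a stand-in delegate and 100 is an arbitrary gap):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class GappedAnalyzer extends Analyzer {
    private final Analyzer delegate = new WhitespaceAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    // Tokens of successive "codesearch" values land 100 positions apart, so
    // phrase/position queries can't match across the value boundary; every
    // value is still indexed and searchable.
    public int getPositionIncrementGap(String fieldName) {
        return "codesearch".equals(fieldName) ? 100 : 0;
    }
}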


>
> Thanks again.
>
> John G.
>
>
> John Griffin-3 wrote:
> >
> > My previous question may be moot but as is it is still a problem. Here's
> a
> > little more info on my problem. The same named fields contain two pieces
> > of
> > information, a code "B05" and a value "1" as follows. The value can be a
> > range such as 1 to 5 or 1 to 100.
> >
> >
> >
> > "codesearch", "B05 1"
> >
> >
> >
> > This field and other identically named but differently valued fields in
> > the
> > same document are related to a specific person as identified by another
> > field say SSN. So, one person can have multiple code searches. Both of
> the
> > codesearch values are related to one another and must be searchable such
> > as
> >
> >
> >
> > Return all persons with a codesearch value of B05 ranging from 1 to 3.
> >
> >
> >
> > How can I go about this? Do these codesearch fields need to be in a
> > separate
> > index related by SSN?
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > John G.
> >
> >
> >
>
>
>
>
>


Re: advice on using Lucene for sorting based on payloads

2008-10-07 Thread Grant Ingersoll

Not sure if I fully get it, but bear with me...

Inline below.

On Oct 6, 2008, at 11:37 PM, Alexander Devine wrote:


Hi Luceners,

I have a particular sorting problem and I wanted some advice on what the
best implementation approach would be. We currently use Lucene as the
searching engine for our vacation rental website. Each vacation rental
property is represented by a single document in Lucene. We want to add a
feature that allows users to sort results by price. The problem is that each
rental property can potentially have a different price for each day. For
example, many rental properties charge more on weekends, or higher rates for
on-season vs. off-season. When a user performs a search, they can specify
the start and end dates of their desired travel, so we should be able to
calculate the total price for that time period for each property by adding
up the price for each day.

I was thinking of implementing this using payloads. Each document would have
a "priceByDate" field and there would be one term stored for each day for
the next 2 years (which is as far out as we support booking a property).


Don't you then have to update every document every day?



The payload associated with each term would be the price, and then when
searching I could use those payloads as the basis for scoring the docs using a
BoostingTermQuery. For example, suppose someone was searching for travel
dates Dec 1 - Dec 5, I would create a query like so:

BooleanQuery query = new BooleanQuery();
query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081201")),
          BooleanClause.Occur.MUST);
query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081202")),
          BooleanClause.Occur.MUST);
query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081203")),
          BooleanClause.Occur.MUST);
query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081204")),
          BooleanClause.Occur.MUST);

PriceBoostingTermQuery would be a subclass of BoostingTermQuery that
overrides getSimilarity() to return a Similarity with a custom scorePayload
method that scores based on the price.

Does this approach sound reasonable? Can anyone think of a better approach?
One thing I don't understand is that the score needs to be the SUM (or the
average) of all the payloads - how does the BooleanQuery handle that? Also,
I need to get the calculated total price back to the caller, and I'm worried
that making priceByDate a stored field will have negative performance
implications. Perhaps there is some way I could just return the calculated
price as the score and then get it from the ScoreDoc?

Thanks for any and all help, and a huge thank you to all the Lucene devs for
a great product. The reason I'm trying to solve this problem in Lucene
instead of a database is because Lucene is so much faster for our queries
and I don't want to add a DB into the mix.




I think you would be better off with a Function Query (see the
org.apache.lucene.search.function package), but I am not sure.


How do you calculate the cost of the rental?  Is there some way to  
just factor that into the scoring process?  I think if you could do  
this, then you could implement a Function Query to do so.
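
For instance, a rough sketch (hedged: this assumes the org.apache.lucene.search.function package as it stands in Lucene 2.3+, and it deliberately simplifies to one flat per-document rate field instead of per-day rates; "nightlyRate" and numNights are made-up names):

import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class PriceQuerySketch {
    public static Query pricedQuery(Query availability, final int numNights) {
        // Reads the per-document rate out of the FieldCache-backed field.
        FieldScoreQuery rate =
            new FieldScoreQuery("nightlyRate", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(availability, rate) {
            public float customScore(int doc, float subQueryScore, float rateValue) {
                // Score by total price so ScoreDoc.score doubles as the quote;
                // ranking by score is then the same as ranking by price.
                return rateValue * numNights;
            }
        };
    }
}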


You might look at Solr's function query capabilities as well.



Alex


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: advice on using Lucene for sorting based on payloads

2008-10-07 Thread Alexander Devine
Thanks very much for your response, and for pointing me in the direction
towards Function Queries - you saved me a ton of time! You're right, that
seems to be a much better fit for what I am doing. My responses are below.

On Tue, Oct 7, 2008 at 9:11 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Not sure if I fully get it, but bear with me...
>
> Inline below.
>
> On Oct 6, 2008, at 11:37 PM, Alexander Devine wrote:
>
>  Hi Luceners,
>>
>> I have a particular sorting problem and I wanted some advice on what the
>> best implementation approach would be. We currently use Lucene as the
>> searching engine for our vacation rental website. Each vacation rental
>> property is represented by a single document in Lucene. We want to add a
>> feature that allows users to sort results by price. The problem is that
>> each
>> rental property can potentially have a different price for each day. For
>> example, many rental properties charge more on weekends, or higher rates
>> for
>> on-season vs. off-season. When a user performs a search, they can specify
>> the start and end dates of their desired travel, so we should be able to
>> calculate the total price for that time period for each property by adding
>> up the price for each day.
>>
>> I was thinking of implementing this using payloads. Each document would
>> have
>> a "priceByDate" field and there would be one term stored for each day for
>> the next 2 years (which is as far out as we support booking a property).
>>
>
> Don't you then have to update every document every day?
>

No, I should have been more clear. At any point in time a property has at
most 2 years of future pricing data, and as each day passes there is a day
less of pricing data. We update our documents when our rental owners make a
change to their data, and we also have some automated processes that cause
rental data to be updated, so in general every document gets updated about
every 1-2 months.

>
>
>  The
>> payload associated with each term would be the price, and then when
>> searching I could use those payloads as basis for scoring the docs using a
>> BoostingTermQuery. For example, suppose someone was searching for travel
>> dates Dec 1 - Dec 5, I would create a query like so:
>>
>> BooleanQuery query = new BooleanQuery();
>> query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081201")),
>> BooleanClause.Occur.MUST);
>> query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081202")),
>> BooleanClause.Occur.MUST);
>> query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081203")),
>> BooleanClause.Occur.MUST);
>> query.add(new PriceBoostingTermQuery(new Term("priceByDate", "20081204")),
>> BooleanClause.Occur.MUST);
>>
>> PriceBoostingTermQuery would be a subclass of BoostingTermQuery that
>> overrides getSimilarity() to return a Similarity with a custom
>> scorePayload
>> method that scores based on the price.
>>
>> Does this approach sound reasonable? Can anyone think of a better
>> approach?
>> One thing I don't understand is that the score needs to be the SUM (or the
>> average) of all the payloads - how does the BooleanQuery handle that?
>> Also,
>> I need to get the calculated total price back to the caller, and I'm
>> worried
>> that making priceByDate a stored field will have negative performance
>> implications. Perhaps there is some way I could just return the calculated
>> price as the score and then get it from the ScoreDoc?
>>
>> Thanks for any and all help, and a huge thank you to all the Lucene devs
>> for
>> a great product. The reason I'm trying to solve this problem in Lucene
>> instead of a database is because Lucene is so much faster for our queries
>> and I don't want to add a DB into the mix.
>>
>>
>
> I think you would be better off with a Function Query (see the
> org.apache.lucene.search.function package), but I am not sure.
>
> How do you calculate the cost of the rental?  Is there some way to just
> factor that into the scoring process?  I think if you could do this, then
> you could implement a Function Query to do so.


Yep, after looking at the function package this looks exactly like what I
want to do. Essentially each rental has a set of "rate periods" which
specify the rates between specific dates, e.g.


We want to look at the intersection between the user's desired travel dates
and the ratePeriods to determine the total cost. The thing I like about the
Function queries is that this lets us get as arbitrarily complex as we want with
our business rules for the price calculation, and we should be able to tweak
these rules to trade off between quoted price accuracy, implementation
simplicity, and performance.
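
For what it's worth, a sketch of that intersection arithmetic (all names hypothetical; times are day-granularity epoch millis and periods are half-open):

public class QuoteSketch {
    static class RatePeriod {
        long startMillis, endMillis; // half-open [start, end)
        long centsPerNight;
    }

    public static long totalCents(long checkIn, long checkOut, RatePeriod[] periods) {
        long total = 0;
        for (RatePeriod p : periods) {
            long from = Math.max(checkIn, p.startMillis); // clip period to the stay
            long to = Math.min(checkOut, p.endMillis);
            if (to > from) {
                long nights = (to - from) / 86400000L;    // millis per day
                total += nights * p.centsPerNight;
            }
        }
        return total;
    }
}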

Thanks,
Alex

>
>
> You might look at Solr's function query capabilities as well.
>

>
>  Alex
>>
>
> --
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>