Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-05-31 Thread Markus Wiederkehr
On 5/31/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Matt Quail wrote:
> > I have wondered about this as well. Are there any *sure fire* ways of creating (and updating) two indices so that doc numbers in one index deliberately correspond to doc numbers in the other index?
>
> If you add t

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Erik, Thanks for your info. No, I haven't tried it yet. I will give it a try and maybe produce a Chinese/English text search demo online. Currently I use Lucene as the indexing engine for Velocity mailing list search. I have a demo at www.jhsystems.net. It is yet another mailing list s

Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Robert, I'm very likely going to be using DSpace and some related technologies from the SIMILE project very soon :) On May 31, 2005, at 5:08 PM, Tansley, Robert wrote: Hi all, The DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text

Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and separate CJK characters into separate tokens also. Erik On May 31, 2005, at 5:49 PM, jian chen wrote: Hi, Interesting topic. I thought about this

Re: Adding to the termFreqVector

2005-05-31 Thread Ryan Skow
Adding new terms and re-indexing the document is the desired behavior. One (non-scalable) solution would be to parse the toString of the termFreqVector (freq {myTermField: red/2, green/1, blue/1}) and create a new string representation of the expanded terms: (red red green blue) This obviously
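
The expansion described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `TermExpander` and the `Map` input are hypothetical stand-ins for whatever is read out of the term vector, and no Lucene classes are used. Rebuilding text from frequencies keeps the counts but not the original term positions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: rebuilds an indexable string
// from term -> frequency pairs such as {red=2, green=1, blue=1}.
class TermExpander {
    // Repeat each term as many times as its frequency, separated by spaces.
    static String expand(Map<String, Integer> freqs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is predictable.
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("red", 2);
        freqs.put("green", 1);
        freqs.put("blue", 1);
        freqs.merge("yellow", 1, Integer::sum); // add a new term before re-indexing
        System.out.println(expand(freqs));
    }
}
```

The resulting string would then be fed back through the analyzer when the document is re-added.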

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input, and

Indexing multiple languages

2005-05-31 Thread Tansley, Robert
Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the mailing list archives etc. and

Re: Adding to the termFreqVector

2005-05-31 Thread Grant Ingersoll
Is your intent to persist the changed vector somehow or just use it in your application for the immediate search? TermFreqVector is an interface, so if you aren't persisting, I would write a wrapper class around the one that is returned by Lucene that has add/set methods on it for manipulating the

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Chris Hostetter
have you tried the suggestion i made regarding FieldCache from the first thread in which you asked this question? http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/[EMAIL PROTECTED] : Date: Tue, 31 May 2005 11:42:46 -0700 : From: Kevin Burton <[EMAIL PROTECTED]> : Reply-T

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
Andrew Boyd wrote: How about using range query? private Term begin, end; begin = new Term("dateField", DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">))); end = new Term("dateField", DateTools.dateToString(Date.valueOf(<"farFutureStringDate">))); Ha.. crap. That won't wor

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Chris Lamprecht
Lucene rewrites RangeQueries into a BooleanQuery containing a bunch of OR'd terms. If you have too many terms (dates in your case), you will run into a TooManyClauses exception. I think the default is about 1024; you can set it with BooleanQuery.setMaxClauseCount(). On 5/31/05, Kevin Burton <[EM
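
To see why a date range can hit that limit, here is a rough plain-Java model of the expansion. It does not call Lucene: `RangeExpansion`, `clauseCount` and `fits` are hypothetical names, and `MAX_CLAUSES` simply reuses the "about 1024" figure from the message above. Each distinct indexed term inside the range counts as one OR'd clause.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of range expansion; not Lucene code.
class RangeExpansion {
    // Assumed limit, taken from the "about 1024" figure quoted above.
    static final int MAX_CLAUSES = 1024;

    // Number of distinct terms in [lower, upper]; each would become one clause.
    static int clauseCount(NavigableSet<String> terms, String lower, String upper) {
        return terms.subSet(lower, true, upper, true).size();
    }

    // True when the expanded query would stay within the clause limit.
    static boolean fits(NavigableSet<String> terms, String lower, String upper) {
        return clauseCount(terms, lower, upper) <= MAX_CLAUSES;
    }

    public static void main(String[] args) {
        // 2000 distinct, zero-padded values standing in for indexed date terms.
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add(String.format("%08d", i));
        }
        System.out.println(fits(terms, "00000000", "00000099")); // narrow range
        System.out.println(fits(terms, "00000000", "00001999")); // whole index
    }
}
```

The count depends on how many distinct values the field holds, so indexing dates at a coarser resolution (for example per day rather than per millisecond) is one way to keep the number of clauses down.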

Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-05-31 Thread Doug Cutting
Matt Quail wrote: I have a similar problem, for which ParallelReader looks like a good solution -- except for the problem of creating a set of indices with matching document numbers. I have wondered about this as well. Are there any *sure fire* ways of creating (and updating) two indices so

Re: Stemming at Query time

2005-05-31 Thread Daniel Naber
On Monday 30 May 2005 18:54, Andrew Boyd wrote:
> Now that the QueryParser knows about position increments has anyone used this to do stemming at query time and not at indexing time? I suppose one would need a reverse stemmer. Given the query breath it would need to inject breathe, breat

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
Andrew Boyd wrote: How about using range query? private Term begin, end; begin = new Term("dateField", DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">))); end = new Term("dateField", DateTools.dateToString(Date.valueOf(<"farFutureStringDate">))); RangeQuery query = new RangeQ

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Tony Schwartz
Only way I see to do this is to get a TermEnum for that field, and grab the first. Then iterate until you find the last one. This is similar behavior to the TermEnum.skipTo method. A better solution would be to record the minimum and maximum dates in the index as you index them. Each time yo
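
The second suggestion, recording the bounds while indexing, might look like the following plain-Java sketch. `MinMaxTracker` is a hypothetical helper, not a Lucene class, and it assumes the date values are stored as strings whose lexicographic order matches chronological order (for example `yyyyMMdd`).

```java
// Hypothetical helper: tracks the smallest and largest field value seen
// while documents are being added to the index.
class MinMaxTracker {
    private String min;
    private String max;

    // Call once per indexed document with that document's date value.
    void observe(String value) {
        if (min == null || value.compareTo(min) < 0) {
            min = value;
        }
        if (max == null || value.compareTo(max) > 0) {
            max = value;
        }
    }

    String min() {
        return min;
    }

    String max() {
        return max;
    }

    public static void main(String[] args) {
        MinMaxTracker tracker = new MinMaxTracker();
        tracker.observe("20050531");
        tracker.observe("20041201");
        tracker.observe("20050601");
        System.out.println(tracker.min() + " .. " + tracker.max());
    }
}
```

This sketch only ever widens the bounds. If documents are deleted, the recorded minimum or maximum may become stale and would need to be recomputed, for instance by walking the field's terms once.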

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Andrew Boyd
How about using range query? private Term begin, end; begin = new Term("dateField", DateTools.dateToString(Date.valueOf(<"backInTimeStringDate">))); end = new Term("dateField", DateTools.dateToString(Date.valueOf(<"farFutureStringDate">))); RangeQuery query = new RangeQuery(begin, end, true)

Re: Indexing multiple keywords in one field?

2005-05-31 Thread Erik Hatcher
On May 31, 2005, at 4:06 AM, Paul Libbrecht wrote: Le 30 mai 05, à 22:13, Doug Hughes a écrit : Ok, so more than one keyword can be stored in a keyword field. Interesting! Yes, yes, yes!! You can do: doc.add("link","xx"); doc.add("link","yy"); Well, that's not quite correct API, but

Re: Indexing multiple keywords in one field?

2005-05-31 Thread Paul Libbrecht
Le 30 mai 05, à 22:13, Doug Hughes a écrit : Ok, so more than one keyword can be stored in a keyword field. Interesting! Yes, yes, yes!! You can do: doc.add("link","xx"); doc.add("link","yy"); and matches will match any of them! I found this in the book and not in the javadoc and I'd recomm

Re: Stemming at Query time

2005-05-31 Thread Paul Libbrecht
You'd only need position-increment if using phrase-query... otherwise... positions are pretty much ignored and you can expand the query with an or. E.g., I'd expand the query for breath to: Term(breath)^2 or (Term(breathes) or Term(breathe) or Term(breathing)) I am not sure you can make a phra
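
As a rough illustration of that expansion, the plain-Java sketch below builds a query string with the original term boosted and its variants OR'd together. `QueryExpander` is a hypothetical name, and the list of variants is supplied by hand here; producing it automatically is the "reverse stemmer" problem raised in this thread, which this sketch does not solve.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: expands one term into a boosted OR query string.
class QueryExpander {
    // Boost the original term and OR it with the supplied variants.
    static String expand(String term, List<String> variants) {
        String boosted = term + "^2";
        if (variants.isEmpty()) {
            return boosted;
        }
        return boosted + " OR (" + String.join(" OR ", variants) + ")";
    }

    public static void main(String[] args) {
        // Variants are hard-coded for illustration only.
        List<String> variants = Arrays.asList("breathes", "breathe", "breathing");
        System.out.println(expand("breath", variants));
    }
}
```

The resulting string could then be handed to a query parser, or the same structure could be built directly from term and boolean query objects.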

Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
I have an index with a date field. I want to quickly find the minimum and maximum values in the index. Is there a quick way to do this? I looked at using TermInfos and finding the first one but how do I find the last? I also tried the new sort API and the performance was horrible :-/ Any i