Re: snowball (english) and filenames

2007-05-17 Thread Doron Cohen
> a.b.c.d.e.f.g.h is not broken apart like how the snowball demo > indicates it should do. I am not sure about the "should" here - the way I see it, this is just how the demo works: Snowball stemmers operate on words, so the demo first breaks the input text into words and only then applies stemmin

Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Stephen Gray
Thanks. If the extra memory allocated is native memory I don't think jconsole includes it in "non-heap" as it doesn't show this as increasing, and jmap/jhat just dump/analyse the heap. Do you know of an application that can report native memory usage? Thanks, Steve Doron Cohen wrote: Stephen

Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Doron Cohen
Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01: > One interesting thing is that although the memory allocated as > reported by the processes tab of Windows Task Manager goes up and up, > and the JVM eventually crashes with an OutOfMemory error, the total size > of heap + non-heap as

Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Stephen Gray
Hi Otis, Thanks very much for your reply. I've removed the LuceneIndexAccessor code, and still have the same problem, so that at least rules out LuceneIndexAccessor as the source. maxBufferedDocs is just set to the default, which I believe is 10. I've tried jconsole, + jmap/jhat for looking

Re: snowball (english) and filenames

2007-05-17 Thread Arnold Leung
On 16-May-07, at 11:00 PM, Doron Cohen wrote: If you enter a.b.c.d.e.f.g.h to that demo you'll see that the demo simply breaks the input text on '.' - that has nothing to do with filenames. That is not what I am seeing from my testing: a.b.c.d.e.f.g.h is not broken apart like how the snowbal

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Andreas Guther
I found a similar recommendation about the disc access and reading in order in the following message and implemented this in my code: http://www.gossamer-threads.com/lists/lucene/general/28268#28268 Since I am dealing with multiple index directories I sorted the document references by index numbe

Re: How can I limit the number of hits in my query?

2007-05-17 Thread David Leangen
Thank you, Erick, this is very useful! Have you ever taken a look at Google Suggest[1]? It's very fast, and the results are impressive. I think your suggestion will go a long way to fixing my problem, but there's probably still quite a gap between this approach and the kind of results that Google

Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Otis Gospodnetic
Hi Steve, You said the OOM happens only when you are indexing. You don't need LuceneIndexAccess for that, so get rid of that to avoid one suspect that is not part of Lucene core. What is your maxBufferedDocs set to? And since you are using JVM 1.6, check out jmap, jconsole & friends, they'll

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Otis Gospodnetic
- Original Message From: Paul Elschot <[EMAIL PROTECTED]> On Thursday 17 May 2007 08:10, Andreas Guther wrote: > I am currently exploring how to solve performance problems I encounter with > Lucene document reads. > > We have amongst other fields one field (default) storing all searchabl

Re: How to ignore scoring for a Query?

2007-05-17 Thread Otis Gospodnetic
Scoring cannot be turned off, currently. I once thought it was possible to skip scoring with the patch in the LUCENE-584 JIRA issue, but I was wrong. Otis

Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Daniel Einspanjer
Oops. I do indeed have omitNorms turned on. I will re-read the documentation on it and look at turning it off. Sorry for the bother. :/ On 5/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Terminator 2 : Terminator 2: Judgment Day : : And I score them against the query +title:(Terminator

Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Chris Hostetter
: Terminator 2 : Terminator 2: Judgment Day : : And I score them against the query +title:(Terminator 2) : Would there be some method or combination of methods in Similarity : that I could easily override to allow me to penalize the second item : because it had "unused terms"? that's what the De
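One way extra terms in a field can lower a score is field length normalization. The following is a minimal stand-alone sketch, not Lucene code: it assumes the classic 1/sqrt(numTerms) formula, and the class name, method names, and whitespace tokenization are hypothetical choices made for illustration.

```java
// Hypothetical stand-alone sketch; it does not call into Lucene.
final class LengthNormSketch {

    // Assumption: length normalization of 1 / sqrt(numTerms),
    // so a field with more terms receives a smaller factor.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Counts whitespace-separated tokens. A real analyzer may tokenize
    // differently, so treat the counts as illustrative only.
    static int countTerms(String fieldText) {
        String trimmed = fieldText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Under these assumptions "Terminator 2" counts as two terms and "Terminator 2: Judgment Day" as four, so the shorter title gets the larger factor. If norms are omitted for the field, no such factor is applied and both titles would score the same for this query.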

Is it possible to use a custom similarity class to cause extra terms in a field to lower score?

2007-05-17 Thread Daniel Einspanjer
If I have two items in an index: Terminator 2 Terminator 2: Judgment Day And I score them against the query +title:(Terminator 2) they come up with the same score (which makes sense, it just isn't quite what I want) Would there be some method or combination of methods in Similarity that I could

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Mike Klaas
On 17-May-07, at 6:43 AM, Andreas Guther wrote: I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned

Re: date window

2007-05-17 Thread Chris Hostetter
: A particular document can have several date windows. : Give a specific date, only return those documents where that date : falls within at least one of those windows. : Also, note that there are multiple windows here for a single : document, we can't just search between min start and max end. T
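The matching rule quoted above can be stated as a small predicate. This is only a sketch of the intended semantics, not of how the windows would be indexed in Lucene; the class names are hypothetical and the windows are assumed to be inclusive at both ends.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical sketch of the "date falls within at least one window" rule.
final class DateWindowSketch {

    // One date window; both endpoints are treated as inclusive (an assumption).
    static final class Window {
        final LocalDate start;
        final LocalDate end;

        Window(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        boolean contains(LocalDate date) {
            return !date.isBefore(start) && !date.isAfter(end);
        }
    }

    // A document matches if any single window contains the date.
    static boolean matches(List<Window> windows, LocalDate date) {
        for (Window window : windows) {
            if (window.contains(date)) {
                return true;
            }
        }
        return false;
    }
}
```

A document with windows in January and March does not match a February date, even though that date lies between the earliest start and the latest end. That is why a single min-start/max-end range is not enough.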

Re: Indexing Open Office documents

2007-05-17 Thread Enis Soztutar
There is a parser for OpenOffice in Nutch. It is a plugin called parse-oo. You can find more information in the Nutch mailing lists. On 5/17/07, jim shirreffs <[EMAIL PROTECTED]> wrote: Anyone know how to add OpenOffice document to a Lucene index? Is there a parser for OpenOffice? thanks in

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Erick Erickson
h. Now that I re-read your first mail, something else suggests itself. You stated: "We have amongst other fields one field (default) storing all searchable fields". Do you need to store this field at all? You can search fields that are indexed but NOT stored. I've used something of the same

Indexing Open Office documents

2007-05-17 Thread jim shirreffs
Anyone know how to add OpenOffice document to a Lucene index? Is there a parser for OpenOffice? thanks in advance jim s.

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Andreas Guther
I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned for all fields a NO_LOAD which from my understanding is

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Erick Erickson
Some time ago I posted the results in my peculiar app of using FieldSelector, and it gave dramatic improvements in my case (a factor of about 10). I suspect much of that was peculiar to my index design, so your mileage may vary. See a thread titled... *Lucene 2.1, using FieldSelector speeds up

Re: Group the search results by a given field

2007-05-17 Thread Erick Erickson
There has been significant discussion on this topic (way more than I can remember clearly) on the mail thread, but as I remember it's been referred to as "facet" or "faceted". I think you would get a lot of info searching for these terms at... http://www.gossamer-threads.com/lists/lucene/java-use

How to ignore scoring for a Query?

2007-05-17 Thread Benjamin Pasero
Hi, I have two different use-cases for my queries. For the first, performance is not too critical and I want to sort the results by relevance (score). The second however, is performance critical, but the score for each result is not interesting. I guess, if it was possible to disable scoring for

Re: about to get

2007-05-17 Thread Grant Ingersoll
You can get it from a Hits object (see the id() method), or you can iterate over the docs from 0 to maxDoc - 1 (skipping deleted docs). I have some code at http://www.cnlp.org/apachecon2005/ that shows various usages for Term Vector. The Lucene in Action book has some good examples as well.
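The second option, scanning every document number and skipping deleted ones, might look roughly like the sketch below. ReaderView is a hypothetical stand-in interface that mirrors only the maxDoc() and isDeleted(int) calls of a Lucene IndexReader, so that the loop can be shown without a Lucene dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of scanning document numbers 0 .. maxDoc - 1.
final class DocNumberScanSketch {

    // Stand-in for the two reader calls the scan relies on (an assumption,
    // not the real IndexReader class).
    interface ReaderView {
        int maxDoc();
        boolean isDeleted(int docNum);
    }

    // Collects every document number that is not marked as deleted.
    static List<Integer> liveDocNumbers(ReaderView reader) {
        List<Integer> live = new ArrayList<Integer>();
        for (int docNum = 0; docNum < reader.maxDoc(); docNum++) {
            if (reader.isDeleted(docNum)) {
                continue; // deleted slots still occupy a number until they are merged away
            }
            live.add(docNum);
        }
        return live;
    }
}
```

Each collected number could then be passed to a call such as getTermFreqVector(docNum, "title").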

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Grant Ingersoll
I haven't tried compression either. I know there was some talk a while ago about deprecating, but that hasn't happened. The current implementation yields the highest level of compression. You might find better results by compressing in your application and storing as a binary field, thus
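Compressing in the application could be sketched with java.util.zip as below. The class and method names are hypothetical, and the sketch covers only the byte-level round trip; storing the resulting bytes as a binary field is left to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compress field bytes in the application at a chosen level.
final class AppCompressionSketch {

    // level is a Deflater constant such as Deflater.BEST_SPEED or
    // Deflater.BEST_COMPRESSION, letting the caller trade size for speed.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses compress(); malformed or truncated input raises IllegalArgumentException.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (!inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Whether a faster level actually improves document read times would need to be measured against the index in question.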

about to get

2007-05-17 Thread 童小军
Hi Luceners: I want to get the TermFreqVector, but I must get the docNum first: titleVector = reader.getTermFreqVector(docNum, "title"); But I can't get the docNum from a Lucene Document. How can I get the docNum using a Document object? Something like this: getTermFreqVector(doc, "title"); xiaojun tong 010-64

Group the search results by a given field

2007-05-17 Thread Sawan Sharma
Hi All, I was wondering - is it possible to search and group the results by a given field? For example, I have an index with several million records. Most of them are different Features of the same ID. I'd love to be able to do.. groupby=ID or something like that in the results, and provide the
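Absent built-in support, grouping can be done in the application after the search returns. The sketch below is hypothetical and does not use Lucene: it assumes each hit has already been reduced to an {ID, feature} pair, in rank order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group ranked hits by the value of their ID field.
final class GroupByFieldSketch {

    // Each hit is a two-element array {id, feature}. A LinkedHashMap keeps
    // groups in the order their first hit appeared, so the best-ranked ID
    // comes first.
    static Map<String, List<String>> groupById(List<String[]> hits) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String[] hit : hits) {
            List<String> bucket = groups.get(hit[0]);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                groups.put(hit[0], bucket);
            }
            bucket.add(hit[1]);
        }
        return groups;
    }
}
```

With several million records this approach only makes sense over a bounded number of top hits, since every hit has to be loaded to read its ID.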

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-17 Thread Paul Elschot
On Thursday 17 May 2007 08:10, Andreas Guther wrote: > I am currently exploring how to solve performance problems I encounter with > Lucene document reads. > > We have amongst other fields one field (default) storing all searchable > fields. This field can become of considerable size since we are

date window

2007-05-17 Thread James O'Rourke
Hi All, I've been thinking about this problem for some time now. I'm trying to figure out a way to store date windows in Lucene so that I can easily filter as follows. A particular document can have several date windows. Given a specific date, only return those documents where that date fa