NativeFSLockFactory problem

2006-10-18 Thread Frank Kunemann
Hi all, I'm trying to use the new class NativeFSLockFactory, but as you can guess I have a problem using it. Don't know what I'm doing wrong, so here is the code: FSDirectory dir = FSDirectory.getDirectory(indexDir, create, NativeFSLockFactory.getLockFactory()); logger.info("Index: "+indexDir.

Re: constructing smaller phrase queries given a multi-word query

2006-10-18 Thread Mekin Maheshwari
Resending, with the hope that the search gurus missed this. Would really appreciate any advice on this. Would not want to reinvent the wheel & I am sure this is something that would have been done. Thanks, mek On 10/16/06, Mek <[EMAIL PROTECTED]> wrote: Has anyone dealt with the problem of con

Re: index architectures

2006-10-18 Thread Doron Cohen
Not sure if this is the case, but you said "searchers", so might be it - you can (and should) reuse searchers for multiple/concurrent queries. IndexSearcher is thread-safe, so no need to have a different searcher for each query. Keep using this searcher until you decide to open a new searcher - act
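The reuse-and-swap pattern Doron describes can be sketched without Lucene at all. Below, `Searcher` is a hypothetical stand-in for an expensive-to-open, thread-safe object (in Lucene this role is played by `IndexSearcher`): all queries share one instance until the application decides the index has changed and swaps in a fresh one.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a thread-safe, expensive-to-open searcher.
final class Searcher {
    final long version;
    Searcher(long version) { this.version = version; }
}

// Holds one shared searcher; all query threads reuse it until refresh()
// swaps in a new one after the index has been updated.
final class SearcherHolder {
    private final AtomicReference<Searcher> current =
            new AtomicReference<>(new Searcher(1));

    Searcher get() { return current.get(); }

    void refresh(long newVersion) { current.set(new Searcher(newVersion)); }
}
```

Query threads call `get()` per query; only the refresh path pays the cost of constructing a new searcher.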

Re: index architectures

2006-10-18 Thread Paul Waite
Some excellent feedback guys - thanks heaps. On my OOM issue, I think Hoss has nailed it here: > That said: if you are seeing OOM errors when you sort by a field (but > not when you use the docId ordering, or sort by score) then it sounds > like you are keeping references to IndexReaders around

Re: termpositions at index time...

2006-10-18 Thread Erick Erickson
I tried the notion of a temporary RAMDirectory already, and the documents parse unacceptably slowly, 8-10 seconds. Great minds think alike. Believe it or not, I have to deal with a 7,500 page book that details Civil War records of Michigan volunteers. The XML form is 24M, probably 16M of text exc

Re: question regarding usage of IndexWriter.setMaxFieldLength()

2006-10-18 Thread Erick Erickson
I had a similar question a while ago and the answer is "you can't cheat". According to what the guys said, this doc.add("field", ) doc.add("field", ) doc.add("field", ) is just the same as this doc.add("field", ) But go ahead and increase the maxfieldlength. I'm successfully indexing (unstored

Re: near duplicates

2006-10-18 Thread John Casey
On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote: Find Me wrote: > How to eliminate near duplicates from the index? Someone suggested that I > could look at the TermVectors and do a comparison to remove the > duplicates. As an alternative you could also have a look at the paper "Detecting P

question regarding usage of IndexWriter.setMaxFieldLength()

2006-10-18 Thread d rj
Hello- I was wondering about the usage of IndexWriter.setMaxFieldLength() it is limited, by default, to 10k terms per field. Can anyone tell me if this is a "per field" limit or a "per uniquely named field" limit? I.e. in the following snippet I add many words to different Fields all w/ the

Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin
Erick Erickson wrote: Arbitrary restrictions by IT on the space the indexes can take up. Actually, I won't categorically say I *can't* make this happen, but in order to use this option, I need to be able to present a convincing case. And I can't do that until I've exhausted my options/creativity.

Re: termpositions at index time...

2006-10-18 Thread Erick Erickson
Arbitrary restrictions by IT on the space the indexes can take up. Actually, I won't categorically say I *can't* make this happen, but in order to use this option, I need to be able to present a convincing case. And I can't do that until I've exhausted my options/creativity. And this way it keeps fo

Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin
Erick Erickson wrote: Here's my problem: We're indexing books. I need to a> return books ordered by relevancy b> for any single book, return the number of hits in each chapter (which, of course, may be many pages). 1>If I index each page as a document, creating the relevance on a book basis

termpositions at index time...

2006-10-18 Thread Erick Erickson
Here's my problem: We're indexing books. I need to a> return books ordered by relevancy b> for any single book, return the number of hits in each chapter (which, of course, may be many pages). 1>If I index each page as a document, creating the relevance on a book basis is interesting, but collec
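One way to reconcile requirements a> and b> with page-per-document indexing is to aggregate page-level hits after the search. The sketch below is Lucene-agnostic: `PageHit` is a hypothetical stand-in for a matching page document that stores its book id and chapter number (the field names are assumptions, not from the thread).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A page-level hit; each page document is assumed to carry its book id and
// chapter number (hypothetical fields, for illustration only).
class PageHit {
    final String bookId;
    final int chapter;
    PageHit(String bookId, int chapter) {
        this.bookId = bookId;
        this.chapter = chapter;
    }
}

class ChapterCounts {
    // Collapse page-level hits into per-chapter hit counts for one book,
    // so chapter counts can be reported alongside book-level relevance.
    static Map<Integer, Integer> countByChapter(List<PageHit> hits, String bookId) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (PageHit h : hits) {
            if (h.bookId.equals(bookId)) {
                counts.merge(h.chapter, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The book-level ranking problem (requirement a>) still has to be solved separately, e.g. by combining the page scores per book.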

RE: DateTools oddity....

2006-10-18 Thread Paul Snyder
DITTO !!! I like date truncation, but when I store a truncated date, I don't want to retrieve the time in Greenwich, England at midnight of the date I'm truncating in the local machine's time zone. Nothing against the Brits, it just doesn't do me any good to know what time it was over there on th

Re: DateTools oddity....

2006-10-18 Thread Emmanuel Bernard
No, but using a constant timezone is a good thing anyway, since the index will not keep track of that info and will not really care, as long as you always use DateTools (index and search). You can always rewrite DateTools with your own timezone, but EDT is bad since it is vulnerable to daylight s

Re: DateTools oddity....

2006-10-18 Thread Michael J. Prichard
Dang it :) Any way to set the timezone? Emmanuel Bernard wrote: DateTools uses GMT as a timezone Tue Aug 01 21:15:45 EDT 2006 Wed Aug 02 02:15:45 EDT 2006 Michael J. Prichard wrote: When I run this java code: Long dates = new Long("1154481345000"); Date dada = new Date(dates.longV

Re: DateTools oddity....

2006-10-18 Thread Doug Cutting
Michael J. Prichard wrote: I get this output: Tue Aug 01 21:15:45 EDT 2006 That's August 2, 2006 at 01:15:45 GMT. 20060802 Huh?! Should it be: 20060801 DateTools uses GMT. Doug

Re: DateTools oddity....

2006-10-18 Thread Emmanuel Bernard
DateTools uses GMT as a timezone Tue Aug 01 21:15:45 EDT 2006 Wed Aug 02 02:15:45 EDT 2006 Michael J. Prichard wrote: When I run this java code: Long dates = new Long("1154481345000"); Date dada = new Date(dates.longValue()); System.out.println(dada.toString()); System.out

DateTools oddity....

2006-10-18 Thread Michael J. Prichard
When I run this java code: Long dates = new Long("1154481345000"); Date dada = new Date(dates.longValue()); System.out.println(dada.toString()); System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY)); I get this output: Tue Aug 01 21:15:45 EDT 2006 200608
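The surprise in this thread is plain JDK behavior: the same instant formatted at day resolution lands on different calendar days depending on the time zone, and DateTools truncates in GMT. A minimal sketch using only `SimpleDateFormat` (the helper name below is ours, not Lucene's):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Format one instant at day resolution in a given time zone. DateTools
// truncates in GMT, which is why an evening EDT timestamp lands on the
// "next" day when truncated.
class DayStamp {
    static String format(long epochMillis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(epochMillis));
    }
}
```

For the instant in the thread, `1154481345000L` (Tue Aug 01 21:15:45 EDT 2006), GMT formatting gives `20060802` while America/New_York gives `20060801`, which is exactly the discrepancy Michael saw.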

Scalability Questions

2006-10-18 Thread Guerre Bear
Hello All, Lucene looks very interesting to me. I was wondering if any of you could comment on a few questions: 1) Assuming I use a typical server such as a dual-core dual-processor Dell 2950, about how many files can Lucene index and still have a sub-two-second search speed for a simple search

Re: Lucene 2.0.1 release date

2006-10-18 Thread Peter Keegan
This makes it relatively safe for people to grab a snapshot of the trunk with less concern about latent bugs. I think the concern is that if we start doing this stuff on trunk now, people that are accustomed to snapping from the trunk might be surprised, and not in a good way. +1 on this. T

Re: Error using Luke

2006-10-18 Thread Volodymyr Bychkoviak
You can get Lucene 1.9.1 and make Luke use this version (you need luke.jar, not luke-all.jar). Version 1.9.1 contains the API that was removed (as deprecated) in the 2.0 version of Lucene, and should still be able to read indexes created by Lucene 2.0 (correct me if I'm wrong) and then run Luke with com

Re: near duplicates

2006-10-18 Thread karl wettin
On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote: You need to create a fuzzy signature of the document, based on a term histogram or shingles - take a look at the Signature framework in Nutch. There is a substantial literature on this subject - go to Citeseer and run a search for "near duplicate
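As a toy illustration of the shingle idea mentioned above (not the Nutch Signature framework itself): break each text into overlapping n-word windows and compare the resulting sets with Jaccard similarity; near-duplicates share most of their shingles.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal near-duplicate check via word shingles. This is a simplified
// sketch of the general technique, not Nutch's implementation.
class Shingles {
    // Collect all overlapping n-word windows of the text.
    static Set<String> shingles(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Jaccard similarity: |intersection| / |union| of the shingle sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```

In practice one would hash the shingles and keep only a signature (e.g. min-hashes) rather than the full sets, which is what the signature-based schemes in the literature do.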

Re: Error using Luke

2006-10-18 Thread vasu shah
Thank you very much. I have indeed turned off the norms. Is there any new version of Luke that I can use? Thanks, -Vasu Volodymyr Bychkoviak <[EMAIL PROTECTED]> wrote: seems that you created your index with norms turned off and trying to open with luke which can contain older ver

Re: Error using Luke

2006-10-18 Thread Volodymyr Bychkoviak
Seems that you created your index with norms turned off and are trying to open it with Luke, which may bundle an older version of Lucene. vasu shah wrote: Hi, I am getting this error when accessing my index with Luke. No sub-file with id _1.f0 found Does anyone have an idea about this?

Error using Luke

2006-10-18 Thread vasu shah
Hi, I am getting this error when accessing my index with Luke. No sub-file with id _1.f0 found Does anyone have an idea about this? Any help would be appreciated. Thanks, -Vasu

Re: index architectures

2006-10-18 Thread Michael D. Curtin
On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote: No they don't want that. They just want a small number. What happens is they enter some silly query, like searching for all stories with a single common non-stop-word in them, and with the usual sort criterion of by date (ie. a field) descendi

Re: index architectures

2006-10-18 Thread Joe Shaw
Hi, On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote: > No they don't want that. They just want a small number. What happens is > they enter some silly query, like searching for all stories with a single > common non-stop-word in them, and with the usual sort criterion of by date > (ie. a field

Re: index architectures

2006-10-18 Thread Chris Hostetter
: I *think* that if you reduce your result set by, say, a filter, you might : drastically reduce what gets sorted. I'm thinking of something like this : BooleanQuery bq = new BooleanQuery(); : bq.add(Filter for the last N days wrapped in a ConstantScoreQuery, MUST) : bq.add(all the rest of your st
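The filter-then-sort idea in Hoss's sketch can be shown language-agnostically. Below, `Story` is a hypothetical stand-in for a matching document with a date field; in Lucene the "filter" step would be the date-range clause wrapped in a ConstantScoreQuery, so the sort only ever touches the survivors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Filter-then-sort: restrict candidates to the last N days *before* sorting,
// so the sort only sees the survivors. Story is a hypothetical stand-in for
// a matching document with a date field.
class FilterThenSort {
    static class Story {
        final String id;
        final long dateMillis;
        Story(String id, long dateMillis) {
            this.id = id;
            this.dateMillis = dateMillis;
        }
    }

    static List<Story> recentSortedByDateDesc(List<Story> matches,
                                              long nowMillis, long windowMillis) {
        List<Story> survivors = new ArrayList<>();
        for (Story s : matches) {
            if (nowMillis - s.dateMillis <= windowMillis) {  // the "filter" clause
                survivors.add(s);
            }
        }
        // Sort only the filtered set, newest first.
        survivors.sort(Comparator.comparingLong((Story s) -> s.dateMillis).reversed());
        return survivors;
    }
}
```

For the "silly query matching everything" case, the win comes from the filter shrinking the set that the sort machinery (and any per-document sort caches) must touch.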

Re: Preventing merging by IndexWriter

2006-10-18 Thread Erick Erickson
Your problem is out of my experience, so all I can suggest is that you search the list archive. I know the idea of faceted searching has been discussed by people with waaay more experience in that realm than I have and, as I remember, there were some links provided. I just searched for 'facete

Re: index architectures

2006-10-18 Thread Erick Erickson
No, you've got that right. But there's something I think you might be able to try. Fair warning, I'm remembering things I've read on this list and my memory isn't what it used to be. I *think* that if you reduce your result set by, say, a filter, you might drastically reduce what gets sorted.

RE: Preventing merging by IndexWriter

2006-10-18 Thread Johan Stuyts
> > So my questions are: is there a way to prevent the IndexWriter from > > merging, forcing it to create a new segment for each indexing batch? > > Already done in the Lucene trunk: > http://issues.apache.org/jira/browse/LUCENE-672 > > Background: > http://www.gossamer-threads.com/lists/lucene/j

RE: Preventing merging by IndexWriter

2006-10-18 Thread Johan Stuyts
> Why go through all this effort when it's easy to make your > own unique ID? > Add a new field to each document "myuniqueid" and fill it in yourself. It'll never change then. I am sorry I did not mention in my post that I am aware of this solution but that it cannot be used for my purposes.