Re: Determining the IDF while searching for documents

2005-06-13 Thread Chris Hostetter
I'm not 100% sure I understand your question, but... : order to compute the TF I count the occurrences of terms which are : similar to the term. But I've got problems to compute the IDF, because I : must know the number of documents in which the term appears before : searching for the documents (i

Re: Updating documents

2005-06-13 Thread Chris Hostetter
: When I do this all fields that were indexed and/or tokenized but not : stored get lost. : : So is there any way to preserve fields that were not stored? : Reconstructing these fields is too expensive in my application. "preserving" those fields is pretty much the opposite of "not storing" them.

wiki now sends Vary: Cookie (was Re: DBSight, search on database by Lucene)

2005-06-13 Thread Joshua Slive
Paul Querna wrote: Joshua Slive wrote: What we want is for anything with a Cookie: header to totally bypass the cache. I don't know of any way to configure that. Moin should be sending Cache-Control: Private in these cases, in addition to the Vary: Cookie header. If they don't they will

Re: Mobile Lucene

2005-06-13 Thread Dan Funk
It comes with Linux installed - looks like they wrapped their own - no official distribution. You can, with a little effort, put Debian on it ( http://www.eleves.ens.fr/home/leurent/zaurus.html ). It doesn't come with Java, but you can get an official version from Sun that works very well ( h

Re: Mobile Lucene

2005-06-13 Thread christopher may
What are you running as far as the OS? And thanks for the response. From: Dan Funk <[EMAIL PROTECTED]> Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Mobile Lucene Date: Mon, 13 Jun 2005 15:10:46 -0400 I have Sharp zaurus SL-C3000 running J2me - I was able

Re: Indexes auto creation

2005-06-13 Thread Daniel Naber
On Monday 13 June 2005 18:37, Kadlabalu, Hareesh wrote: > I ran into a related problem; when I create an IndexWriter with a > FSDirectory created with create=true, an existing index would somehow > get corrupted Well, it doesn't get corrupted, it gets deleted. That's what create=true is supposed

Determining the IDF while searching for documents

2005-06-13 Thread Barbara Krausz
Hi all, is it possible to determine the IDF (the documents in which a term appears) while searching for documents? I implemented an index based on trigrams, i.e. the indexterms are now Strings of 3 characters so that my search engine finds documents with OCR-Errors. When I'm searching for the
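The trigram indexing and document counting described in this message can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (TrigramIdf, trigrams, documentFrequencies, idf) are hypothetical, the documents are held in memory as strings, and the formula is the textbook ln(N / df), not necessarily the scoring Lucene applies. In Lucene itself the per-term document count is available from IndexReader.docFreq(Term) without running a search first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class TrigramIdf {

    /** Splits text into overlapping 3-character terms; shorter text yields no terms. */
    static List<String> trigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            out.add(text.substring(i, i + 3));
        }
        return out;
    }

    /** For every trigram, counts the documents that contain it at least once. */
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // A set ensures each document is counted once per trigram.
            Set<String> unique = new HashSet<>(trigrams(doc));
            for (String t : unique) {
                df.merge(t, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Textbook IDF, ln(N / df). Returning 0 for an unseen term is an assumption. */
    static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("lucene", "lucent");
        Map<String, Integer> df = documentFrequencies(docs);
        for (String t : trigrams("lucene")) {
            System.out.println(t + " idf=" + idf(docs.size(), df.getOrDefault(t, 0)));
        }
    }
}
```

Because the document frequencies are gathered in a separate pass, the IDF of every query trigram is known before any document is scored, which is the ordering problem the question raises.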

Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote: > On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote: > > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: > > > I see, the list of exceptions makes this a lot more complicated than I > > > thought... Thanks a lot, Erik! > > > > I expect yo

Re: Mobile Lucene

2005-06-13 Thread Dan Funk
I have a Sharp Zaurus SL-C3000 running J2ME - I was able to use the current Lucene without modification. christopher may wrote: Hey all I am working on a project that requires a search engine on an embedded Linux that is also Bluetooth capable. Is there a Lucene mobile or can I recompile the cod

Re: OutOfMemory when indexing

2005-06-13 Thread Gusenbauer Stefan
Harald Stowasser wrote: >Stanislav Jordanov wrote: > > > >>Hi guys, >>Building some huge index (about 500,000 docs totaling to 10megs of plain >>text) we've run into the following problem: >>Most of the time the IndexWriter process consumes a fairly small amount >>of memory (about 32 megs).

Pros/Cons of a split index over a single large index

2005-06-13 Thread Aalap Parikh
Hi, I just have a general question: What are the pros and cons of a split index (a number of small indexes) as opposed to a single large index? As I have repeatedly seen in various posts at this group, people have opted for split indexes in cases where they have a large number of documents (say > 1 mil

Re: Indexes auto creation

2005-06-13 Thread Stephane Bailliez
Luke Francl wrote: You may want to try using IndexReader's indexExists family of methods. They will tell you whether or not an index is there. http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(org.apache.lucene.store.Directory) Good grief ! I missed th

Re: Hypenated word

2005-06-13 Thread Erik Hatcher
On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote: On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: I see, the list of exceptions makes this a lot more complicated than I thought... Thanks a lot, Erik! I expect you'll need to do some pre-processing. Read in your text into a buffer

Re: Displaying relevant text with Search results

2005-06-13 Thread Erik Hatcher
On Jun 13, 2005, at 10:58 AM, Kadlabalu, Hareesh wrote: Hi, I have a simple index with one default field that is stored and indexed. I want to display the query results along with some relevant text from the default field, the way search is implemented at http://www.lucenebook.com/

RE: Indexes auto creation

2005-06-13 Thread Pasha Bizhan
Hi, > From: news [mailto:[EMAIL PROTECTED] On Behalf Of Stephane Bailliez > > What I would like to do is something like: if the index does not > exist, then create one for me, otherwise use it. Look at IndexReader.indexExists method. Your code will be like this: bool createIndex = ! (IndexReade

RE: Displaying relevant text with Search results

2005-06-13 Thread Pasha Bizhan
Hi, > From: Kadlabalu, Hareesh [mailto:[EMAIL PROTECTED] > However, in order to really do it correctly, one needs to get > to the 'best' > part field's text where the density of searched word(s) is > highest. This could be a very expensive process. Does Lucene > give any help in achieving th

Re: Indexes auto creation

2005-06-13 Thread Volodymyr Bychkoviak
hello I'm using following code in the startup of my program String indexDirectory = //some init try { if ( !IndexReader.indexExists(indexDirectory)) { // working index doesn't exist so try to create a dummy index. IndexWriter iw = new IndexWriter(indexDirectory, new St

Re: Indexes auto creation

2005-06-13 Thread Luke Francl
You may want to try using IndexReader's indexExists family of methods. They will tell you whether or not an index is there. http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(org.apache.lucene.store.Directory)

Re: Indexes auto creation

2005-06-13 Thread Stephane Bailliez
Stephane Bailliez wrote: [...] try { writer = new IndexWriter(directory, analyzer, false) } catch (IOException e){ writer = new IndexWriter(directory, analyzer, true); } On a related note, the code above does not work if the index does not exist because of the lock created by the first

Out of Memory (correction)

2005-06-13 Thread Stanislav Jordanov
A small correction to my last letter: "1000gigs" should be "1000 megs" (sorry) Here's the corrected version: Hi guys, Building some huge index (about 500,000 docs totaling to 10megs of plain text) we've run into the following problem: Most of the time the IndexWriter process consumes a fairly

RE: Indexes auto creation

2005-06-13 Thread Kadlabalu, Hareesh
I ran into a related problem; when I create an IndexWriter with a FSDirectory created with create=true, an existing index would somehow get corrupted (Luke would come back with a message saying that the index is corrupt). IndexWriter will tell you that it has 0 documents at that stage even though t

Indexes auto creation

2005-06-13 Thread Stephane Bailliez
I have a very stupid question that puzzles me so far in the API. (I'm using Lucene 1.4.3) There is a boolean flag over the creation of the Directory which is basically: use it as is or delete the storage area Same for the index, the IndexWriter uses a flag 'use the existing or create a new on

Re: Hypenated word

2005-06-13 Thread Peter A. Friend
On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote: I see, the list of exceptions makes this a lot more complicated than I thought... Thanks a lot, Erik! There is a section about the problems that hyphens create in "Foundations of Statistical Natural Language Processing". Not only are t

Displaying relevant text with Search results

2005-06-13 Thread Kadlabalu, Hareesh
Hi, I have a simple index with one default field that is stored and indexed. I want to display the query results along with some relevant text from the default field, the way search is implemented at http://www.lucenebook.com/ . For example, searching for 'wonderf

Re: Hypenated word

2005-06-13 Thread Markus Wiederkehr
On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote: > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: > > I see, the list of exceptions makes this a lot more complicated than I > > thought... Thanks a lot, Erik! > > > > I expect you'll need to do some pre-processing. Read in your text into a

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Erik Hatcher
On Jun 13, 2005, at 8:44 AM, Harald Stowasser wrote: Harald Stowasser wrote: P.S. I tried now to use DateFilter. This works, but is very slow on longer Date-Ranges. (30sec. ) Filters in general were meant for one-time creation and caching. If the date ranges are fixed and the index not

Re: OutOfMemory when indexing

2005-06-13 Thread Harald Stowasser
Stanislav Jordanov wrote: > Hi guys, > Building some huge index (about 500,000 docs totaling to 10megs of plain > text) we've run into the following problem: > Most of the time the IndexWriter process consumes a fairly small amount > of memory (about 32 megs). > However, as the index size grow

Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: > I see, the list of exceptions makes this a lot more complicated than I > thought... Thanks a lot, Erik! > I expect you'll need to do some pre-processing. Read in your text into a buffer, line-by-line. If a given line ends with a hyphen, you
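The pre-processing idea in this message (read the text line by line and treat a trailing hyphen as a split word) could be sketched as follows. The message is cut off, so what it recommends after detecting the hyphen is not known; joining the fragment with the start of the next line is an assumption here, and the class name Dehyphenator and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-processing helper, not part of Lucene.
final class Dehyphenator {

    /**
     * Joins lines into one string. A line ending in '-' is glued to the next
     * line without a space and the hyphen is dropped; other lines are
     * separated by a single space. Blank lines are skipped.
     */
    static String joinHyphenatedLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        boolean joinNext = false; // true when the previous line ended with a hyphen
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (sb.length() > 0 && !joinNext) {
                sb.append(' ');
            }
            if (line.endsWith("-")) {
                sb.append(line, 0, line.length() - 1); // copy the line without its hyphen
                joinNext = true;
            } else {
                sb.append(line);
                joinNext = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("The docu-", "ment was", "scanned.");
        System.out.println(joinHyphenatedLines(page));
    }
}
```

This naive join also merges genuine compounds that happen to break at the hyphen ("well-" followed by "known" becomes "wellknown"), which is the kind of exception the thread says makes the problem harder than it looks.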

Re: Hypenated word

2005-06-13 Thread Markus Wiederkehr
I see, the list of exceptions makes this a lot more complicated than I thought... Thanks a lot, Erik! Markus On 6/13/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > > On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote: > > I work on an application that has to index OCR texts of scanned books. >

RE: TooManyClauses in BooleanQuery

2005-06-13 Thread Omar Didi
if you get an OutOfMemoryError, I believe the only thing you can do is just increase the JVM heap to a larger size. -Original Message- From: Harald Stowasser [mailto:[EMAIL PROTECTED] Sent: Monday, June 13, 2005 8:28 AM To: java-user@lucene.apache.org Subject: Re: TooManyClauses in Boo

Re: OutOfMemory when indexing

2005-06-13 Thread Markus Wiederkehr
I am not an expert, but maybe the occasionally high memory usage is because Lucene is merging multiple index segments together. Maybe it would help if you set maxMergeDocs to 10,000 or something. In your case that would mean that the minimum number of index segments would be 50. But again, this m
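The figure of 50 segments in this message follows from dividing the document count by the per-segment cap. A minimal sketch of that arithmetic, assuming maxMergeDocs acts as a hard upper bound on documents per segment (the class and method names are hypothetical):

```java
// Hypothetical helper illustrating the arithmetic only; it does not touch Lucene.
final class SegmentEstimate {

    /** Smallest number of segments when no segment may hold more than maxMergeDocs documents. */
    static long minSegments(long totalDocs, long maxMergeDocs) {
        if (totalDocs < 0 || maxMergeDocs <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and maxMergeDocs > 0");
        }
        return (totalDocs + maxMergeDocs - 1) / maxMergeDocs; // ceiling division
    }

    public static void main(String[] args) {
        // 500,000 documents with at most 10,000 per segment
        System.out.println(minSegments(500_000, 10_000));
    }
}
```

For the numbers in the thread this gives 500,000 / 10,000 = 50, matching the estimate above.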

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
Harald Stowasser wrote: P.S. I have now tried DateFilter. It works, but it is very slow on longer date ranges (30 sec.).

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
[EMAIL PROTECTED] wrote: > Hi Harald, > > it's nice to see, that there are others out there in Germany dealing with > the same problems as we have been doing in the past years :-) > > So for the "too many clauses" problem I have a solution for you, that I > want to share: > Just include some

Re: Hypenated word

2005-06-13 Thread Erik Hatcher
On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote: I work on an application that has to index OCR texts of scanned books. Naturally there occur many words that are hyphenated across lines. I wonder if there is already an Analyzer or maybe a TokenFilter that can merge those syllables back int

OutOfMemory when indexing

2005-06-13 Thread Stanislav Jordanov
Hi guys, Building some huge index (about 500,000 docs totaling to 10megs of plain text) we've run into the following problem: Most of the time the IndexWriter process consumes a fairly small amount of memory (about 32 megs). However, as the index size grows, the memory usage sporadically burst

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Erik Hatcher
On Jun 13, 2005, at 7:47 AM, Harald Stowasser wrote: 1. Sorting by Date is ruinously slow. So I deactivated it. How were you sorting by date? 3. I also read that we should save the Date as MMDD-String. I don't like this solution, because I don't know that this will work. And then I h
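On the doubt quoted here about storing dates as strings: assuming the format meant is a fixed-width, zero-padded year-month-day string such as yyyyMMdd (the quoted text is garbled, so this reading is an assumption), plain string comparison orders such keys chronologically. The sketch below illustrates that property only; SortableDate and its methods are hypothetical names and nothing in it is Lucene API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing that yyyyMMdd keys sort like the dates they encode.
final class SortableDate {

    /** Formats a date as an 8-character yyyyMMdd key (valid for 4-digit years). */
    static String toKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Inclusive range check done purely on the string keys. */
    static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String a = toKey(LocalDate.of(2004, 12, 31));
        String b = toKey(LocalDate.of(2005, 1, 1));
        System.out.println(a + " sorts before " + b + ": " + (a.compareTo(b) < 0));
    }
}
```

Because every key has the same width and the most significant part comes first, a lexicographic range over the keys selects the same documents as a range over the dates.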

Updating documents

2005-06-13 Thread Markus Wiederkehr
Hi all, I would like to update a document as follows. 1) retrieve the document from an IndexReader/Searcher 2) delete the document 3) manipulate the document, that is remove and add fields 4) save the document using an IndexWriter When I do this all fields that were indexed and/or tokenized but

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread a . herberger
Hi Harald, it's nice to see, that there are others out there in Germany dealing with the same problems as we have been doing in the past years :-) So for the "too many clauses" problem I have a solution for you, that I want to share: Just include somewhere at the very beginning of your program

TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
Hello lucene-list readers, first I want to introduce myself a little. Because I am new at this List: I am a programmer in a publishing company, 32 years of Age and you can find my picture at http://www.idowa.de/service/kontakt. We release some local newspapers and a website (http://www.idowa.de)

Hypenated word

2005-06-13 Thread Markus Wiederkehr
Hello, I work on an application that has to index OCR texts of scanned books. Naturally there occur many words that are hyphenated across lines. I wonder if there is already an Analyzer or maybe a TokenFilter that can merge those syllables back into whole words? It looks like Erik Hatcher uses so

Re: Ideas Needed - Finding Duplicate Documents

2005-06-13 Thread Paul Libbrecht
Have you tried comparing TermVectors? I would expect them, or an adjustment of them, to allow comparison to focus on "important terms" (e.g. about 100-200 terms) and then allow a more reasonable computation. paul On 12 June 05, at 16:37, Dave Kor wrote: Hi, I would like to poll the c
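One way to read the suggestion to compare term vectors is as a cosine similarity between term-frequency maps. The sketch below is a plain-Java illustration under that assumption: it builds the frequency maps from raw text with a naive tokenizer instead of reading Lucene's stored term vectors, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a stand-in for comparing two documents' term vectors.
final class TermVectorSimilarity {

    /** Lower-cases the text and counts each token; \W+ treats non-ASCII letters as separators. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two frequency vectors; 0 when either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("Lucene finds duplicate documents");
        Map<String, Integer> d2 = termFrequencies("Lucene finds similar documents");
        System.out.println(cosine(d1, d2));
    }
}
```

Restricting both maps to the most important terms, as the message proposes, would shrink the loops without changing the formula; how those terms are chosen is left open here.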

Re: How to navigate through indexed terms

2005-06-13 Thread Antoine Brun
Hi, thanks for the hint. I guess that the best solution would be to implement a previous() method. I was wondering if anyone has ever planned on doing this? Antoine Brun Of course, you could also add a previous() method into the source and submit the patch, as the code would be very similar t

Some problems with lucene in searching

2005-06-13 Thread sriram Thota
Hi, I am working on Lucene. I saw your suggestion about Lucene in a Google search. I am facing some problems in searching. Please go through my sample code and tell me where I have gone wrong. I will be thankful to you. This is my sample code: private static Document createDocument(File fFi