"Deleting" documents without deleting them

2010-03-15 Thread Daniel Noll
Hi all. I'm trying to implement a form of document deletion where the previous versions are kept around forever (a primitive form of versioning) but excluded from the search results. I notice that after calling IndexWriter.deleteDocuments, even if you close and reopen the index, the documents ar

Re: issue querying index.

2010-03-15 Thread Paulo Avelar
Hi Erick, Thanks so much for your explanation. I'm on my third day of Lucene programming, and I've got to say, the best thing is this forum. It's very nice to get this level of help, especially that quickly. I sense a great community here. Can't wait to attend Apache Com. (not yet announced) Cheers, P

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Have you seen SpanNotQuery?: For a document that looks like: T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 ... ... You could genera

Re: Increase number of available positions?

2010-03-15 Thread Erick Erickson
Not quite what I had in mind, more like level1-1/level2-1/level3-1/Term1 level1-1/level2-1/level3-1/Term2 level1-1/level2-1/level3-2/Term3 level1-1/level2-1/level3-2/Term4 With an increment gap of 100 and an analyzer that splits on slashes, the term positions would be something like: term term p
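The position arithmetic described in this message can be sketched without Lucene. What follows is a simplified model under stated assumptions, not Lucene's actual bookkeeping: each value is split on slashes, tokens within a value take consecutive positions, and the gap (100 here) is added between one value and the next. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionGapSketch {

    /**
     * Assigns a position to every slash-separated token of every value.
     * Simplified model: +1 per token, plus `gap` between consecutive values.
     */
    public static List<Integer> positions(List<String> values, int gap) {
        List<Integer> result = new ArrayList<>();
        int position = -1;      // no token emitted yet
        boolean first = true;
        for (String value : values) {
            if (!first) {
                position += gap; // jump ahead before the next value
            }
            first = false;
            for (String token : value.split("/")) {
                position += 1;   // each token advances by one
                result.add(position);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "level1-1/level2-1/level3-1/Term1",
                "level1-1/level2-1/level3-1/Term2");
        // prints [0, 1, 2, 3, 104, 105, 106, 107]
        System.out.println(positions(values, 100));
    }
}
```

In this model every additional value consumes roughly the gap plus its token count in positions, which is why a large gap combined with a very large number of values in one document can exhaust a 32-bit position space.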

Re: Increase number of available positions?

2010-03-15 Thread Rene Hackl-Sommer
Hi Erick, What about indexing the triplets with a small increment gap between? That is: ... gets indexed as: level1-1/level2-1/level3-1 +gap 100 level1-1/level2-1/level3-2 +gap 100 level1-1/level2-2/level3-3 +gap 100 level1-1/level2-2/level3-4 If I understand this correctly, the field w

Re: Increase number of available positions?

2010-03-15 Thread Rene Hackl-Sommer
Hi Steve, Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field? Well, the hierarchical structure needs to be maintained. As hundreds of Level_X entities can be found on levels 2 and 3, I need to be able to tell for instance whic

Re: Batch Indexing - best practice?

2010-03-15 Thread Erick Erickson
What's a document? What's indexing? Here's what I'd do as a very first step. Time the actual indexing and report it out. By that I mean how long does IndexWriter.addDocument() take? If you actually get the document from wherever first then add all the fields and add the document, I'd time adding t
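A minimal sketch of the timing step suggested here, assuming nothing about the poster's code: a small accumulator that measures the wall-clock time of one repeated step, so that the time spent in a call such as IndexWriter.addDocument() can be reported separately from fetching and building the document. `StepTimer` is a hypothetical helper, and the timed lambda is only a placeholder for the real call.

```java
import java.util.concurrent.TimeUnit;

public class StepTimer {
    private long totalNanos;
    private int calls;

    /** Runs one step and adds its elapsed wall-clock time to the running total. */
    public void time(Runnable step) {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    public int calls() {
        return calls;
    }

    public long totalMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalNanos);
    }

    public static void main(String[] args) {
        StepTimer addTimer = new StepTimer();
        for (int i = 0; i < 500; i++) {
            // Placeholder for the call being measured, e.g. indexWriter.addDocument(doc)
            addTimer.time(() -> Math.sqrt(42.0));
        }
        System.out.println(addTimer.calls() + " calls, " + addTimer.totalMillis() + " ms total");
    }
}
```

Because IndexWriter.addDocument() declares checked exceptions, a real measurement would need to wrap the call in a try/catch inside the lambda, or use a functional interface that permits them. Using one timer for document construction and another for the add call shows which of the two dominates.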

Re: Increase number of available positions?

2010-03-15 Thread Erick Erickson
I was wondering about Steven's approach too; have you considered it? I don't know the internals of whether you could go to a 64 bit quantity for term positions, but I suspect it would be *very* involved, but perhaps people more familiar with the code could comment. How big is your corpus? Assu

Re: Batch Indexing - best practice?

2010-03-15 Thread Mark Miller
Really depends - StandardAnalyzer is probably a slower analyzer. But for example, with my quad core desktop machine, indexing with 3 or 4 threads, I can do at least a couple hundred wikipedia docs per second (though I'm not using StandardAnalyzer). I'm indexing 10,000 docs in about a minute.

RE: Batch Indexing - best practice?

2010-03-15 Thread Murdoch, Paul
Thanks. I'll try lowering the merge factor and see if speed increases. The indexing is threaded, similar to the utility class in Listing 10.1 from Lucene in Action. Search speed is great once the index is built, close to real time. So my main problem is getting the indexing speed fixed. I d

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field? On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote: > > > Search in MyField: Terms T1 and T2 on Level_2 and T3, > > > T4, and T5 on Level_3, which should both be in the > > >

Re: Batch Indexing - best practice?

2010-03-15 Thread Ian Lea
See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for plenty of tips. Suggested by Mike just a few hours ago in another thread ... -- Ian. On Mon, Mar 15, 2010 at 2:41 PM, Murdoch, Paul wrote: > Hi, > > > > I'm using Lucene 2.9.2.  Currently, when creating my index, I'm calling > ind

Re: Batch Indexing - best practice?

2010-03-15 Thread Mark Miller
On 03/15/2010 10:41 AM, Murdoch, Paul wrote: Hi, I'm using Lucene 2.9.2. Currently, when creating my index, I'm calling indexWriter.addDocument(doc) for each Document I want to index. The Documents aren't large and I'm averaging indexing about 500 documents every 90 seconds. I'd like to try

Batch Indexing - best practice?

2010-03-15 Thread Murdoch, Paul
Hi, I'm using Lucene 2.9.2. Currently, when creating my index, I'm calling indexWriter.addDocument(doc) for each Document I want to index. The Documents aren't large and I'm averaging indexing about 500 documents every 90 seconds. I'd like to try and speed this up, unless 90 seconds for 50

Re: Increase number of available positions?

2010-03-15 Thread Rene Hackl-Sommer
Is your entire corpus a single document? Because I'm having trouble imagining a single document where this would be a problem, unless your increment gap is huge. The term positions are relative to a single document... It is getting pretty huge, yes (see below). The term positions are also

Re: issue querying index.

2010-03-15 Thread Erick Erickson
No, that's not strange at all. You probably opened Luke (or refreshed the reader) after the writer closed. This'll trip you up repeatedly if you don't get it straight; it already has twice. When Lucene opens an IndexReader, pretend that the IR copied the *current* index to a temporary fil
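The "pretend the reader copied the index" analogy can be made concrete with a toy model. This is an illustration of the point-in-time behaviour only, not how Lucene implements readers; `ToyIndex` and `ToyReader` are hypothetical stand-ins and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SnapshotSketch {

    /** Stand-in for the committed index: just a list of document ids. */
    public static class ToyIndex {
        private final List<String> committed = new ArrayList<>();

        public void addAndCommit(String doc) {
            committed.add(doc);
        }

        public ToyReader openReader() {
            return new ToyReader(committed);
        }
    }

    /** Stand-in for a reader: sees only what existed when it was opened. */
    public static class ToyReader {
        private final List<String> snapshot;

        ToyReader(List<String> current) {
            // Copy at open time, per the analogy; later commits are not visible.
            this.snapshot = Collections.unmodifiableList(new ArrayList<>(current));
        }

        public int numDocs() {
            return snapshot.size();
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        ToyReader early = index.openReader();   // opened before indexing
        index.addAndCommit("doc1");
        index.addAndCommit("doc2");
        ToyReader late = index.openReader();    // opened after the commit
        System.out.println(early.numDocs());    // prints 0
        System.out.println(late.numDocs());     // prints 2
    }
}
```

The reader opened before the documents were added keeps reporting zero documents, which mirrors the symptom in this thread: a search returns nothing on the first run unless the reader is opened (or reopened) after the writer has committed.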

Re: Increase number of available positions?

2010-03-15 Thread Erick Erickson
Is your entire corpus a single document? Because I'm having trouble imagining a single document where this would be a problem, unless your increment gap is huge. The term positions are relative to a single document... You say that your levels have fewer than 1,000 elements each. With an increment ga

Re: Lucene Indexing out of memory

2010-03-15 Thread Michael McCandless
Try the ideas here? http://wiki.apache.org/lucene-java/ImproveIndexingSpeed Mike On Mon, Mar 15, 2010 at 1:51 AM, ajay_gupta wrote: > > Erick, > I did get some hint for my problem. There was a bug in the code which was > eating up the memory which I figured out after lot of effort. > Thanks

Re: issue querying index.

2010-03-15 Thread Paulo Avelar
Awesome! Thanks a lot for your help. As I build my app I have with me a copy of Lucene in Action :) very nice book indeed. I will verify what you just told me. :) What is strange is that I can see the documents using Luke (in the debugger I had a break point before the search call and I inspe

Increase number of available positions?

2010-03-15 Thread Rene Hackl-Sommer
Hello, I am working at a use case that is very demanding regarding the number of token positions. For one special field in the index, I need to represent different hierarchy levels, like this: Please note that I need to do this with Lucene, not a XML search engine. Now, on Level_3 there

RE: issue querying index.

2010-03-15 Thread Uwe Schindler
It looks like you have opened the searcher before indexing. This makes the searcher see only the empty index, as at the time of opening it contains no documents. For the test you should move the searcher creation after the close of IW. But in a real-world example, you should use IndexReader.reop

Re: issue querying index.

2010-03-15 Thread Paulo Avelar
I would love to, but it's a bit more complicated than that; there are several classes, and once you see the test you will probably understand why... Here is the test: package net.resumage.se.searcher; import net.resumage.se.FileUtils; import net.resumage.se.extractor.DocumentExtractor; import

RE: issue querying index.

2010-03-15 Thread Uwe Schindler
Can you send us the test code? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Paulo Avelar [mailto:phave...@gmail.com] > Sent: Monday, March 15, 2010 9:24 AM > To: java-user@lucene.apache.org > Subject:

Re: issue querying index.

2010-03-15 Thread Paulo Avelar
Thanks for the answer, but I thought about that, and yes, I did close the indexWriter before I search. I experimented with both calling commit and close, but I still get the same behavior. It's like there is a flushing issue, not sure. On Mon, Mar 15, 2010 at 1:21 AM, Uwe Schindler wrote: > I think yo

RE: issue querying index.

2010-03-15 Thread Uwe Schindler
I think you forgot to commit your changes in IndexWriter or have not closed it before creating Searcher/IndexReader. So on the second run, the index is seen, because of the previous run, which was committed on jvm exit. If you are using NearRealtimeSearch (IndexWriter#getIndexReader), please tel

issue querying index.

2010-03-15 Thread Paulo Avelar
Hello, I'm using the latest Lucene, 3.0.1. I have written a simple test which does the usual: creates an index, then adds 2 test documents to it. I'm having a strange problem: the first time I run my test, which runs a query, I get nothing, but the second time I run my test (exactly the same code),