RE: WhitespaceAnalyzer and version

2010-04-12 Thread Uwe Schindler
As of Lucene 3.0, WhitespaceAnalyzer does not yet have a Version ctor. It will come in 3.1, when Lucene is changed to be Unicode 4.0 conformant (3.0 and before are Unicode 3.0, which is what Java 1.4 supports). QueryParser needs the Version ctor for the handling of stop words. As WhitespaceAnalyzer does not use StopFilter, it did not need one
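For reference, a minimal sketch of the 3.0-era split Uwe describes (class and constant names are from the Lucene 3.0 API as I recall it; treat them as assumptions to verify):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.util.Version;

    // In 3.0, WhitespaceAnalyzer still has only a no-arg constructor...
    Analyzer analyzer = new WhitespaceAnalyzer();
    // ...while QueryParser already takes a Version, mainly so it can apply
    // the right (stop-word-related) defaults for that release.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);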

Re: WhitespaceAnalyzer and version

2010-04-12 Thread Shai Erera
Hi. WhitespaceAnalyzer definitely has a Version-dependent ctor. What Lucene version do you use? You can use LUCENE_CURRENT, but be aware that if a certain Analyzer's behavior has changed in a way that affects your app, you'll need to reindex your data. Usually an Analyzer (or any other Version-aware class)
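A hedged sketch of Shai's point, assuming a Lucene build (trunk/3.1 at the time) where this ctor exists; LUCENE_CURRENT tracks whatever release you compile against:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.util.Version;

    // Convenient, but if the analyzer's behavior changes in a later
    // release, an index built with the old behavior must be rebuilt.
    Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_CURRENT);

Pinning an explicit constant such as Version.LUCENE_30 avoids that surprise, at the cost of not picking up newer behavior automatically.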

Re: IndexWriter and memory usage

2010-04-12 Thread Lance Norskog
There are some bugs where the writer's data structures retain data after it is flushed. The fixes were committed maybe within the past week. If you can pull the trunk and try it with your use case, that would be great. On Mon, Apr 12, 2010 at 8:54 AM, Woolf, Ross wrote: > I was on vacation last week so just

Understanding lucene indexes and disk I/O

2010-04-12 Thread Burton-West, Tom
Hi all, Please let me know if this should be posted instead to the Lucene java-dev list. We have very large tis files (about 36 GB). I have not been too concerned as I assumed that due to the indexing of the tis file by the tii file, only a small portion of the file needed to be read. However
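The mechanism Tom alludes to: the .tii file is a sampled copy of the .tis terms dictionary held in RAM, so a term lookup seeks near the right spot in .tis and scans only a short stretch. A sketch of that era's read-side tuning knob (the four-arg IndexReader.open overload is from memory of the 2.9/3.0 API; verify before relying on it):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Keep only every 4th .tii entry in RAM; lookups then scan a slightly
    // longer stretch of .tis, trading a little extra I/O for a 4x smaller
    // in-RAM term index.
    IndexReader reader = IndexReader.open(
        FSDirectory.open(new File("/path/to/index")),
        null,   // IndexDeletionPolicy: use the default
        true,   // read-only
        4);     // termInfosIndexDivisor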

WhitespaceAnalyzer and version

2010-04-12 Thread Siraj Haider
We are in the process of removing the deprecated API from our code to move to the Version-based APIs. One of the deprecations is that the QueryParser now expects a Version parameter in the constructor. I also have read somewhere that we should pass the same Version to the analyzer when indexing as well as when searching
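The usual pattern (a sketch, assuming the 3.0-era API and an open Directory named dir; field names are illustrative) is to pin one Version constant and pass it to every Version-aware class, at index time and search time alike:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.util.Version;

    final Version MATCH_VERSION = Version.LUCENE_30;

    // Index time and search time share the same analyzer and Version,
    // so both sides tokenize text identically.
    Analyzer analyzer = new StandardAnalyzer(MATCH_VERSION);
    IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    QueryParser parser = new QueryParser(MATCH_VERSION, "contents", analyzer);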

Re: How to get the tokens for a given document

2010-04-12 Thread Herbert L Roitblat
Thanks David. I think that I neglected to say that I am using pyLucene 2.4.0. Your suggestion is almost what we're doing:

    indexReader.getTermFreqVector(ID, fieldName)
    self.hits = list(self.lSearcher.search(self.query))
    if self.hits:
        self.hit = lucene.Hit.cast_(self.hits[0])
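For comparison, the Java equivalent of those pyLucene calls (a sketch; it only returns data if the field was indexed with term vectors enabled):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Returns null when the field was indexed without TermVector support.
    TermFreqVector tfv = indexReader.getTermFreqVector(docId, fieldName);
    if (tfv != null) {
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + " x " + freqs[i]);
      }
    }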

Re: How to get the tokens for a given document

2010-04-12 Thread David Causse
Hi, you are walking indexReader.terms(), then indexReader.termDocs(Term t) for each term, and then matching your docID against the termDocs enum? So you walk the whole index? You need a forward index, and Lucene is inverted, but you have IMHO 2 solutions with Lucene (sadly, they both require re-indexing)
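One of the two re-indexing options David likely means is term vectors: a per-document record of its terms written at index time. A sketch of enabling them (3.0-era Field API, assuming an open IndexWriter named writer and the document text in a String named text):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // TermVector.YES makes Lucene store, per document, the list of terms
    // and their frequencies, readable later via getTermFreqVector().
    doc.add(new Field("contents", text,
                      Field.Store.NO,
                      Field.Index.ANALYZED,
                      Field.TermVector.YES));
    writer.addDocument(doc);

(The other common option is storing the original text and re-analyzing it at read time.)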

How to get the tokens for a given document

2010-04-12 Thread Herbert Roitblat
Hi, folks. I appreciate the help people have been offering. Here is my problem. My immediate need is to get the tokens for a document from the Lucene index. I have a list of documents that I walk, one at a time. Right now, I am getting the tokens and their frequencies and the problem is that

Re: Removing terms in the Index

2010-04-12 Thread Railan Xisto
And the main objective: when I pass the phrase "Lucene in Action", it should find and remove that phrase from the index, so that when I later pass the 2nd term ("Lucene"), it does not find that phrase anymore, since "Lucene in Action" has already been found and removed. 2010/4/12 Railan Xisto > Ok. There is a piece of code attached
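Note that Lucene deletes whole documents matching a query, not individual terms inside a document. A hedged sketch of the closest primitive to what Railan describes (the field name and lowercased terms are assumptions about his analyzer):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // Deletes every document whose "contents" field contains the exact
    // phrase; a later search for "lucene" then no longer finds them.
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("contents", "lucene"));
    phrase.add(new Term("contents", "in"));
    phrase.add(new Term("contents", "action"));
    writer.deleteDocuments(phrase);
    writer.commit();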

Exception, field is not stored

2010-04-12 Thread Ramon De Paula Marques
Hi guys, I'm trying to use the Highlighter for a better search experience on my website, but when the search returns HTML and PDF documents that were indexed with a Reader, it causes an exception saying the field is not stored. I don't know where to attack now; must I try to index the documents storing the fields? How to do that?
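The Highlighter needs the original text of the field, and a field fed from a Reader is tokenized but never stored. A sketch of the fix (readFully is a hypothetical helper that slurps the Reader into a String):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Store the extracted text so the Highlighter can fetch it back
    // from the hit document at search time.
    String text = readFully(contentReader);  // hypothetical helper
    Document doc = new Document();
    doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED));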

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-12 Thread Herbert Roitblat
Update: reusing the reader and searcher made almost no difference. It still eats up the heap. - Original Message - From: "Herbert L Roitblat" To: Sent: Monday, April 12, 2010 6:50 AM Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Thank you Michael. Your suggestions

Re: Lucene Partition Size

2010-04-12 Thread Ivan Provalov
Thank you, Karl! --- On Fri, 4/9/10, Karl Wettin wrote: > From: Karl Wettin > Subject: Re: Lucene Partition Size > To: java-user@lucene.apache.org > Date: Friday, April 9, 2010, 9:39 AM > It's hard for me to say why this is > slow. > > Here are a few more questions whose answers might provide >

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-12 Thread Herbert L Roitblat
Thank you Michael. Your suggestions are helpful. I inherited all of the code that uses pyLucene and don't consider myself an expert on it, so I very much appreciate your suggestions. It does not seem to be the case that these elements represent the index of the collection. TermInfo and Term

Re: How to calculate payloads in queries too

2010-04-12 Thread Mike Schultz
I see the payload in the token now. -- View this message in context: http://n3.nabble.com/How-to-calculate-payloads-in-queries-too-tp712743p713413.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Too many open files

2010-04-12 Thread David Causse
After a closer look, I forgot to mention a major clue: it's also the first time we use NRT. I thought IW.getReader() would return a pooled NRT reader, but in fact it always returns a new IR. This should explain the "Too many open files" exception. After each addDocument(doc) I prepare a reader with IW.getReader()
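If each IW.getReader() call really hands back a fresh reader, the previous one must be closed or its file handles accumulate. A minimal sketch of that bookkeeping (assuming fields named writer and currentReader):

    import org.apache.lucene.index.IndexReader;

    // Swap in the new NRT reader and release the previous one's files.
    IndexReader newReader = writer.getReader();
    if (currentReader != null) {
      currentReader.close();
    }
    currentReader = newReader;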

Re: Removing terms in the Index

2010-04-12 Thread Railan Xisto
Ok. There is a piece of code attached. As I already said, when I pass the term "Lucene in Action" I want it to find only the 1st sentence. 2010/4/10 Shai Erera > Hi. I'm not sure I understand what you searched for. When you search > for "Lucene in action", do you search it with the quotes

Too many open files

2010-04-12 Thread David Causse
Hi, I found a bug in my application: there was no commit at all in the whole indexing chain. Thanks to this bug I noticed that Lucene keeps file system references to deleted index files. So after indexing many files I hit a "Too many open files" error. I use a 32-bit 1.6.16 JVM on a 64-bit Linux system.
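A sketch of the missing piece as described, a periodic commit in the indexing chain (whether this alone releases the handles also depends on closing old readers, as the NRT follow-up above notes; assuming an open IndexWriter named writer):

    // Commit every 1000 documents so the writer can drop references to
    // index files that merges have since replaced.
    int count = 0;
    for (Document doc : docs) {
      writer.addDocument(doc);
      if (++count % 1000 == 0) {
        writer.commit();
      }
    }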

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-12 Thread Michael McCandless
The large count of TermInfo & Term is completely normal -- this is Lucene's term index, which is entirely RAM resident. In 3.1, with flexible indexing, the RAM efficiency of the terms index should be much improved. While opening a new reader/searcher for every query is horribly inefficient, it sh
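A sketch of the reader reuse Michael recommends (IndexReader.reopen() has existed since 2.4 and returns the same instance when nothing changed; assuming fields named reader and searcher):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    // Cheap when nothing changed; only new segments are actually loaded.
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
      reader.close();          // release the old reader's resources
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }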