RE: too many file descriptors opened by Lucene shows (deleted) in /proc

2009-09-03 Thread Uwe Schindler
This is normal. When you open an IndexReader/IndexSearcher, it opens various file handles. If you additionally update/add/delete documents in parallel (even in another process), or optimize the index, the original IndexReader keeps using the "old" state of the index. IndexWriter deletes some files
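To make this concrete: the usual way to pick up the new index state and release the handles on the deleted files is to reopen the reader and close the old one. A minimal sketch against the 2.9-era API (variable names are illustrative):

    IndexReader oldReader = reader;
    IndexReader newReader = oldReader.reopen(); // cheap no-op if nothing changed
    if (newReader != oldReader) {
        reader = newReader;
        oldReader.close(); // releases the descriptors on the (deleted) files
    }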

too many file descriptors opened by Lucene shows (deleted) in /proc

2009-09-03 Thread Ganesh
Hello all, On my Linux PC there are too many open file descriptors for the Lucene database. /proc/<pid>/fd shows a very big list. I have provided a sample below. lr-x------ 1 root root 64 Sep 3 17:02 360 -> /opt/ganesh/lucenedb/_2w5.tvf (deleted) lr-x------ 1 root root 64 Sep 3 17:0

Re: First result in the group

2009-09-03 Thread Ganesh
Thanks Shai and Mark for your suggestions. I initially tried DuplicateFilter and it is not giving me the expected results. It removes the duplicates at query time, not in the results. Regards Ganesh - Original Message - From: "mark harwood" To: Sent: Wednesday, September 02, 2009 5:36

Re: Extending Sort/FieldCache

2009-09-03 Thread Shai Erera
Thanks. I plan to look into two things, and then probably create two separate issues: 1) Refactor the FieldCache API (and TopFieldCollector) so that one can provide one's own cache of native values. I'd hate to rewrite the FieldComparator logic just because the current API is not extensible. That

Re: Extending Sort/FieldCache

2009-09-03 Thread Chris Hostetter
: I wanted to avoid two things: : * Writing the logic that invokes cache-refresh upon IndexReader reload. Uh... i don't think there is any code that does FieldCache refreshing on reload (yet), so you wouldn't be missing out on anything. (as long as your custom cache works at the SegmentReader level
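For reference, a rough sketch of what a pluggable comparator looks like against the 2.9-era FieldComparatorSource API (class and field names are illustrative; a real implementation would swap the FieldCache.DEFAULT call for its own per-segment cache, which is exactly the extension point being discussed):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.FieldComparator;
    import org.apache.lucene.search.FieldComparatorSource;

    public class CustomIntComparatorSource extends FieldComparatorSource {
      public FieldComparator newComparator(String field, int numHits,
                                           int sortPos, boolean reversed) {
        return new CustomIntComparator(field, numHits);
      }

      static class CustomIntComparator extends FieldComparator {
        private final int[] slots;   // values of the hits currently in the queue
        private final String field;
        private int[] current;       // per-segment values
        private int bottom;

        CustomIntComparator(String field, int numHits) {
          this.field = field;
          slots = new int[numHits];
        }
        public int compare(int slot1, int slot2) {
          // explicit comparison avoids integer-subtraction overflow
          return slots[slot1] < slots[slot2] ? -1
               : slots[slot1] == slots[slot2] ? 0 : 1;
        }
        public int compareBottom(int doc) {
          return bottom < current[doc] ? -1 : bottom == current[doc] ? 0 : 1;
        }
        public void copy(int slot, int doc) { slots[slot] = current[doc]; }
        public void setBottom(int slot) { bottom = slots[slot]; }
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
          // called once per segment: a custom cache would plug in here
          current = FieldCache.DEFAULT.getInts(reader, field);
        }
        public Comparable value(int slot) { return Integer.valueOf(slots[slot]); }
      }
    }

Used as: new Sort(new SortField("price", new CustomIntComparatorSource())).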

Re: Can this regex be done?

2009-09-03 Thread Robert Muir
Just a side note: LUCENE-1606 is intended to address exactly the performance issue you described. Rather than depending upon a constant prefix or enumerating terms, it can efficiently skip through the term dictionary. The downside is that this behavior depends upon the ability to compile a regex
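For context, the RegexpQuery API that eventually grew out of LUCENE-1606 (in later Lucene versions, not 2.9; a hedged sketch with an illustrative field name) looks like this:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;

    // The pattern is compiled into an automaton and intersected with the
    // term dictionary, instead of testing every term one by one.
    Query q = new RegexpQuery(new Term("body", "lucen[ea].*"));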

Re: Field.Store.NO & Field.Index.NOT_ANALYZED & hashCode

2009-09-03 Thread Chris Hostetter
: As for the exact matching, I am wondering if I should store the hashcode of : the text in a separate field and convert the text in the query to a hashcode : before passing it on, or if Lucene already does something like that under the : covers when it sees Field.Store.NO & Field.Index.NOT_ANALYZED
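The usual answer here is that Lucene does not hash field values under the covers, but it also doesn't need to: a NOT_ANALYZED field is indexed as one single term, so whole-value exact matching falls out naturally. A minimal sketch against the 2.9-era API (field name illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Index: the whole string becomes a single, untokenized term
    Document doc = new Document();
    doc.add(new Field("exact", text, Field.Store.NO, Field.Index.NOT_ANALYZED));

    // Search: an exact whole-value match -- no separate hashcode field needed
    Query q = new TermQuery(new Term("exact", queryText));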

Re: Can this regex be done?

2009-09-03 Thread Chris Hostetter
: Because some of the queries that I have to convert (without modifying : them, unfortunately) have literally half a page of statements : expressed like that that, if expanded, would equal a several-page-long : lucene query. FWIW: the RegexQuery (in contrib) applies the regex input to every term
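To contrast with the LUCENE-1606 approach above: the contrib RegexQuery enumerates the term dictionary and tests each term against the pattern, so its cost grows with the number of indexed terms (a hedged sketch; package location per the 2.9-era contrib):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.regex.RegexQuery;

    // Every indexed term in "body" is matched against the pattern.
    Query q = new RegexQuery(new Term("body", "lucen[ea].*"));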

RE: TokenStream API, Quick Question.

2009-09-03 Thread Uwe Schindler
The indexer only calls getAttribute/addAttribute once, after initialization (see docs). It will never call them later. If you cache tokens, you always have to restore the state into the TokenStream's attributes. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u
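That consumer contract looks like this (a minimal sketch against the 2.9-era attribute API; field name illustrative):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class); // fetched once, up front
    while (ts.incrementToken()) {
        // Same attribute instance on every iteration; only its value changes.
        // This is why cached tokens must be restored *into* these attributes.
        System.out.println(term.term());
    }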

TokenStream API, Quick Question.

2009-09-03 Thread Daniel Shane
Does a TokenStream always have to return the same number of attributes, with the same underlying classes, for all the tokens it generates? I mean, during the tokenization phase, can the first "token" have a Term and Offset Attribute and the second "token" only a Type Attribute, or does this mean

Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-03 Thread Daniel Shane
Ok, I got it: from checking other filters, I should call input.incrementToken() instead of super.incrementToken(). Do you feel this kind of breaks the object model (super.incrementToken() should also work)? Maybe when the old API is gone, we can stop checking if someone has overloaded next()
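A minimal skeleton of the pattern being described (2.9-era API): a filter delegates to the stream it wraps, because TokenFilter itself produces no tokens of its own (and in 2.9 the base-class path goes through the old-API backwards-compatibility layer):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public final class PassThroughFilter extends TokenFilter {
      public PassThroughFilter(TokenStream input) { super(input); }

      public boolean incrementToken() throws IOException {
        return input.incrementToken(); // not super.incrementToken()
      }
    }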

Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-03 Thread Daniel Shane
Uwe Schindler wrote: There may be a problem in that you may not want to restore the peeked token into the TokenFilter's attributes itself. It looks like you want to have a Token instance returned from peek, but the current stream should not reset to this Token (you only want to "look" into the next Token
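A hedged sketch of that idea using captureState/restoreState (2.9-era API; supports a single token of lookahead, class name illustrative). The filter shares its AttributeSource with the wrapped stream, so the peeked values must be rolled back after looking ahead:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class LookaheadTokenFilter extends TokenFilter {
      private State peeked; // AttributeSource.State of the lookahead token, if any

      public LookaheadTokenFilter(TokenStream input) { super(input); }

      /** Look at the next token without disturbing the visible attributes. */
      public boolean peek() throws IOException {
        State save = captureState();            // remember the token we are "on"
        boolean hasNext = input.incrementToken();
        if (hasNext) peeked = captureState();   // snapshot the lookahead token
        restoreState(save);                     // roll the shared attributes back
        return hasNext;
      }

      public boolean incrementToken() throws IOException {
        if (peeked != null) {                   // replay the token we peeked at
          restoreState(peeked);
          peeked = null;
          return true;
        }
        return input.incrementToken();
      }
    }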

Re: New "Stream closed" exception with Java 6

2009-09-03 Thread Grant Ingersoll
On Sep 2, 2009, at 7:45 AM, Chris Bamford wrote: Hi Grant, I have now followed Daniel's advice and catch the exception with: try { indexWriter.addDocument(doc); What does your Document/Field creation code look like? In other words, how do you construct doc? Seems like something
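One common culprit with "Stream closed" (an assumption here, since the construction code isn't shown): Reader-valued fields. A Reader can only be consumed once, and it is consumed during addDocument, so each Document needs its own fresh Reader (2.9-era API, illustrative field name):

    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // A fresh Reader per document; reusing one across addDocument calls
    // leads to reads on an exhausted or closed stream.
    doc.add(new Field("body", new FileReader(file)));
    indexWriter.addDocument(doc); // consumes the Reader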

Re: Use of tika for parsing, offsets questions

2009-09-03 Thread Grant Ingersoll
On Sep 2, 2009, at 5:40 AM, David Causse wrote: Hi, If I use Tika for parsing HTML code and inject the parsed String into a Lucene analyzer, what about the offset information for KWIC and return-to-text (like the Google cache view)? How can I keep track of the offsets between the Tika parser and Lucene

RE: Use of tika for parsing, offsets questions

2009-09-03 Thread Uwe Schindler
An additional good solution for Lucene (from 2.9 on) would be to create a special Tika analyzer that can directly add Tika-parseable content and metadata to the TokenStream as Attributes (using the new API), or only text and offset data (old Lucene TokenStream API). I wrote something similar
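Short of the attribute-based integration Uwe describes, the straightforward pipeline is to parse with Tika first and feed the extracted text to a normal Lucene analyzer (a hedged sketch; note this loses the original HTML offsets, which is exactly the problem being discussed):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    // BodyContentHandler collects the extracted body text
    // (default write limit is 100,000 characters).
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new AutoDetectParser().parse(inputStream, handler, metadata);

    Document doc = new Document();
    doc.add(new Field("body", handler.toString(), Field.Store.NO, Field.Index.ANALYZED));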

Re: Use of tika for parsing, offsets questions

2009-09-03 Thread Jukka Zitting
Hi, On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote: > If I use Tika for parsing HTML code and inject the parsed String into a Lucene > analyzer, what about the offset information for KWIC and return-to-text > (like the Google cache view)? How can I keep track of the offsets > between the Tika parser and