Possible bug in FieldSortedHitQueue?

2006-03-16 Thread Paul Cowan
Hi all, I'm loath to stick this in a Jira issue yet, until I've run it past you. I've been looking at it for a while so I'd like to make sure I haven't confused myself beyond belief and it IS actually a problem. It seems to me that there's a possible bug in FieldSortedHitQueue, specifically

Re: Multiple languages - possible approach

2006-03-16 Thread Paul Cowan
Hi Grant and Otis, Thanks for the feedback, I appreciate it. You've given some good ideas. Sounds like a really interesting system! I am curious, are your users fluent in multiple languages or are you using some type of translation component? The former. We're talking about construction pro

Re: restart interrupted index

2006-03-16 Thread Chris Hostetter
: If I interrupt my IndexWriter with a kill signal, most of the time I : will be left with a lock file AND corrupted index files (the searcher : will throw some IllegalStateExceptions after the lock file is : deleted). if you are trying to deal with the possibility that your indexing process migh

Re: use of ChainedFilter

2006-03-16 Thread Chris Hostetter
: ChainedFilter: [views:[0.4-0.6] level:[1-} ] : : i am concerned about not being able to see the logical operator in the : print string. Should i be able to see the operator? I've never looked at it closely, but a quick glance at the source indicates that the toString does not make any attempt t

Re: restart interrupted index

2006-03-16 Thread Paulo Silveira
Chris, I really would like only these extra files, but I have the same problem here. If I interrupt my IndexWriter with a kill signal, most of the time I will be left with a lock file AND corrupted index files (the searcher will throw some IllegalStateExceptions after the lock file is deleted). P

Re: ConstantScoreRangeQuery and ConstantScoreQuery

2006-03-16 Thread Chris Hostetter
: And I read the following issue again: : : ConstantScoreRangeQuery - fixes "too many clauses" exception : http://issues.apache.org/jira/browse/LUCENE-383 : : But still, I cannot understand very well why ConstantScoreQuery comes out. : Is it for to implement ConstantScoreRangeQuery? Or, is it used

re: restart interrupted index

2006-03-16 Thread Chris Hostetter
: I'm relatively new to Lucene and I've been trying to index a large : number of html files. If my operation is interrupted the index : appears to be corrupted. I can no longer open it for searching with : IndexSearcher (and no amount of toying with Luke's options seems to : help if I try to bro

Re: ConstantScoreRangeQuery and ConstantScoreQuery

2006-03-16 Thread Yonik Seeley
On 3/16/06, Koji Sekiguchi <[EMAIL PROTECTED]> wrote: > But still, I cannot understand very well why ConstantScoreQuery comes out. > Is it for to implement ConstantScoreRangeQuery? Or, is it used for something > by itself? ConstantScoreQuery can wrap any Filter and gives a constant score for every

Lucene job

2006-03-16 Thread Otis Gospodnetic
Hello, Somebody asked me if I knew any good Lucene people who'd be interested in some work that involves a good amount of Lucene... Here is some info. The company is in New York City. Full-time or contractors. Ideally local, but remote work with good candidates may be ok, too. Work involves L

ConstantScoreRangeQuery and ConstantScoreQuery

2006-03-16 Thread Koji Sekiguchi
Hello, At Doug's hand on recent thread "Re: TooManyClauses exception in Lucene (1.4)", I could understand why ConstantScoreRangeQuery was added in Lucene 1.9. I appreciate that. And I read the following issue again: ConstantScoreRangeQuery - fixes "too many clauses" exception http://issues.apach

RE: question...

2006-03-16 Thread Aditya Liviandi
All the index files will be in a file. Has anyone written a module for lucene that provides an alternative IO method? Instead of FSDirectory, it reads out of a stream? -Original Message- From: hu andy [mailto:[EMAIL PROTECTED] Sent: Friday, March 17, 2006 9:24 AM To: java-user@lucene.apa

Re: question...

2006-03-16 Thread hu andy
Do you mean you pack the index files into the file *.luc? If that is the case, Lucene can't read it. If you put the index files and *.luc together under some directory, that's OK. Lucene knows how to find these files. 2006/3/14, Aditya Liviandi <[EMAIL PROTECTED]>: > > Hi all, > > > > If I want to embed

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Yonik Seeley
On 3/16/06, Nick Atkins <[EMAIL PROTECTED]> wrote: > Hi Yonik, I'm not actually using any IndexReaders, just IndexWriters > and IndexSearchers. An IndexSearcher contains an IndexReader. > I only get an IndexReader when I'm doing deletes > but that isn't the case in this test. Opening an IndexR

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Hi Yonik, I'm not actually using any IndexReaders, just IndexWriters and IndexSearchers. I only get an IndexReader when I'm doing deletes but that isn't the case in this test. I definitely optimize() and close() each IndexWriter when it's done writing its documents (about 200). Anyway, I the pr

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Yonik Seeley
On 3/16/06, Nick Atkins <[EMAIL PROTECTED]> wrote: > Yes, indexing only right now, although I can issue the odd search to > test it's being built properly. Ahh, as Otis suggests, it's probably IndexReader(s) that are exhausting the file descriptors. Are you explicitly closing the old IndexReade

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Yes, indexing only right now, although I can issue the odd search to test it's being built properly. My test (indexing 4+ messages in a user's mailbox) causes BatchUpdater thread to write everything to the index approx every 15-17 seconds. The logs say: [EMAIL PROTECTED] bin]# tail -f ../lo

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Otis Gospodnetic
This happens when you are doing indexing only!? Wow, I've never seen that. Try posting your code in the form of a unit test. Otis - Original Message From: Nick Atkins <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, March 16, 2006 6:28:52 PM Subject: Re: Lucene and To

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Hi Doug, I have experimented with a mergeFactor of 5 or 10 (default) but it didn't help matters once I reached the ulimit. I understand how the mergeFactor affects Lucene's performance. I am actually not doing any searches with IndexReader right now, just indexing. Yes, I do store and reuse the

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Erick Erickson
Thanks very much for your reply, I appreciate you taking the time. Erick

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Doug Cutting
Are you changing the default mergeFactor or other settings? If so, how? Large mergeFactors are generally a bad idea: they don't make things faster in the long run and they chew up file handles. Are all searches reusing a single IndexReader? They should. This is the other most common reason

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Thanks Hannes, on my Fedora machine the maximum I can do is ulimit -n 1048576 which is 1M files. This should be enough for most sane cases but it makes me uneasy. I assume the "deleted" file entries reported by lsof will be cleared up eventually? I can't believe this is really the only option a

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Paulo Silveira
Nick it is a guess, but the only difference between my approach and yours is that I am optimizing as soon as I open the writer, and you are optimizing after the last (100th) document is written. At the same time I am using: writer.setUseCompoundFile(true); writer.

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Hannes Carl Meyer
Hi Nick, use 'ulimit' on your *nix system to check if it's set to unlimited. check: http://wwwcgi.rdg.ac.uk:8081/cgi-bin/cgiwrap/wsi14/poplog/man/2/ulimit You don't have to set it to unlimited, maybe increasing the number will help. later Hannes Nick Atkins wrote: Thanks Otis, I tried th

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Thanks Otis, I tried that but I still get the same problem at the ulimit -n point. I assume you meant I should call IndexWriter.setUseCompoundFile(true). According to the docs compound structure is the default anyway. Any further thoughts? Anything I can tweak in the OS (Linux), Java (1.5.0) or

Re: fuzzy phrase query?

2006-03-16 Thread karl wettin
On 16 Mar 2006, at 11:47, Erik Hatcher wrote: This can be done with some work to implement a SpanFuzzyQuery (similar to the SpanRegexQuery in contrib/regex currently) and using SpanNearQuery instead of a PhraseQuery. Thanks, I'll check it out. Performance is at risk doing such a query as all

re: restart interrupted index

2006-03-16 Thread Michael Dodson
I'm relatively new to Lucene and I've been trying to index a large number of html files. If my operation is interrupted the index appears to be corrupted. I can no longer open it for searching with IndexSearcher (and no amount of toying with Luke's options seems to help if I try to browse

RE: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Pasha Bizhan
Hi, > From: Doug Cutting [mailto:[EMAIL PROTECTED] > > The primary advantage of a RangeQuery is that the ranking > incorporates the degree of match of each term in the range, > which may be useful for wildcard-like searches but is useless > for date-like searches. Also, RangeQuery allows to

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Thanks Paulo, I actually do something very similar. I have a queue of all pending updates and a Thread that manages the queue. When the queue gets about 100 big or is 30 seconds old (whichever comes sooner) I process it which results in all the Index writes. I also always optimize() and close()

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Doug Cutting
Erick Erickson wrote: Could you point me to any explanation of *why* range queries expand this way? It's just what they do. They were contributed a long time ago, before things like RangeFilter or ConstantScoreRangeQuery were written. The latter are relatively recent additions to Lucene and

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Paulo Silveira
Nick! I also had the same problem. Now on my SearchEngine class, when I write a document to the index, I check if the number of documents mod 100 is 0. if it is, optimize(). Optimize() reduces the number of segments used by the index, so the number of open files also is reduced. Take a look:

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Otis Gospodnetic
The easiest first step to try is to go from multi-file index structure to the compound one. Otis - Original Message From: Nick Atkins <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, March 16, 2006 3:00:59 PM Subject: Lucene and Tomcat, too many open files Hi, What'

Lucene and Tomcat, too many open files

2006-03-16 Thread Nick Atkins
Hi, What's the best way to manage the number of open files used by Lucene when it's running under Tomcat? I have an indexing application running as a web app and I index a huge number of mail messages (upwards of 4 in some cases). Lucene's merging routine always craps out eventually with the

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Erick Erickson
When I read LIA, I was struck by this issue, and it seemed...er...like an easy mistake to make. Given that my impression of Lucene is that it's extraordinarily well designed, I assume that there must be a good reason for expanding range queries this way. Could you point me to any explanation of *w

Re: IndexFiles

2006-03-16 Thread miki sun
Thank you for reply, my Java run time environment did not work, that's why. It is fixed now. Miki Original Message Follows From: Erik Hatcher <[EMAIL PROTECTED]> Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: IndexFiles Date: Thu, 16 Mar 2006 09:33:37

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Ken Krugler
i wrote a patch for this and the difference is unbelievable, the memory footprint has been cut almost in half and it seems like performance is basically the same if not better!!! if anyone is interested let me know Best approach here is to open up a Jira issue, then submit the patch

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Yonik Seeley
On 3/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > I had no idea that rangequery worked by enumerating every > possible value, that's terrifying. You could use either a RangeFilter or a ConstantScoreRangeQuery -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Ser

RE: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Tim.Wright
Ouch! Yes, we're indexing with seconds, that's almost certainly the problem. :( I had no idea that rangequery worked by enumerating every possible value, that's terrifying. We have a requirement to index data going back for about 20 years, though, and although daily resolution would be fine, this

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Otis Gospodnetic
Tim, This is possibly a lot of days: date:[2005-03-16 TO 2006-03-16] And if your 'date' field is more granular than 'a day', then this is a lot more hours/minutes/seconds/milliseconds. Your range query is expanded to all unique values in the range. This is probably in the FAQ, but if not, lo
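
To put rough numbers on the expansion described above, here is a minimal plain-Java sketch (no Lucene classes involved). It only counts how many distinct values could fall inside the example range at day versus second resolution; the class and method names are made up for illustration, and the count is an upper bound that assumes every possible value actually occurs in the index. For comparison, the default BooleanQuery clause limit is, as far as I know, 1024.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Lucene: counts the distinct values a
// date range could expand to at a given resolution.
public class RangeExpansion {

    /**
     * Upper bound on the number of distinct values between midnight of
     * 'from' and midnight of 'to' (both inclusive) at the given resolution.
     */
    static long possibleValues(LocalDate from, LocalDate to, ChronoUnit unit) {
        LocalDateTime start = from.atStartOfDay();
        LocalDateTime end = to.atStartOfDay();
        return unit.between(start, end) + 1; // +1 because both ends are included
    }

    public static void main(String[] args) {
        // The range from the message: date:[2005-03-16 TO 2006-03-16]
        LocalDate from = LocalDate.of(2005, 3, 16);
        LocalDate to = LocalDate.of(2006, 3, 16);

        System.out.println("day resolution:    " + possibleValues(from, to, ChronoUnit.DAYS));
        System.out.println("second resolution: " + possibleValues(from, to, ChronoUnit.SECONDS));
    }
}
```

At day resolution the one-year range stays in the hundreds of values, while at second resolution it runs into the tens of millions, which is why indexing dates at the coarsest resolution that still meets the requirement keeps range queries manageable.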

About index deletion

2006-03-16 Thread hu andy
Because I will delete indexed documents periodically, the index files must be deleted after that. If I just want to delete from the index the documents that were added before some past day, how should I do it? Thank you in advance

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Doug Cutting
Great! Can you please share your changes? The best way to do this is to: 1. Check Lucene's trunk out from subversion, with: svn co http://svn.apache.org/repos/asf/lucene/java lucene-trunk 2. Make your changes. Use 'svn add' for new files, like unit tests. Please try to conform to the S

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread zzzzz shalev
i wrote a patch for this and the difference is unbelievable, the memory footprint has been cut almost in half and it seems like performance is basically the same if not better!!! if anyone is interested let me know Doug Cutting <[EMAIL PROTECTED]> wrote: RAMDirectory is indeed curre

use of ChainedFilter

2006-03-16 Thread Urvashi Gadi
Hi, I am using ChainedFilter to combine various filters. No matter which logical operator i try to apply to all filters, when i try to print the chained filters using toString() method, i see ChainedFilter: [views:[0.4-0.6] level:[1-} ] i am concerned about not being able to see the logical

TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Tim.Wright
Hi, We're using queryparser to generate my queries (not ideal, and we're planning on rewriting it, but at the moment we don't have the resources to do so). We have a default field "text" which contains all of our text fields, and a "date" field which is just a string field in the format -MM-

Business stop words?

2006-03-16 Thread Jeff Rodenburg
Does anyone have a lead on "business" stop words? Things like "inc", "llc", "md", etc. I'd rather not reinvent this wheel. :-) cheers, jeff

FunctionQuery example request

2006-03-16 Thread Paul Lynch
Hi, I have implemented the DistanceComparatorSource example from Lucene In Action (my Bible) and it works great. We are now in the situation where we have nearly a million documents in our index and the performance of this implementation has degraded. I have downloaded and am trying to understand t

Re: PhraseQuery

2006-03-16 Thread Erik Hatcher
On Mar 16, 2006, at 5:10 AM, Waleed Tayea wrote: I'm using the QueryParser to parse and return a query of a search string of a single word. But the analyzer I use emits other morphological tokens from that single word. How can I prevent the QueryParser from considering the search query as a P

Re: IndexFiles

2006-03-16 Thread Erik Hatcher
The registry setting is probably irrelevant. What does "java -version" report? Erik On Mar 16, 2006, at 6:07 AM, miki sun wrote: Hi I am trying to use Lucene1.9.1 to index files on my computer. According to the FAQ of the website: - What Java version is required to run Lucene? Luc

Re: Multiple languages - possible approach

2006-03-16 Thread Grant Ingersoll
Hi Paul, Sounds like a really interesting system! I am curious, are your users fluent in multiple languages or are you using some type of translation component? Some comments below and a few thoughts here. How are you querying? Are users entering mixed language queries too? Do you have a

Re: Best design for an use case which is going to stress Lucene

2006-03-16 Thread Michael D. Curtin
Terenzio Treccani wrote: You're both right, this doesn't sound like Lucene at all... But the problem of such SQL tables is their size: speaking about millions of customers and thousands of news items, the many-to-many (CustArt) table would end up containing BILLIONS of lines A bit too big

IndexFiles

2006-03-16 Thread miki sun
Hi I am trying to use Lucene1.9.1 to index files on my computer. According to the FAQ of the website: - What Java version is required to run Lucene? Lucene 1.4 will run with JDK 1.3 and up but requires at least JDK 1.4 to compile. Lucene >= 1.9 requires Java 1.4. But I got the following error

Re: fuzzy phrase query?

2006-03-16 Thread Erik Hatcher
On Mar 16, 2006, at 2:40 AM, karl wettin wrote: Is it possible to make a phrase query fuzzy? What do you mean by a fuzzy phrase query? As in each term in the phrase is treated as a FuzzyQuery essentially such that "kool kat" matches "cool cat"? This can be done with some work to impleme

Re: Best design for an use case which is going to stress Lucene

2006-03-16 Thread Terenzio Treccani
You're both right, this doesn't sound like Lucene at all... But the problem of such SQL tables is their size: speaking about millions of customers and thousands of news items, the many-to-many (CustArt) table would end up containing BILLIONS of lines A bit too big even for an Oracle table, I

Re: fuzzy phrase query?

2006-03-16 Thread karl wettin
On 16 Mar 2006, at 08:40, karl wettin wrote: Is it possible to make a phrase query fuzzy? It could be a quick and not so dirty replacement for hidden markov models and thus produce great results for spell checking and other natural language classifications. Perhaps it is easier to make a Spa

PhraseQuery

2006-03-16 Thread Waleed Tayea
Dear All. I'm using the QueryParser to parse and return a query of a search string of a single word. But the analyzer I use emits other morphological tokens from that single word. How can I prevent the QueryParser from considering the search query as a PhraseQuery with the terms of that single wo

RE: Vector Space Model <-> Probabilistic Model

2006-03-16 Thread Karl Koch
It was published by Norbert Fuhr in the IR Summer School Proceedings. I found it via Google by using the small extension ext:pdf :-) that time... http://www.is.informatik.uni-duisburg.de/bib/pdf/ir/Fuhr:00a.pdf In return, you can do me also a favour and email me (personally, if you like since thi

Re: Reading stop word from a file!

2006-03-16 Thread karl wettin
On 16 Mar 2006, at 08:53, Supheakmungkol SARIN wrote: Dear Luceners, I wonder if there is any pre-defined option to read stop-word from a file? /** Builds an analyzer with the stop words from the given file. * @see WordlistLoader#getWordSet(File) */ public StopAnalyzer(File stopwor
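
For readers who want to see what reading a stop-word list amounts to, here is a minimal plain-Java sketch. It is not Lucene's WordlistLoader implementation; the class name, the one-word-per-line file layout, and the trimming and blank-line handling are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: reads one stop word per line.
public class StopWordFile {

    /** Reads stop words from the reader, trimming whitespace and skipping blank lines. */
    static Set<String> load(Reader in) {
        return new BufferedReader(in).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // In-memory stand-in for a stop-word file; a FileReader would work the same way.
        Reader sample = new StringReader("inc\nllc\n\n md \n");
        System.out.println(load(sample));
    }
}
```

The resulting set could then be handed to whichever analyzer constructor accepts a collection of stop words; the File-based constructor quoted above avoids the need for a helper like this.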