Index Optimization Issue

2008-03-11 Thread masz-wow
I managed to optimize my index successfully. The problem I'm having now is that when I check the index using the Lucene Index Toolbox, a few of the files in the index are marked deletable. I understand that the optimize method merges the index files, but how come there are still deletable index files i

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 10:20:12 Michael McCandless wrote: > Oh, so you do not see the problem with SerialMergeScheduler but you > do with ConcurrentMergeScheduler? [...] > Oh, there are no deletions?  Then this is very strange.  Is it   > optimize that messes up the docIDs?  Or, is it when you

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote: > But to me, it always seems...er...fraught to even *think* about relying > on doc ids. I know you've been around the block with Lucene, but do you > have a compelling reason to use the doc ID and not your own unique ID? From memory it was

Re: Help with Fuzzy Queries

2008-03-11 Thread Chris Hostetter
: When I try to look for something similar with "FALHA DE DISJUNTOR", I've : got the following results: : Result | score : FALHA DE COMANDO | 0.9277342 : ATUAÇÃO FALHA DE DISJUNTOR | 0.8880876 : RESET DE FALHA DE DISJUNTOR | 0.5709133 your best bet to make sense of scoring i

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Michael McCandless
Daniel Noll wrote: On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote: Hi Daniel, 2.3 should be no different from 2.2 in that docIDs only "shift" when a merge of segments with deletions completes. Could it be the ConcurrentMergeScheduler? Merges now run in the background by default

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Erick Erickson
But to me, it always seems...er...fraught to even *think* about relying on doc ids. I know you've been around the block with Lucene, but do you have a compelling reason to use the doc ID and not your own unique ID? Best Erick On Tue, Mar 11, 2008 at 5:39 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:

Re: Query for "Bigger than" specific term

2008-03-11 Thread Chris Hostetter
: I know there's a range query where I can use a large upper bound but maybe : there's something more efficient (instead of Lucene transforming the query into : thousands of OR queries). If you use ConstantScoreRangeQuery then there is no transformation per term - just a uniform score if the document
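The ConstantScoreRangeQuery approach suggested here can be sketched roughly as follows (Lucene 2.x-era API; the field name, the helper wrapper, and the use of NumberTools for a sortable numeric encoding are illustrative assumptions, not part of the original thread):

```java
import org.apache.lucene.document.NumberTools;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Query;

public class GreaterThanQuery {
  /** "field > n" as an open-ended constant-score range. The range is
      matched via the term index directly, so it is never rewritten into
      thousands of OR'ed TermQuery clauses. The field must have been
      indexed in a lexicographically sortable form, e.g. zero-padded
      via NumberTools.longToString(). */
  public static Query greaterThan(String field, long n) {
    return new ConstantScoreRangeQuery(
        field,
        NumberTools.longToString(n),  // lower bound
        null,                         // null = no upper bound
        false,                        // strictly greater: exclude lower
        true);
  }
}
```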

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Daniel Noll
On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote: > Hi Daniel, > > 2.3 should be no different from 2.2 in that docIDs only "shift" when > a merge of segments with deletions completes. > > Could it be the ConcurrentMergeScheduler? Merges now run in the > background by default and commit w

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
On 03/11/2008 at 11:48 AM, Steven A Rowe wrote: > 5 billion docs is within the range that Lucene can handle. I > think you should try doc = element and see how well it works. Sorry, Eran, I was dead wrong about this assertion. See this thread for more information:

Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

2008-03-11 Thread Otis Gospodnetic
Trust Yonik, and trust me when I say this, please. I, too, would guess that you'll need more than 1 machine (that's what Yonik meant by distributed search) to handle search against a 2B doc index, even if docs are small. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch -

Re: Best way to do Query inflation?

2008-03-11 Thread Otis Gospodnetic
I don't have the source code at hand, but have a look at Solr, as it has support for synonyms (and you could treat those extra terms as synonyms, it seems). You don't have to switch from Lucene to Solr if you don't want to, of course; you could simply look at how Solr does it. Otis -- Sematext --

Re: Biggest index

2008-03-11 Thread Otis Gospodnetic
Questions like these are always hard to answer well. Actually, no, they are easy, right Erik: "It depends" ;) Just kidding...partially. Anyhow, you should ask a few more questions then: - what is the response latency? (average, median, Nth percentile...) - are stored fields involved, if so how

Re: Query for "Bigger than" specific term

2008-03-11 Thread Otis Gospodnetic
I don't think there is anything more efficient than that... but I could be wrong. If you can, consider grouping > 10 values into a small and discrete set of buckets (that you can then OR), if you are concerned with a large disjunction query. Otis -- Sematext -- http://sematext.com/ -- Lucene - So

Re: IndexWriter.doAfterFlush()

2008-03-11 Thread Michael McCandless
Woops -- that was not intentional. It should be called for both additions and deletions, so, it's safe to fix it in your local IndexWriter.java, and I'll open a Jira, add a unit test and fix it in Lucene. Thanks for raising this! Mike Mark Ferguson wrote: Hi everyone, I have written a

IndexWriter.doAfterFlush()

2008-03-11 Thread Mark Ferguson
Hi everyone, I have written an extension of IndexWriter which overrides its doAfterFlush() method. I had no problems with this in Lucene 2.2 but I noticed that in Lucene 2.3 the behavior of doFlush() has been changed so as to only call doAfterFlush() on deletions rather than on both deletions and
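The kind of IndexWriter subclass Mark describes might look like this minimal sketch (Lucene 2.3-era API; the class name and the body of the hook are hypothetical — under 2.3, per the bug discussed above, the callback only fires on flushes that apply deletions):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class NotifyingIndexWriter extends IndexWriter {
  public NotifyingIndexWriter(Directory d, Analyzer a, boolean create)
      throws IOException {
    super(d, a, create);
  }

  /** Hook invoked by IndexWriter after a flush completes. */
  protected void doAfterFlush() throws IOException {
    // custom post-flush work, e.g. signalling that a shared
    // IndexReader should be reopened
    System.out.println("flush completed");
  }
}
```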

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread Erick Erickson
You could also think about making a filter, probably when you open your searcher. You can use TermDocs/TermEnum to find all of the documents that *do* have entries for your field, assemble those into a filter, then invert that filter. Keep the filter around and use it whenever you need to. Perhaps
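Erick's build-then-invert idea might be sketched like this (Lucene 2.x API; the utility-class wrapper is illustrative, and note that inverting the BitSet also sets bits for deleted docs, which callers may need to mask):

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class MissingFieldFilterUtil {
  /** BitSet with a bit set for every doc that has NO term in "field";
      usable from a Lucene 2.x Filter.bits(IndexReader) implementation. */
  public static BitSet docsMissingField(IndexReader reader, String field)
      throws IOException {
    BitSet hasField = new BitSet(reader.maxDoc());
    TermEnum terms = reader.terms(new Term(field, ""));
    TermDocs termDocs = reader.termDocs();
    try {
      while (terms.term() != null && terms.term().field().equals(field)) {
        termDocs.seek(terms);           // all docs containing this term
        while (termDocs.next()) {
          hasField.set(termDocs.doc());
        }
        if (!terms.next()) break;
      }
    } finally {
      termDocs.close();
      terms.close();
    }
    hasField.flip(0, reader.maxDoc());  // invert: docs WITHOUT the field
    return hasField;
  }
}
```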

Re: Unique Fields

2008-03-11 Thread Erick Erickson
You can easily find whether a term is in the index with TermEnum/TermDocs (I think TermEnum is all you really need). Except, you'll probably also have to keep an internal map of IDs added since the searcher was opened and check against that too. Best Erick On Tue, Mar 11, 2008 at 11:04 AM, Ion B
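The TermDocs existence check plus the in-memory map of recently added IDs might be sketched as (Lucene 2.x API; the class, field name "id", and set handling are assumptions for illustration):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class UniqueIdChecker {
  // IDs added since the reader was opened; they are not yet visible to it.
  private final Set pendingIds = new HashSet();

  /** true if "id" already exists in the index or was added this session. */
  public boolean exists(IndexReader reader, String id) throws IOException {
    if (pendingIds.contains(id)) return true;
    TermDocs td = reader.termDocs(new Term("id", id));
    try {
      return td.next();   // any posting at all means the term exists
    } finally {
      td.close();
    }
  }

  public void noteAdded(String id) {
    pendingIds.add(id);
  }
}
```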

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
Hi Eran, On 03/11/2008 at 12:26 PM, Eran Sevi wrote: > If I query this index structure and get results from several > xml docs, is there a better way to group results by doc id, > other than iterating on all results, get original document > and check the value of xml_doc_id field? Perhaps a Sort
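The Sort-based grouping hinted at here might look like this (Lucene 2.x API; assumes xml_doc_id was indexed untokenized so it is sortable, and the wrapper method is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class GroupedSearch {
  /** Sort hits by xml_doc_id so results from the same source XML file
      come back adjacent and can be grouped in a single pass. */
  public static Hits searchGrouped(IndexSearcher searcher, Query query)
      throws IOException {
    return searcher.search(query, new Sort(new SortField("xml_doc_id")));
  }
}
```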

Query for "Bigger than" specific term

2008-03-11 Thread Eran Sevi
Hi, What's the best way to query Lucene for a "bigger than" term, for example "value > 10". I know there's a range query where I can use a large upper bound but maybe there's something more efficient (instead of Lucene transforming the query into thousands of OR queries). Thanks, Eran.

Re: Specialized XML handling in Lucene

2008-03-11 Thread Eran Sevi
Thanks Steve for the quick reply. Another question regarding this solution: If I query this index structure and get results from several xml docs, is there a better way to group results by doc id, other than iterating on all results, get the original document and check the value of the xml_doc_id field?

RE: Specialized XML handling in Lucene

2008-03-11 Thread Steven A Rowe
Hi Eran, see my comments below inline: On 03/11/2008 at 9:23 AM, Eran Sevi wrote: > I would like to ask for suggestions of the best design for > the following scenario: > > I have a very large number of XML files (around 1M). > Each file contains several sections. Each section contains > many ele

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread eks dev
You said that if an index is optimized, isDeleted() does not present a performance problem? I think there is still a null check in a synchronized method; can the JVM optimize this? I doubt it. - Original Message From: German Kondolf <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tu

Unique Fields

2008-03-11 Thread Ion Badita
Hi, I want to create an index with one unique field. Before inserting a document I must be sure that the "unique field" is unique. John

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread German Kondolf
Yes, my index is a "full-snapshot" created every "n" hours; there are no incremental updates, so I decided to make another MatchAllDocsQuery, taking advantage of the fact that my index is read-only, and basically removing these checks. Regards Ger [EMAIL PROTECTED] On Tue, Mar 11, 2008 at 11:54 AM, Yonik Seele

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread Yonik Seeley
On Tue, Mar 11, 2008 at 10:41 AM, German Kondolf <[EMAIL PROTECTED]> wrote: > *:* is parsed as a MatchAllDocsQuery? > > I've got some performance issues in Lucene 2.2 because > MatchAllDocsQuery asks for isDeleted() for every document, I didn't > try it in 2.3. That will still be the case

Re: IndexSearcher thread safety

2008-03-11 Thread German Kondolf
As Michael said, you can share it, and you should share it; this will improve performance and reuse the internal caches associated with the IndexSearcher (term cache, filter cache, etc.). On Tue, Mar 11, 2008 at 7:31 AM, J B <[EMAIL PROTECTED]> wrote: > Hi, > > Are instances of IndexSearcher thread

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread German Kondolf
*:* is parsed as a MatchAllDocsQuery? I've got some performance issues in Lucene 2.2 because MatchAllDocsQuery asks for isDeleted() for every document, I didn't try it in 2.3. On Tue, Mar 11, 2008 at 11:34 AM, Mark Miller <[EMAIL PROTECTED]> wrote: > You cannot have a purely negative query l

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread German Kondolf
Hi, I was looking for the same functionality; after a bit of googling I didn't find a solution. I assume it must exist, but I finally decided to "fill" those empty fields with a representative "null value", "__null__". This is possible only if you know ALL the fields in advance. I'd like to know if t
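The sentinel-value workaround described here could look like the following at index time (Lucene 2.x API; the field name, storage flags, and the "__null__" token itself are illustrative — the token just has to be a value that never occurs in real data):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NullFieldIndexer {
  /** Index a sentinel token when the value is absent, so that a query
      on MY_FIELD_NAME:__null__ finds the "empty" documents later. */
  public static void addWithNullSentinel(Document doc, String value) {
    if (value == null || value.length() == 0) {
      doc.add(new Field("MY_FIELD_NAME", "__null__",
                        Field.Store.NO, Field.Index.UN_TOKENIZED));
    } else {
      doc.add(new Field("MY_FIELD_NAME", value,
                        Field.Store.YES, Field.Index.TOKENIZED));
    }
  }
}
```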

Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread Mark Miller
You cannot have a purely negative query like you can in Solr. Try: *:* -MY_FIELD_NAME:[* TO *] thogau wrote: Hi, I browsed the forum searching for a way to make a query that retrieves documents that do not have any value for a given field (say MY_FIELD_NAME). I read several posts advising to
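A programmatic equivalent of the query above can be built like this (Lucene 2.x API; the wrapper class is illustrative — MatchAllDocsQuery supplies the positive clause that a purely negative BooleanQuery lacks in stock Lucene):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.MatchAllDocsQuery;

public class MissingFieldQuery {
  /** Equivalent of  *:* -field:[* TO *]  built programmatically. */
  public static BooleanQuery build(String field) {
    BooleanQuery q = new BooleanQuery();
    q.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
    // null bounds = fully open range, i.e. any term at all in the field
    q.add(new ConstantScoreRangeQuery(field, null, null, false, false),
          BooleanClause.Occur.MUST_NOT);
    return q;
  }
}
```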

Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-11 Thread thogau
Hi, I browsed the forum searching for a way to make a query that retrieves documents that do not have any value for a given field (say MY_FIELD_NAME). I read several posts advising to use this syntax: -MY_FIELD_NAME:[* TO *] However, I am not able to get it working... I have 2 documents, t

Specialized XML handling in Lucene

2008-03-11 Thread Eran Sevi
Hi, I would like to ask for suggestions of the best design for the following scenario: I have a very large number of XML files (around 1M). Each file contains several sections. Each section contains many elements (about 1000-5000). Each element has a value and some attributes describing the value

Re: IndexSearcher thread safety

2008-03-11 Thread Michael McCandless
They are thread safe. You should share a single instance across multiple threads. Mike J B wrote: Hi, Are instances of IndexSearcher thread safe? In other words, should each thread have its own instance of IndexSearcher, or could I share a single one between many threads, to avoid
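Sharing one searcher can be as simple as a lazily initialized holder (a sketch; the class, the String-path constructor use, and the lack of any reopen logic are assumptions for brevity):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
  private static IndexSearcher searcher;

  /** One searcher shared by all threads; its internal caches (sort
      caches, filter caches keyed on the underlying reader) are then
      reused instead of being rebuilt per request. */
  public static synchronized IndexSearcher get(String indexPath)
      throws IOException {
    if (searcher == null) {
      searcher = new IndexSearcher(indexPath);
    }
    return searcher;
  }
}
```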

IndexSearcher thread safety

2008-03-11 Thread J B
Hi, Are instances of IndexSearcher thread safe? In other words, should each thread have its own instance of IndexSearcher, or could I share a single one between many threads, to avoid constantly opening and closing new instances? Many thanks! -J.

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-11 Thread Michael McCandless
Hi Daniel, 2.3 should be no different from 2.2 in that docIDs only "shift" when a merge of segments with deletions completes. Could it be the ConcurrentMergeScheduler? Merges now run in the background by default and commit whenever they complete. You can get back to the previous (block
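Getting back the pre-2.3 foreground-merge behavior Michael mentions is a single call (Lucene 2.3 API; the wrapper method is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SerialMergeScheduler;

public class MergeConfig {
  /** Run merges serially in the calling thread, as 2.2 did, instead of
      in background threads via the default ConcurrentMergeScheduler. */
  public static void useForegroundMerges(IndexWriter writer)
      throws IOException {
    writer.setMergeScheduler(new SerialMergeScheduler());
  }
}
```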