Re: Boolean expression for no terms OR matching a wildcard

2008-07-18 Thread eks dev
Analyzer that detects your condition "ALL match something", if possible at all... e.g. "800123456 80034543534 80023423423" -> 800, then you put it in ALL_MATCH field and match this condition against it... if this prefix needs to be variable, you could extract all matching prefixes to this fiiel
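
The prefix extraction suggested above can be pictured as follows: at index time, compute the prefix shared by every phone number in the record and store it in a separate field. This is a minimal sketch in plain Python, not Lucene's Java API; the helper name `all_match_prefix` and the dictionary standing in for a document are assumptions made for illustration.

```python
import os

def all_match_prefix(phone_field: str) -> str:
    """Return the longest prefix shared by every whitespace-separated number."""
    numbers = phone_field.split()
    # os.path.commonprefix compares character by character,
    # so it works on arbitrary strings, not only on paths.
    return os.path.commonprefix(numbers) if numbers else ""

# A dictionary stands in for an indexed document (illustrative only).
doc = {"phone": "800123456 80034543534 80023423423"}
doc["ALL_MATCH"] = all_match_prefix(doc["phone"])
```

A record with no phone numbers gets an empty ALL_MATCH value, so the "no terms" case would still need its own marker to be searchable.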

Re: Lucene & XFile interface

2008-07-18 Thread Chris Hostetter
: remote drive. Is there a way to easily modify lucene such that when it reads / : writes from the Index it uses the XFile object instead of File? In this way, : there is a lot more flexibility on where the index can be stored (without : having to rely on operating system mount points). Is it cor

Re: Boolean expression for no terms OR matching a wildcard

2008-07-18 Thread Chris Hostetter
: Maybe this is easier ... suppose what I'm indexing is a phone number, and : there are multiple phone numbers for what I'm indexing under the same field : (phone) and I want the wildcard query to match only records that have either : no phone numbers at all OR where ALL phone numbers are in a spec

Re: How did Lucene clean out the deleted documents from the disk?

2008-07-18 Thread Michael McCandless
That's right, you do not need to run optimize. Over time the disk space will gradually be reclaimed through Lucene's normal merging... Mike dan at gmail wrote: Also, over time, as segments that have marked deletions are merged, the disk space is also reclaimed. Thanks Mike. So can I

RE: Bug in CJKTokenizer

2008-07-18 Thread Scott Smith
I'm certainly not a language expert and so you may be correct. I do see references to some of the eastern European languages in the descriptions of these two and so maybe they should be added as well. -Original Message- From: Steven A Rowe [mailto:[EMAIL PROTECTED] Sent: Friday, July 1

RE: Bug in CJKTokenizer

2008-07-18 Thread Steven A Rowe
Hi Scott, I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively. Steve On 07/18/2008 at 5:03 PM, Scott Smith wrote: > org.apache.lucene.analysis.

Re: How did Lucene clean out the deleted documents from the disk?

2008-07-18 Thread dan at gmail
> Also, over time, as segments that have marked deletions are merged, > the disk space is also reclaimed. Thanks Mike. So can I say that calling optimize() is really optional? Because I was worrying that these deleted documents would never get cleaned if I don't run optimize() and eventually

Bug in CJKTokenizer

2008-07-18 Thread Scott Smith
org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of lucene, so I'm not sure if this is the right place to mention this or not. I was doing some detailed analysis of how this tokenizer worked and noticed the following behavior (which I would classify as a bug). If you

Re: MultiSearcher and TopFieldDocCollector

2008-07-18 Thread Chris Hostetter
: The idea was to override TopFieldDocCollector to do the sorting etc. and only : load the full document for those we need to display. But, I haven't found an : easy way to use TopFieldDocCollector (FieldSortedHitQueue etc.) with : MultiSearcher. I don't understand this statement ... i mean, i ha

Re: How did Lucene clean out the deleted documents from the disk?

2008-07-18 Thread Michael McCandless
Either optimize() or expungeDeletes() will reclaim the disk space used by deleted documents. Also, over time, as segments that have marked deletions are merged, the disk space is also reclaimed. Mike dan at gmail wrote: Hello, Could someone please confirm that calling indexWriter.opt

How did Lucene clean out the deleted documents from the disk?

2008-07-18 Thread dan at gmail
Hello, Could someone please confirm that calling indexWriter.optimize() is the only way to clean out the deleted documents from the disk? I understand that indexWriter.deleteDocuments() does not clean the disk space, and I tested that calling after indexWriter.flush() and indexWriter.close() aft

Re: Sorting case-insensitively

2008-07-18 Thread Chris Hostetter
: > if you could submit a test case that ... : See my e-mail dated July 3, 2008. Sorry: i meant open a bug (in Jira) and submit a JUnit test case. I also meant something even simpler so the lower casing doesn't confuse the issue ie: class IdentitySortComparator extends SortCompar

Re: Scaling

2008-07-18 Thread mark harwood
>>I have no clue how large the impact could be I did do some benchmarking of a scoring scheme based on local idf vs one with visibility of a global idf. Using randomized allocation of documents to shards and sufficient volumes of content in each index, the local idf policy produced identical to

RE: how to statistics categories amount

2008-07-18 Thread Chris Hostetter
: Anyone explain solr's function of facet ,thanks! I gave a talk a few years back which goes into some of the details of doing faceting in Solr. that will give you a starting point, and then looking at the Solr "SimpleFacets" class can fill in the details. http://people.apache.org/~hossma

RE: custom scoring

2008-07-18 Thread Steven A Rowe
Hi Sébastien, Have you looked into the DisjunctionMaxQuery ? From that page: A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum

Re: Scaling

2008-07-18 Thread Jason Rutherglen
RMI search in Lucene uses Searchable.int[] docFreqs(Term[] terms) to obtain the docfreqs for all terms in the query from each server. Which it then turns into a globalized Weight that is submitted to all the Searchables (servers). Look at MultiSearcher. This is fine for most systems even with
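
The aggregation described above, collecting per-server document frequencies and turning them into one global weight, can be sketched roughly as below. This is plain Python rather than the Lucene API: `global_doc_freqs`, `idf`, and the sample shard data are hypothetical, and the idf formula (1 + ln(numDocs / (docFreq + 1))) is assumed here as the classic Lucene-style definition.

```python
import math
from collections import Counter

def global_doc_freqs(per_shard_doc_freqs):
    """Sum each term's document frequency across all shards."""
    total = Counter()
    for shard in per_shard_doc_freqs:
        total.update(shard)
    return total

def idf(doc_freq, num_docs):
    # Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Made-up per-shard statistics for two servers (illustrative only).
shards = [{"lucene": 3, "scaling": 1}, {"lucene": 1, "scaling": 4}]
shard_sizes = [10, 10]

freqs = global_doc_freqs(shards)
# Every shard would score with these same global weights.
weights = {term: idf(df, sum(shard_sizes)) for term, df in freqs.items()}
```

Because each server scores with the same summed statistics, scores from different shards are directly comparable when the results are merged.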

Re: Scaling

2008-07-18 Thread Karl Wettin
18 jul 2008 kl. 09.49 skrev Eric Bowman: One thing I have trouble understanding is how scoring works in this case. Does Lucene really "just work", or are there special things we have to do to make sure that the scores are coherent so we can actually decide which was the best match? What

Re: Interrupting a query

2008-07-18 Thread Grant Ingersoll
True, but I think the approach is similar, in that you need to have the hit collector check to see if your interrupt flag has been set and then exit out. -Grant On Jul 16, 2008, at 9:54 AM, Paul J. Lucas wrote: That has nothing to do with interrupting a query at some arbitrary time. - P
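
The flag-checking approach mentioned above might look something like this. It is a plain-Python sketch rather than Lucene's collector API: `InterruptibleCollector`, `QueryInterrupted`, and `run_search` are hypothetical names, and `run_search` only stands in for the searcher's scoring loop.

```python
import threading

class QueryInterrupted(Exception):
    """Raised from inside collection to abandon a running search."""

class InterruptibleCollector:
    def __init__(self, interrupt_flag: threading.Event):
        self.interrupt_flag = interrupt_flag
        self.hits = []

    def collect(self, doc_id: int, score: float) -> None:
        # Check the flag on every hit; raising unwinds out of the search loop.
        if self.interrupt_flag.is_set():
            raise QueryInterrupted()
        self.hits.append((doc_id, score))

def run_search(matches, collector):
    """Stand-in for the scoring loop; returns True if it ran to completion."""
    try:
        for doc_id, score in matches:
            collector.collect(doc_id, score)
    except QueryInterrupted:
        return False
    return True
```

Another thread would call `interrupt_flag.set()` to stop the search. The flag is only checked when a hit is collected, so a query that goes a long time without matching anything would not notice the interrupt until its next hit.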

Re: Scaling

2008-07-18 Thread Eric Bowman
Jason Rutherglen wrote: The scaling per machine should be linear. The overhead from the network is minimal because the Lucene object sizes are not impacting. Google mentions in one of their early white papers on scaling http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub ind