Re: How can we know if 2 lucene indexes are same?

2008-09-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
The use case is as follows: I have two indexes, one at the master and one at the slave. The user occasionally commits on the master, and the delta is replicated every time. But when the optimize happens, the transfer size can be really large. So I am thinking of doing the optimize separatel

Re: getTimestamp method in IndexCommit

2008-09-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Noble Paul നോബിള്‍ नोब्ळ् wrote: > >> On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless >> <[EMAIL PROTECTED]> wrote: >>> >>> Are you thinking this would just fallback to Directory.fileModified on >>> the >>> segment

Re: search for empty field?

2008-09-03 Thread Chris Hostetter
I don't think << category:* >> does what you think it does. category:[* TO *] will find all docs that have any indexed tokens in the category field, so combining that as a prohibited clause with a mandatory MatchAllDocsQuery will give you all docs that don't have anything indexed in the cate

Re: Lucene Memory Leak

2008-09-03 Thread 장용석
In fact, I think the important reasons are the Directory class and the Analyzer class. If you don't want the IndexSearcher class kept open for the entire life of a web application, you can do that; I think it will not cause a memory leak problem. But the Directory and Analyzer classes can cause the problem if th

Re: Similarity percentage between two Strings

2008-09-03 Thread N. Hira
More details may change my opinion (not quite sure how others feel yet), but with the way you've described it so far, it seems like all you need is a basic string matcher: For every message: - if message.subject is found in the pool, then this message is "similar to" the message in the poo
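The matcher described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `SubjectPool`, the `similarTo` method, and the trim-and-lower-case normalization are hypothetical, and adding an unseen subject to the pool is an assumed reading of the truncated description. It is not Lucene code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SubjectPool {
    // Maps a normalized subject to the index of the first message that used it.
    private final Map<String, Integer> pool = new HashMap<>();

    // Assumption: "same subject" means equal after trimming and lower-casing.
    static String normalize(String subject) {
        return subject.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the index of an earlier message with the same subject,
     * or -1 if the subject is new (in which case it joins the pool).
     */
    public int similarTo(int messageIndex, String subject) {
        String key = normalize(subject);
        Integer earlier = pool.get(key);
        if (earlier != null) {
            return earlier;
        }
        pool.put(key, messageIndex);
        return -1;
    }

    public static void main(String[] args) {
        SubjectPool pool = new SubjectPool();
        String[] subjects = {"Build failed", "Lunch?", " build FAILED "};
        for (int i = 0; i < subjects.length; i++) {
            System.out.println(i + " -> " + pool.similarTo(i, subjects[i]));
        }
    }
}
```

A hash lookup keeps each check constant-time, which is why no search index is needed if exact (or normalized) subject equality is really all that "similar" means.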

Re: Similarity percentage between two Strings

2008-09-03 Thread Thiago Moreira

Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hi Yonik, The SOLR 2 list looks good. The question is, who is going to do the work? I tried to simplify the scope of Ocean as much as possible to make it possible (and slowly at that over time) for me to eventually finish what is mentioned on the wiki. I think SOLR is very cool and was major

Re: Pre-filtering for expensive query

2008-09-03 Thread Matt Ronge
On Sep 3, 2008, at 4:09 PM, Paul Elschot wrote: Op Saturday 30 August 2008 18:22:50 schreef Matt Ronge: On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge: Hi all, I am working on implementing a new Query, Weight and Scorer that is expens

Re: Pre-filtering for expensive query

2008-09-03 Thread Paul Elschot
Op Saturday 30 August 2008 18:22:50 schreef Matt Ronge: > On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: > > Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge: > >> Hi all, > >> > >> I am working on implementing a new Query, Weight and Scorer that > >> is expensive to run. I'd like to limit

Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Yonik Seeley
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > I am wondering > if there are social networks (or anyone else) out there who would be > interested in collaborating with Apache on realtime search to get it > to the point it can be used in production. Good timing Jason,

Re: Similarity percentage between two Strings

2008-09-03 Thread N. Hira
I don't know how much of this is a Lucene problem, but -- as I'm sure you will inevitably hear from others on the list -- it depends on what your definition of "similar" is. By similar, do you mean: 1. Identical, except for variations in case (upper/lower) 2. Allow 1., but also allow prefix

Similarity percentage between two Strings

2008-09-03 Thread Thiago Moreira

Re: search for empty field?

2008-09-03 Thread Chris Lu
I was kind of waiting for a more efficient solution based on TermDocs/TermEnum, but I feel since the term is not there at all, the only thing we can do is to do some deduction. I can copy the bitmap of all the deleted docs, and go through all the TermDocs/TermEnum, and set the bit if there is a ter
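The deduction described above can be modeled with a plain `java.util.BitSet`. This is a sketch, not Lucene code: the postings map stands in for what a TermDocs/TermEnum walk over one field would yield, and the class and method names are hypothetical. It assumes every document id is below `maxDoc`.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class EmptyFieldScan {
    /**
     * @param maxDoc   number of document ids, numbered 0..maxDoc-1
     * @param deleted  bits set for deleted documents
     * @param postings for a single field: term -> ids of documents containing it
     * @return bits set for live documents that have no term in the field
     */
    public static BitSet docsWithoutField(int maxDoc, BitSet deleted,
                                          Map<String, int[]> postings) {
        // Start from a copy of the deleted-docs bitmap so deleted docs count as "seen".
        BitSet seen = (BitSet) deleted.clone();
        // Mark every document that appears under any term of the field.
        for (int[] docIds : postings.values()) {
            for (int doc : docIds) {
                seen.set(doc);
            }
        }
        // Whatever was never marked is a live document with an empty field.
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);
        result.andNot(seen);
        return result;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(3);
        Map<String, int[]> postings = new HashMap<>();
        postings.put("books", new int[] {0, 2});
        postings.put("music", new int[] {2, 4});
        System.out.println(docsWithoutField(5, deleted, postings));
    }
}
```

The cost is one pass over every posting of the field, so this suits an occasional validation run rather than a per-request query.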

Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hello all, I don't mean this to sound like a solicitation. I've been working on realtime search and created some Lucene patches etc. I am wondering if there are social networks (or anyone else) out there who would be interested in collaborating with Apache on realtime search to get it to the poi

Re: search for empty field?

2008-09-03 Thread Erick Erickson
Oh.. I wonder if TermDocs/TermEnum would work for you instead. Would it work to just create a document validator at index time that threw an exception if all required fields weren't present? Or is that outside your control? Best Erick On Wed, Sep 3, 2008 at 3:11 PM, Chris Lu <[EMAIL PROTECTE

Re: search for empty field?

2008-09-03 Thread Chris Lu
Thanks Erick for reminding me of this! I only need to validate an index and make sure the content is correctly retrieved and the index doesn't have empty fields. So I'd better simply go through all documents by id and check them directly. Thanks! -- Chris Lu - Instant Scalable

Re: search for empty field?

2008-09-03 Thread Erick Erickson
This has been discussed multiple times, so looking at the searchable archive will give you more detailed info. But as I remember, the consensus suggestion was to index some "impossible" value for those documents that lack a field. For instance, say your field was "sometimes". A document that had no
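The "impossible value" suggestion can be illustrated with a toy inverted index in plain Java. Everything here is an assumption made for illustration: the placeholder string `__missing__`, the class and method names, and treating each field value as a single term. A real index would pass the value through an analyzer, so the placeholder must survive analysis unchanged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MissingFieldSentinel {
    // Hypothetical placeholder; choose something that can never occur as real data.
    static final String MISSING = "__missing__";

    // Index-time step: substitute the placeholder when the field is absent or blank.
    public static String valueToIndex(Map<String, String> doc, String field) {
        String value = doc.get(field);
        return (value == null || value.trim().isEmpty()) ? MISSING : value;
    }

    // Builds a toy inverted index for one field: indexed value -> document ids.
    public static Map<String, List<Integer>> buildIndex(List<Map<String, String>> docs,
                                                        String field) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            String term = valueToIndex(docs.get(id), field);
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
        }
        return index;
    }

    // Search-time step: an ordinary term lookup on the placeholder finds the gaps.
    public static List<Integer> docsLacking(List<Map<String, String>> docs, String field) {
        List<Integer> hits = buildIndex(docs, field).get(MISSING);
        return hits == null ? Collections.<Integer>emptyList() : hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> withField = new HashMap<>();
        withField.put("sometimes", "present");
        docs.add(withField);
        docs.add(new HashMap<String, String>());
        System.out.println(docsLacking(docs, "sometimes"));
    }
}
```

The trade-off is that the decision is made at index time: documents indexed before the placeholder was introduced need re-indexing before the lookup is reliable.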

Re: concise definition of Lucene score?

2008-09-03 Thread Chris Hostetter
: I have attempted to find a concise definition of how the Lucene score is : calculated, something that can be understood by most people. The answer tends to vary based on exactly what type of query you are talking about ... TermQuery? PhraseQuery? BooleanQuery containing a mix? I'm going to

Re: Lucene Memory Leak

2008-09-03 Thread Simon Willnauer
If you are looking for reasonable performance, you should not close your IndexSearcher unless necessary. It is actually best practice to leave an IndexSearcher instance open and even share it between the threads / requests of your web application. The searcher will not pollute your memory. Just keep the

Re: Pre-filtering for expensive query

2008-09-03 Thread Paul Elschot
Op Wednesday 03 September 2008 18:06:57 schreef Matt Ronge: > On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote: > > Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge: > >> On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: > >>> Can you tell us a bit more about what your custom query does? > >>> Pe

Re: Lucene Memory Leak

2008-09-03 Thread Andy33
I took your advice and created Singletons for the Directory, Analyzer, and IndexSearcher classes. I also undid the closing of the Directory and IndexSearcher. This seemed to fix my memory leak problem. However, I don't like the fact that I am leaving open the IndexSearcher for the entire life of a

Re: Pre-filtering for expensive query

2008-09-03 Thread Grant Ingersoll
On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote: Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean

Re: Pre-filtering for expensive query

2008-09-03 Thread Matt Ronge
On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote: Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge: On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: Can you tell us a bit more about what your custom query does? Perhaps you can build the "candidate filter" and reuse it over and over again?

Re: concise definition of Lucene score?

2008-09-03 Thread Grant Ingersoll
What's not concise about a complex math formula? :-) The basic Term Vector approach to IR, that Lucene more or less implements, says that the score for a document given a query is the cosine of the angle formed between the query vector and the document vector. I like to draw a standard x
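The cosine idea mentioned above can be worked through in a few lines of plain Java. This is a sketch of the textbook vector-space cosine only, with hypothetical names and made-up term weights. It is not Lucene's scoring formula, which the message says is implemented only "more or less" along these lines.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineScore {
    // Dot product over the terms the two sparse vectors share.
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double weight = b.get(entry.getKey());
            if (weight != null) {
                sum += entry.getValue() * weight;
            }
        }
        return sum;
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine of the angle between query and document; 0 if either vector is empty.
    public static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double denominator = norm(query) * norm(doc);
        return denominator == 0.0 ? 0.0 : dot(query, doc) / denominator;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("lucene", 1.0);
        query.put("score", 1.0);

        Map<String, Double> doc = new HashMap<>();
        doc.put("lucene", 2.0);
        doc.put("index", 1.0);

        System.out.println(cosine(query, doc));
    }
}
```

Because the cosine depends only on direction, a document that repeats the query's terms in the same proportions scores close to 1 regardless of its length, while a document sharing no terms scores 0.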

concise definition of Lucene score?

2008-09-03 Thread Jon Loken
Hi all, I have attempted to find a concise definition of how the Lucene score is calculated, something that can be understood by most people. The information I found is accurate, but not particularly concise. http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/se

Re: getTimestamp method in IndexCommit

2008-09-03 Thread Michael McCandless
Noble Paul നോബിള്‍ नोब्ळ् wrote: On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Are you thinking this would just fallback to Directory.fileModified on the segments_N file for that commit? You could actually do that without any API change, because IndexComm

Re: getTimestamp method in IndexCommit

2008-09-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Tue, Sep 2, 2008 at 1:56 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Are you thinking this would just fallback to Directory.fileModified on the > segments_N file for that commit? > > You could actually do that without any API change, because IndexCommit > exposes a getSegmentsFileName(