Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Anshum
Hi Glen, As far as stats for index/search are concerned, here they are: * Yes, it is a web based application * I am currently facing issues when the number of concurrent searches goes high. The search is not able to handle over 2.5 searches per second. * JVM command line parameters: -server mode;

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
Smart idea, but it won't help me. I have almost 50 categories and eventually I would like to "filter" not just on category but maybe also on language, etc. Karl: what do you mean by measure the distance between the term vectors and cluster them in real time? On Tue, Apr 22, 2008 at 7:39 PM, Glen N

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Glen Newton
Sorry, I misunderstood the problem. My mistake. While not optimal and rather expensive space-wise, you could have - in addition to existing keyword field - a field for each category. If the document being indexed is in category A, only add the text to the catA field. Now do MoreLikeThis on catA.
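The per-category field works because term statistics are then gathered only from documents in that category. Below is a toy sketch of that effect in plain Java. It does not use the Lucene API; `CategoryTermStats`, `Doc`, and `docFreq` are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CategoryTermStats {
    /** Hypothetical toy document: a category label plus whitespace-separated text. */
    static final class Doc {
        final String category;
        final String text;
        Doc(String category, String text) { this.category = category; this.text = text; }
    }

    /** Document frequency per term, counting only docs in the given category (null means all docs). */
    static Map<String, Integer> docFreq(List<Doc> docs, String category) {
        Map<String, Integer> df = new HashMap<>();
        for (Doc d : docs) {
            if (category != null && !category.equals(d.category)) continue;
            Set<String> seen = new HashSet<>(); // count each term once per document
            for (String term : d.text.toLowerCase().split("\\s+")) {
                if (!term.isEmpty() && seen.add(term)) df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }
}
```

A term that is common across the whole collection can be rare inside one category, so its weight as an "interesting" term changes once the counts are restricted to that category.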

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
I could have up to 2 million documents and growing. On Tue, Apr 22, 2008 at 7:29 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > Jonathan Ariel wrote: > > Is there any way to execute a MoreLikeThis over a subset of documents? I > > need to retrieve a set of interesting keywords from a subset of > >

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Karl Wettin
Jonathan Ariel wrote: Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categ

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
But that doesn't help me with my problem, because the interesting terms are taken from the entire index and not a subset as I need. On Tue, Apr 22, 2008 at 6:46 PM, Glen Newton <[EMAIL PROTECTED]> wrote: > Instead of this: > > MoreLikeThis mlt = new MoreLikeThis(ir); > Reader target = ... // orig

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
That's an excellent idea. I would certainly use such an improved MultiSearcher. You should submit a patch. -Original Message- From: Glen Newton [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 22, 2008 10:50 AM To: java-user@lucene.apache.org Subject: Re: Binding lucene instance/threads

Re: MoreLikeThis over a subset of documents

2008-04-22 Thread Glen Newton
Instead of this:
    MoreLikeThis mlt = new MoreLikeThis(ir);
    Reader target = ... // orig source of doc you want to find similarities to
    Query query = mlt.like(target);
    Hits hits = is.search(query);
do this:
    MoreLikeThis mlt = new MoreLikeThis(ir);
    Reader target = ... // orig source of doc you want

MoreLikeThis over a subset of documents

2008-04-22 Thread Jonathan Ariel
Is there any way to execute a MoreLikeThis over a subset of documents? I need to retrieve a set of interesting keywords from a subset of documents and not the entire index (imagine that my index has documents categorized as A, B and C and I just want to work with those categorized as A). Right now

Re: Lucene standard analyzer internationalization

2008-04-22 Thread Chris Hostetter
: Yes, the versions of Lucene and Java are exactly the same on the different : machines. : In fact, we unjarred Lucene and jarred it with our jar and are running from the : same NFS mounts on both the machines I didn't do an in-depth code read, but a quick skim of StandardTokenizerImpl didn't turn up a

RE: Lucene standard analyzer internationalization

2008-04-22 Thread Steven A Rowe
Hi Prashant, What is the Unicode code point associated with the 3,4,5 character? Steve On 04/22/2008 at 4:45 PM, Prashant Malik wrote: > Yes, the versions of Lucene and Java are exactly the same on > the different > machines. > In fact, we unjarred Lucene and jarred it with our jar and are > running f
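One way to answer the code point question is to dump the code points of the string as the JVM actually sees it on each machine. The sketch below is a minimal illustration; `CodePointDump` and its methods are hypothetical names. The second method mimics a charset mismatch (UTF-8 bytes decoded as ISO-8859-1), which is only one possible cause and is an assumption here, not something the thread confirms.

```java
import java.nio.charset.StandardCharsets;

class CodePointDump {
    /** Returns each code point of the input formatted as U+XXXX, separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Decodes the UTF-8 bytes of the input as ISO-8859-1, mimicking a charset mismatch. */
    static String misdecodeUtf8AsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }
}
```

For the five-character string `"C\u00E9sar"`, `dump` returns `U+0043 U+00E9 U+0073 U+0061 U+0072`. After `misdecodeUtf8AsLatin1` the same text has six characters, `U+0043 U+00C3 U+00A9 U+0073 U+0061 U+0072`, so a tokenizer would receive different input on a machine that decodes the bytes this way.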

Re: Lucene standard analyzer internationalization

2008-04-22 Thread Prashant Malik
Yes, the versions of Lucene and Java are exactly the same on the different machines. In fact, we unjarred Lucene and jarred it with our jar and are running from the same NFS mounts on both the machines. We have also tried with Lucene 2.2.0 and 2.3.1, with the same result. Also, about the actual string u

RE: Lucene standard analyzer internationalization

2008-04-22 Thread Steven A Rowe
Hi Prashant, On 04/22/2008 at 2:23 PM, Prashant Malik wrote: > We have been observing the following problem while > tokenizing using Lucene's StandardAnalyzer. The tokens that we get are > different on different machines. I suspect it has something to do > with the Locale settings on individu

Lucene standard analyzer internationalization

2008-04-22 Thread Prashant Malik
Hi, We have been observing the following problem while tokenizing using Lucene's StandardAnalyzer. The tokens that we get are different on different machines. I suspect it has something to do with the Locale settings on individual machines. For example, the word 'César' is split as 'CÃ

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Glen Newton
So even if you only have one index, this is the way to go to manage this kind of problem. Looking at the implementation and having used ThreadPoolExecutor (TPE) a lot, I would make the following suggestions for this class so as to better support this particular use case: Better access to the confi

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
> one solution is to set up a ThreadPoolExecutor[2] with a fixed > number of threads and a limited queue size (use a bounded BlockingQueue[3]) Yes, this is precisely how the ConcurrentMultiSearcher works. https://issues.apache.org/jira/browse/LUCENE-423 -Original Message- From: Glen New
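The quoted suggestion can be sketched with the standard `java.util.concurrent` classes. This is a minimal illustration, not the ConcurrentMultiSearcher code from LUCENE-423; the class name `BoundedSearchPool` and the choice of `AbortPolicy` for a full queue are assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: a fixed-size pool with a bounded queue for search requests. */
class BoundedSearchPool {
    private final ThreadPoolExecutor executor;

    BoundedSearchPool(int threads, int queueCapacity) {
        // core == max gives a fixed number of threads; the ArrayBlockingQueue caps waiting work.
        // AbortPolicy throws RejectedExecutionException once threads and queue are both full.
        this.executor = new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Submits a search task; throws RejectedExecutionException when the pool is saturated. */
    <T> Future<T> submit(Callable<T> search) {
        return executor.submit(search);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Rejecting work when the queue is full keeps latency bounded under load; a caller could instead use `ThreadPoolExecutor.CallerRunsPolicy` to slow producers down rather than fail requests.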

Re: FW: Re: Occasional Hang in IndexWriter.close()

2008-04-22 Thread Stu Hood
Hey Mike, Thank you very much for looking into this issue! I originally switched to the SerialMergeScheduler to try and work around this bug: http://lucene.markmail.org/message/awkkunr7j24nh4qj . I switched back to the ConcurrentMergeScheduler yesterday (since I would rather fail quickly due t

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
Anshum: Have you looked into the ConcurrentMultiSearcher? It would have you split your index into N sub-indices, and search each with a dedicated thread. --Renaud -Original Message- From: Anshum [mailto:[EMAIL PROTECTED] Sent: Monday, April 21, 2008 9:10 PM To: java-user@lucene.apache
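The idea of searching N sub-indices in parallel can be sketched as follows. This is not the ConcurrentMultiSearcher implementation; `ScoredHit`, `SubIndexSearcher`, and `ParallelSubIndexSearch` are hypothetical stand-ins, and the merge assumes scores from different sub-indices are directly comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a scored search result. */
class ScoredHit {
    final String docId;
    final double score;
    ScoredHit(String docId, double score) { this.docId = docId; this.score = score; }
}

/** Hypothetical stand-in for a searcher over one sub-index. */
interface SubIndexSearcher {
    List<ScoredHit> search(String query) throws Exception;
}

class ParallelSubIndexSearch {
    /** Searches each sub-index on its own thread and returns the topN hits by descending score. */
    static List<ScoredHit> search(List<SubIndexSearcher> subIndices, String query, int topN)
            throws Exception {
        // One thread per sub-index; at least one thread so an empty list does not fail.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subIndices.size()));
        try {
            List<Callable<List<ScoredHit>>> tasks = new ArrayList<>();
            for (SubIndexSearcher s : subIndices) {
                tasks.add(() -> s.search(query));
            }
            List<ScoredHit> merged = new ArrayList<>();
            // invokeAll blocks until every sub-index search has finished.
            for (Future<List<ScoredHit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble((ScoredHit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

Creating a pool per query keeps the sketch short; a long-running service would more likely share one pool across requests.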

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Glen Newton
Anshum, I think I am dealing with an index of similar scale: 6.4 million records, 83 GB index (see [1] for more info). I mistakenly thought from your original posting that you were interested in binding threads to processors for indexing, but it is sounding like you want to do this for searching.

Re: FW: Re: Occasional Hang in IndexWriter.close()

2008-04-22 Thread Michael McCandless
The hang also only happens if you are using SerialMergeScheduler. Stu, one question: was there an interesting reason why you switched back to SerialMergeScheduler? Did you hit an issue with ConcurrentMergeScheduler? Mike Stu Hood <[EMAIL PROTECTED]> wrote: > Hey gang, > > The finally block was

Re: how to query against payload

2008-04-22 Thread Grant Ingersoll
Hmmm, sounds like you need a new Query. I _think_ it could be something as simple as MultiplicativeTermQuery or something like that whereby instead of adding the score of the payload callback, you would multiply. That way, if the document with the term does not have the payload of intere

Re: FW: Re: Occasional Hang in IndexWriter.close()

2008-04-22 Thread Michael McCandless
OK this output was very helpful, thanks! I think I see what's happening here. Basically a merge can sneak in when Lucene doesn't expect it to (on copying a single external segment over), and as a result it never gets scheduled. This happens only with addIndexesNoOptimize, when the index you addi

RE: How to Retrieve Found Term?

2008-04-22 Thread Edwin Lee
Hi Karl, Thanks for the suggestions; I would be glad to contribute back to the project. I'm not too familiar with the inner workings of Lucene though; how does such functionality fit into a Query implementation? My naive interpretation, when I first got hold of Lucene, is that Query is wha

Re: How to Retrieve Found Term?

2008-04-22 Thread Karl Wettin
I can think of two ways to get your hands on this information, the simplest one being to create a filter with the documents that matched your original query and then place new queries on the index with slop, non-slop, etc. to find out what's what. This will of course be very expensive and is thus onl

Re: Need additional info for Field (hoping friends who can read Chinese can help me with ideas)

2008-04-22 Thread Cedric Ho
In that case you may want to index each: Field("Sub","下午去开会","01:02:02"); as a separate document. So your document contains 3 fields 1. title 2. time 3. sub then you can get both title and time by searching the "sub" field. Cedric 2008/4/22 王建新 <[EMAIL PROTECTED]>: > > Thank you. I only search on sub, not on time; when searching s