Frequency of phrase

2006-02-23 Thread Eric Jain
This is somewhat related to a question sent to this list a while ago: Is there an efficient way to count the number of occurrences of a phrase (not term) in an index?

Re: How can I get a term's frequency?

2006-02-23 Thread Grant Ingersoll
You need to make sure you are indexing with Term Vectors in order for IndexReader.getTermFreqVector to return anything meaningful. You do not need to implement it. QueryTermVector is meant to provide, for queries, information similar to what term vectors provide for documents. For an example demo of indexing and using t
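
To make the idea of a per-document term vector concrete, here is a minimal sketch in plain Java. It does not use Lucene; the class name TermVectorSketch and the whitespace tokenization are assumptions made for illustration only. It shows the kind of information a term vector holds: each distinct term of one document together with its frequency in that document.

```java
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a per-document term vector; not part of Lucene.
final class TermVectorSketch {

    // Returns each distinct lowercased, whitespace-separated token of one
    // document's text, mapped to the number of times it occurs, in sorted order.
    static SortedMap<String, Integer> termFrequencies(String text) {
        SortedMap<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {          // split() can yield a leading empty string
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }
}
```

A real index stores this per field at indexing time, which is why the field has to be indexed with term vectors enabled before the reader can return anything.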

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Hi Otis, The Lucene server is actually CPU and network bound, as the index gets memory mapped pretty quickly. There is little disk activity observed. I was also able to run the server on a Sun box last night with 4 dual core opterons (same Linux and JVM) and I'm observing query rates of 400 qps!

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Raghavendra Prabhu
Can Nutch be made to use the Lucene query parser? Rgds Prabhu On 2/23/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > > Hi Otis, > > The Lucene server is actually CPU and network bound, as the index gets > memory mapped pretty quickly. There is little disk activity observed. > > I was also able to run

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Dan Armbrust
I would give the IBM or Blackdown JVM a try on Linux - I've seen pretty wide variance in their speed on different operations. Sometimes better than Sun, sometimes worse - it depended on the task (I did some ad hoc tests at one point that showed Sun was faster for indexing, but IBM was faster fo

SQL DISTINCT functionality in Lucene

2006-02-23 Thread Hugh Ross
Hi, I need to find all distinct values for a keyword field in a Lucene index. Is this easily done? If so how? Many thanks, Hugh

Hierarchical Navigation in Lucene

2006-02-23 Thread Hugh Ross
Hi, We have a custom built document repository which is searchable / indexed via Lucene. I want to put together some kind of navigation framework based on the repository metadata (which is also indexed with Lucene). Is there a best-practice way to do this? Thanks, Hugh

Re: SQL DISTINCT functionality in Lucene

2006-02-23 Thread Michael D. Curtin
Hugh Ross wrote: I need to find all distinct values for a keyword field in a Lucene index. I think the IndexReader.terms() method will do what you want. Good luck! --MDC
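
The reason a term enumeration answers the DISTINCT question is that an index's term dictionary already holds each (field, text) pair exactly once, in sorted order. The sketch below models that in plain Java; TermDictionarySketch and its methods are hypothetical names, not Lucene's API. With Lucene itself one would presumably seek the enumeration to the first term of the field and stop at the first term belonging to a different field, which is the same walk shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of a sorted term dictionary; not part of Lucene.
final class TermDictionarySketch {

    // Separator that sorts before every printable character, so all terms
    // of one field are contiguous and ordered by their text.
    private static final char SEP = '\0';

    private final TreeSet<String> terms = new TreeSet<>();

    // Adding the same (field, text) pair twice keeps a single entry.
    void add(String field, String text) {
        terms.add(field + SEP + text);
    }

    // Seeks to the first term of the field and walks forward until the
    // field changes, collecting each distinct value once.
    List<String> distinctValues(String field) {
        String start = field + SEP;
        List<String> values = new ArrayList<>();
        for (String term : terms.tailSet(start)) {
            if (!term.startsWith(start)) {
                break;                      // reached the next field
            }
            values.add(term.substring(start.length()));
        }
        return values;
    }
}
```

Because the walk touches only the terms of one field, its cost depends on the number of distinct values rather than on the number of documents.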

RE: SQL DISTINCT functionality in Lucene

2006-02-23 Thread Hugh Ross
Many Thanks. Hugh -Original Message- From: Michael D. Curtin [mailto:[EMAIL PROTECTED] Sent: 23 February 2006 17:39 To: java-user@lucene.apache.org Subject: Re: SQL DISTINCT functionality in Lucene Hugh Ross wrote: > I need to find all distinct values for a keyword field in a Lucene i

RE: search a subdirectory (New to Lucene)

2006-02-23 Thread John Hamilton
I reindexed with the path as a keyword field and now the PrefixQuery filter does exactly what I need. Thanks! I'm going to hold off on the paragraph-level indexing for now, but that does sound interesting. many thanks, John -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECT
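
Why indexing the path as a keyword makes prefix matching work can be sketched without Lucene: a keyword field keeps the whole path as one untokenized term, so "everything under a directory" becomes "every term that starts with this prefix". The class and method names below are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of prefix matching over untokenized path terms.
final class PathPrefixSketch {

    // Returns, in sorted order, every path that starts with the given prefix.
    static List<String> underPrefix(Collection<String> paths, String prefix) {
        TreeSet<String> sorted = new TreeSet<>(paths);
        List<String> matches = new ArrayList<>();
        // Matching terms are contiguous in sorted order, so stop at the first miss.
        for (String path : sorted.tailSet(prefix)) {
            if (!path.startsWith(prefix)) {
                break;
            }
            matches.add(path);
        }
        return matches;
    }
}
```

If the path had been tokenized into separate words instead, no single term would begin with the directory prefix and the same lookup would find nothing.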

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Otis Gospodnetic
Hi, Please ask on the Nutch mailing list (I answered your question in general@ already). Also, please don't steal other people's threads - it's considered impolite for obvious reasons. Otis - Original Message From: Raghavendra Prabhu <[EMAIL PROTECTED]> To: java-user@lucene.apache.or

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Raghavendra Prabhu
Hi, Sorry for the trouble. I was sending my first mail to the group and replied to this thread, and then later on sent a direct mail. I would like to apologise for the inconvenience caused. Rgds Prabhu On 2/23/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > Hi, > > Please ask on the Nutch m

about Filttering

2006-02-23 Thread Daniel Cortes
Hi luceners, I have a problem and I don't know what to do. I want to use the ISOLatin1AccentFilter that I found in the Lucene trunk. The code in my analyzer is: public final TokenStream tokenStream(String fieldName, Reader reader) { if (fieldName == null) throw new IllegalArgumentException("fiel
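
For readers unfamiliar with accent folding, the sketch below shows the general effect using only the JDK. It is a different technique from the filter named above (Unicode decomposition followed by removal of combining marks), and the class name AccentFoldingSketch is an assumption. Characters that have no canonical decomposition, such as ø or ß, pass through unchanged, so treat this as an approximation rather than an equivalent.

```java
import java.text.Normalizer;

// Hypothetical JDK-only illustration of accent folding; not the Lucene filter.
final class AccentFoldingSketch {

    // Decomposes accented characters into base letter plus combining marks,
    // then drops the combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Whatever folding is used, it needs to be applied both when indexing and when analyzing queries; otherwise the folded terms in the index will not match unfolded query terms.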

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
We discovered that the kernel was only using 8 CPUs. After recompiling for 16 (8+hyperthreads), it looks like the query rate will settle in around 280-300 qps. Much better, although still quite a bit slower than the opteron. Peter On 2/22/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > Hmmm, n

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Chris Lamprecht
Peter, Have you given JRockit JVM a try? I've seen it help throughput compared to Sun's JVM on a dual xeon/linux machine, especially with concurrency (up to 6 concurrent searches happening). I'm curious to see if it makes a difference for you. -chris On 2/23/06, Peter Keegan <[EMAIL PROTECTED]>

Re: about Filttering

2006-02-23 Thread Erik Hatcher
On Feb 23, 2006, at 1:22 PM, Daniel Cortes wrote: Hi luceners, I have a problem that I don't know what to do. I want to use ISOLatin1AccentFilter that I found In lucene trunks The code in my analyzer is: public final TokenStream tokenStream(String fieldName, Reader reader) { if (fie

Re: Hierarchical Navigation in Lucene

2006-02-23 Thread Erik Hatcher
On Feb 23, 2006, at 12:37 PM, Hugh Ross wrote: Hi, We have a custom built document repository which is searchable / indexed via lucene. I want to put together some kind of navigation framework based on the repository metadata (which is also indexed with lucene). Is there a best-practice

Re: Searching/sorting strategy for many properties for semantic web app

2006-02-23 Thread Erik Hatcher
On Feb 22, 2006, at 9:01 PM, David Pratt wrote: Hi Erik. Many thanks for your reply. I'll likely see if I can find a list to pose a couple of questions their way. I am having fun with Lucene since it is new to me and I am impressed with the speed I am getting. I am reading anything I can ge

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Chris, I tried JRockit a while back on 8-cpu/windows and it was slower than Sun's. Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then. Thanks, Peter On 2/23/06, Chris Lamprecht <[EMAIL PROTECT

Getting no hits ...

2006-02-23 Thread Mufaddal Khumri
I have been trying to figure out why my query below would not return any hits. I use two custom analyzers for indexing and searching. The one I use for indexing uses this: public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = new StandardTokenizer

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Yonik Seeley
Wow, some resources! Would it be cheaper / more scalable to copy the index to multiple boxes and loadbalance requests across them? -Yonik On 2/23/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next > (32 with hyperthreading), o

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Yonik, We're investigating both approaches. Yes, the resources (and permutations) are dizzying! Peter On 2/23/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > Wow, some resources! > Would it be cheaper / more scalable to copy the index to multiple > boxes and loadbalance requests across them? > >

Re: Getting no hits ...

2006-02-23 Thread Chris Hostetter
1) Have you looked at what tokens your indexing analyzer produces when you tokenize "ES-20D" ? 2) Have you looked at what tokens your query analyzer produces when you tokenize "ES-20D" ? 3) Have you tried a simpler query (ie: just "content:es\-20d" ) ? 4) When giving QueryParser a (quoted) p
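
The first two questions above point at a common cause of empty result sets: the index-time and query-time analyzers emit different tokens for the same text. The sketch below illustrates this with two toy tokenizers in plain Java. Both are assumptions for illustration and are not the poster's analyzers or any Lucene class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical tokenizers showing how two analysis chains can disagree.
final class AnalyzerMismatchSketch {

    // Lowercases and splits on whitespace only, so "ES-20D" stays one token.
    static List<String> whitespaceTokens(String text) {
        return split(text, "\\s+");
    }

    // Lowercases and splits on anything that is not a letter or digit,
    // so "ES-20D" becomes two tokens.
    static List<String> alphanumericTokens(String text) {
        return split(text, "[^\\p{L}\\p{N}]+");
    }

    private static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split(regex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

If the index was built with one of these and the query is analyzed with the other, a term query for "es-20d" looks for a term that was never written, or the reverse. Printing the tokens from both sides for the same input is usually the quickest way to confirm or rule this out.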

Re: Getting no hits ...

2006-02-23 Thread Mufaddal Khumri
In my earlier email I put in the wrong query that I am searching on. The correct query is: EOS-20D And this is the query in question that is still producing no hits: +(+content:eos\-20d) +entity:product +(title:"eos\-20d"~2^40.0 ((title:eos\-20d)^10.0) content:"eos\-20d"~2^20.0 (content:eos

Re: ArrayIndexOutOfBounds being thrown ...

2006-02-23 Thread Stephen Gray
Hi everyone, Sorry for not replying to original post (from Mufaddal Khumri, 22/2) - I'm new to the list. I also had this problem, but it seems not to be in the source - downloading and building the 1.9-rc1 source fixed the problem for me. Steve Stephen Gray Archive Research Officer Austral

Re: Getting no hits ...

2006-02-23 Thread Mufaddal Khumri
Follow up on my previous email ... When I execute this query in Luke with the standard analyzer on the same index, I get 8 hits. +(+content:eos\-20d) +entity:product +(title:"eos\-20d"~2^40.0 ((title:eos\-20d)^10.0) content:"eos\-20d"~2^20.0 (content:eos\-20d) categoryName:"eos\-20d"^80.0)

phrase frequency??

2006-02-23 Thread sog
I searched the mail archive for my question and found that what I really want is a phrase frequency. It is an old question which was not solved well. I traced the Lucene source code and discovered that I can get a phrase's IDF from the Hits object weight= PhraseQuery$PhraseWeight (id=62) idf= 8.

Re: Searching/sorting strategy for many properties for semantic web app

2006-02-23 Thread David Pratt
Thanks Erik. I am continuing to experiment and making good progress. I have got my basic functionality established and am now looking at sorting and ranking. I guess the good thing is I can adjust and modify things as I learn more. I am reading some archived material from the list as well to g

Re: Frequency of phrase

2006-02-23 Thread Dave Kor
Not sure if this is what you want, but what I have done is to issue exact phrase queries to Lucene and counted the number of hits found. On 2/23/06, Eric Jain <[EMAIL PROTECTED]> wrote: > This is somewhat related to a question sent to this list a while ago: Is > there an efficient way to count the
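
Counting the hits of an exact phrase query gives the number of documents that contain the phrase, which can differ from the number of times the phrase occurs. The plain-Java sketch below makes the distinction explicit over already-tokenized documents. PhraseCountSketch and its methods are hypothetical names, and the linear scan is for illustration, not an efficient index-based method.

```java
import java.util.List;

// Hypothetical illustration of document count versus occurrence count.
final class PhraseCountSketch {

    // Counts positions in one token list where the phrase occurs as
    // consecutive tokens.
    static int occurrences(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents containing the phrase at least once
    // (what counting the hits of a phrase query corresponds to).
    static int matchingDocuments(List<List<String>> docs, List<String> phrase) {
        int hits = 0;
        for (List<String> doc : docs) {
            if (occurrences(doc, phrase) > 0) {
                hits++;
            }
        }
        return hits;
    }

    // Total number of occurrences across all documents.
    static int totalOccurrences(List<List<String>> docs, List<String> phrase) {
        int total = 0;
        for (List<String> doc : docs) {
            total += occurrences(doc, phrase);
        }
        return total;
    }
}
```

When a document can contain the phrase more than once, the two numbers diverge, so the hit count is only a lower bound on the occurrence count.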