Re: How to do refined search based on attributes and never return zero results

2005-12-07 Thread Jeff Rodenburg
Check out Chris Hostetter's methodology for doing this at cnet. http://mail-archives.apache.org/mod_mbox/lucene-java-user/200508.mbox/[EMAIL PROTECTED] This sounds like it matches your requirements. cheers, j On 12/7/05, Ching-Pei Hsing <[EMAIL PROTECTED]> wrote: > > Has anyone solved the foll

Re: words with more than 1 hyphen ?

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 9:08 PM, Beady Geraghty wrote: In general, do the rules in JavaCC work pretty well? In general, all answers would be too general to be useful :) JavaCC is great - I'm using it for a custom query parser myself. But it's not for the faint of heart. It may be more than you

Re: words with more than 1 hyphen ?

2005-12-07 Thread Beady Geraghty
Hi Erik, Thank you so much for pointing out the error :-) It should have been | )+"-"()+("-"()+)*> I missed a pair of brackets for the 3rd LETTER (and a +) I wonder how my indexer and query parser worked before, but not the token stream. Anyhow, it seems to work with both indexing/query parsin
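The token names in the grammar rule quoted above were lost in the archive, so the exact JavaCC production cannot be recovered here. As a rough illustration of the pattern being described (letters, a hyphen, letters, then any number of further hyphen-plus-letters groups), here is a minimal sketch using java.util.regex rather than JavaCC. The class and method names are hypothetical, and this is not the StandardTokenizer.jj change itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex stand-in for the hyphenated-word rule, not JavaCC.
final class HyphenatedWordSketch {
    // One or more letters, followed by one or more "-letters" groups,
    // so words with two or more hyphens match as a single token.
    private static final Pattern HYPHENATED =
            Pattern.compile("\\p{L}+(?:-\\p{L}+)+");

    // Returns every hyphenated word found in the text, in order.
    static List<String> hyphenatedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = HYPHENATED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

The key detail is that each hyphen must be followed by at least one letter inside the repeated group; a rule that repeats only the hyphen would not accept a word such as "state-of-the-art".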

How to do refined search based on attributes and never return zero results

2005-12-07 Thread Ching-Pei Hsing
Has anyone solved the following problem, or have good suggestions? Each document is assigned to one or more category nodes in a hierarchy. For example, Document1: /Computer/Desktop, Document2: /Computer/Notebook; /Salesforce/ExtremePortable Document3: /Computer/Server .. For eac

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
On Wednesday 07 Dec 2005 22:23, Chris Hostetter wrote: ... > -- the real issue is that your query should match a certain set of > documents, if there is a document you've added to the index that you > expect to see in that result but isn't there, then use Luke or > something like it to verify: >

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Chris Hostetter
: Unfortunately, my current test is a lot more convoluted than that (because : things are in layers). I will try and break it down like you have done into : a flat form and see where I get to, but it's going to take a little time to do. : : I think one of the things I am doing is inserting and dele

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
On Wednesday 07 Dec 2005 21:33, Chris Hostetter wrote: > > Just doing a cut/paste inline is fine (the mailing list software doesn't > like most attachments). Here's an example of what you're talking about > that seems to work just fine for me... > > > public void sampleTest() throws Exception

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Chris Hostetter
: -ID:0 Category:Category1 Category:Category2 : : What I hope this says is : : "Give me all documents whose ID is not "0" AND which have a Category Field : which contains "Category1" or "Category2" That's what you've got. If it's not matching what you expect it to, then i'm guessing your index
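The reading of `-ID:0 Category:Category1 Category:Category2` given above can be sketched without Lucene at all. The following is a plain-Java model of that logic, assuming a document is just a map from field name to a set of values; all class and method names are hypothetical, and it does not use or reproduce Lucene's BooleanQuery API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of one prohibited clause plus two optional clauses.
final class BooleanClauseSketch {
    // Builds a toy "document" with one ID and any number of categories.
    static Map<String, Set<String>> doc(String id, String... categories) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("ID", new HashSet<>(Arrays.asList(id)));
        d.put("Category", new HashSet<>(Arrays.asList(categories)));
        return d;
    }

    // True if the given field of the document contains the value.
    static boolean has(Map<String, Set<String>> d, String field, String value) {
        Set<String> values = d.get(field);
        return values != null && values.contains(value);
    }

    // -ID:0 Category:Category1 Category:Category2
    // With no required clauses, at least one optional clause must match,
    // and the prohibited clause must not match.
    static boolean matches(Map<String, Set<String>> d) {
        boolean prohibited = has(d, "ID", "0");
        boolean anyOptional = has(d, "Category", "Category1")
                || has(d, "Category", "Category2");
        return !prohibited && anyOptional;
    }
}
```

If a document that should pass this check is missing from real search results, the model suggests looking at what was actually indexed (for example, whether the Category values were tokenized or lowercased) rather than at the query structure.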

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too compl

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
On Wednesday 07 Dec 2005 19:00, Chris Hostetter wrote: > : In other words my BooleanQuery was more complex than I let on. > > I believe I understand what you are saying, but it's a little hard to make > sense of given the corrections and lack of context -- it sounds like it > should work. please pr

Re: words with more than 1 hyphen ?

2005-12-07 Thread Erik Hatcher
O 1. I modified the StandardTokenizer.jj file. Essentially, I added the following to StandardTokenizer.jj | )+"-"()+("-")*> Is that the only change you made to the .jj file? Where did you put that exactly? Don't you need a * after the second ? 4. I was able to index and retrieve wo

Re: Lucene performance bottlenecks

2005-12-07 Thread Andrzej Bialecki
Yonik Seeley wrote: if (b>0) return b; Doing an 'and' of two bytes and checking if the result is 0 probably requires masking operations on >8 bit processors... Sometimes you can get a peek into how a JVM would optimize things by looking at the asm output of the code from a C compiler. Bot

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Chris Hostetter
: In other words my BooleanQuery was more complex than I let on. I believe I understand what you are saying, but it's a little hard to make sense of given the corrections and lack of context -- it sounds like it should work. please print out query.toString() and make sure it looks like what you a

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Paul Elschot wrote: Querying the host field like this in a web page index can be dangerous business. For example when term1 is "wikipedia" and term2 is "org", the query will match at least all pages from wikipedia.org. Note that if you search for wikipedia.org in Nutch this is interpreted as a

words with more than 1 hyphen ?

2005-12-07 Thread Beady Geraghty
I am back to doing something with Lucene after a short break from it. I am trying to index/search hyphenated words, and retrieve them from a token stream. 1. I modified the StandardTokenizer.jj file. Essentially, I added the following to StandardTokenizer.jj | )+"-"()+("-")*> 2. I used Java

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
On Wednesday 07 Dec 2005 07:38, Alan Chandler wrote: > I am trying to construct, via individual query api, a query to search for > documents with a field name of "Category" and a value of either "Category1" > OR "Category2" (or both). > > My code to do this (given categories is the set of strings w

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
> if (b>0) return b; > Doing an 'and' of two bytes and checking if the result is 0 probably > requires masking operations on >8 bit processors... Sometimes you can get a peek into how a JVM would optimize things by looking at the asm output of the code from a C compiler. Both (b>=0) and ((b&0x80)!

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
On 12/7/05, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote: > Since 'byte' is signed in Java, can't the first test be simply written > as > if (b>0) return b; > Doing an 'and' of two bytes and checking if the result is 0 probably > requires masking operations on >8 bit processors... Yep, that was my

RE: Lucene performance bottlenecks

2005-12-07 Thread Vanlerberghe, Luc
Since 'byte' is signed in Java, can't the first test be simply written as if (b>0) return b; Doing an 'and' of two bytes and checking if the result is 0 probably requires masking operations on >8 bit processors... Also perhaps change to int b=readByte()) so that all operators use int's... Luc --

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
I checked out readVInt() to see if I could optimize it any... For a random distribution of integers <200 I was able to speed it up a little bit, but nothing to write home about: old new percent Java14-client : 13547 12468 8% Java14-server: 6047 5266 14% Java1
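For readers unfamiliar with the encoding being tuned in this thread, here is a minimal self-contained sketch of variable-length int coding of the kind readVInt() decodes: seven data bits per byte, with the high bit set when more bytes follow. It reads from a byte array instead of Lucene's input classes, and the class, Cursor, and method names are hypothetical; it is not the Lucene source or the optimized version measured above. The fast path tests `b >= 0`, relying on Java's signed byte, because zero is a valid single-byte value.

```java
import java.io.ByteArrayOutputStream;

// Illustrative VInt coder: 7 data bits per byte, high bit = "more bytes follow".
final class VIntSketch {
    // Tracks the read position so successive values can be decoded from one buffer.
    static final class Cursor {
        final byte[] buf;
        int pos;
        Cursor(byte[] buf) { this.buf = buf; }
    }

    // Encodes the value low-order group first.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Decodes one value and advances the cursor.
    static int readVInt(Cursor in) {
        byte b = in.buf[in.pos++];
        // byte is signed in Java: a clear high bit means b >= 0,
        // so the value fits in this single byte.
        if (b >= 0) return b;
        int value = b & 0x7F;
        for (int shift = 7; ; shift += 7) {
            b = in.buf[in.pos++];
            value |= (b & 0x7F) << shift;
            if (b >= 0) return value;
        }
    }
}
```

For small integers such as the values under 200 used in the benchmark, nearly every call returns from the single-byte or two-byte path, which is why the first comparison dominates the cost.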

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 9:56 AM, Alan Chandler wrote: Erik Hatcher writes: On Dec 7, 2005, at 7:06 AM, Alan Chandler wrote: Erik Hatcher writes: On Dec 7, 2005, at 2:38 AM, Alan Chandler wrote: Worse than that, when I attempt to access Hits.doc(0) I am getting an immediate IOException with the

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
Erik Hatcher writes: ... or use IndexReader to navigate to it. That is something I wanted to ask about IndexReader.TermPositions(Term t) Returns an object which returns all occurrences of term. Is that what I use to find the actual position in my documents of the search item? -- Alan

Re: repeating fields

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 8:48 AM, Reza Ghaffaripour wrote: I think having different documents will not be a good idea. for me each xml is an ebook. and "p" means paragraph. i have hundreds of paragraphs in every ebook. and i think i should keep each ebook in a single document. am i right ? How

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
Erik Hatcher writes: On Dec 7, 2005, at 7:06 AM, Alan Chandler wrote: Erik Hatcher writes: On Dec 7, 2005, at 2:38 AM, Alan Chandler wrote: Worse than that, when I attempt to access Hits.doc(0) I am getting an immediate IOException with the message "Bad file descriptor". I think ...

Re: Similarity scores for all docs

2005-12-07 Thread Grant Ingersoll
You can use the HitCollector mechanism to fill your array, but what you are doing is essentially what the Hits object already does, plus it provides caching Eugene Ezekiel wrote: Yes, but what I wanna be able to do is something like, fill an array of say size 100 such that: array[0] = similar

Re: repeating fields

2005-12-07 Thread Malcolm
That's what I have, loads of different tags and (abstract) tags etc in each xml document so a lucene document for each is okay. malcolm

Re: repeating fields

2005-12-07 Thread Reza Ghaffaripour
I think having different documents will not be a good idea. for me each xml is an ebook. and "p" means paragraph. i have hundreds of paragraphs in every ebook. and i think i should keep each ebook in a single document. am i right ? On 12/7/05, Malcolm <[EMAIL PROTECTED]> wrote: > > > Firstly you

Re: Non scoring search

2005-12-07 Thread Malcolm
Probably being very naive here but: These are my index details: Location:C:\LuceneDemo\Project6thDec Number of documents in Index: 571 Index Current Version: 2 Last Modified: 1133899684000 The index has not had any deletions. What is: Last Modified: 1133899684000? I thought of indexing a date fo

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 7:06 AM, Alan Chandler wrote: Erik Hatcher writes: On Dec 7, 2005, at 2:38 AM, Alan Chandler wrote: Worse than that, when I attempt to access Hits.doc(0) I am getting an immediate IOException with the message "Bad file descriptor". I think ... You must keep your Index

Re: repeating fields

2005-12-07 Thread Malcolm
Firstly you should obtain LUKE and check everything is laid out correctly in your index. Secondly maybe a Wildcard/prefix query or TermQuery. For example (TermQuery): TermQuery heTerm = new TermQuery( new Term("p", "x")); TermQuery sheTerm = new TermQuery( ne

Re: repeating fields

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 3:49 AM, Reza Ghaffaripour wrote: hi all, im new to lucene. i have an xml with repeating tags.something like : x xx xxx I add the "p" field as follows: myDocument.add(Field.Text("p", "x")); myDocument.add(Field.Text("p", "xx")); but when i search for "x" it returns t

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Alan Chandler
Erik Hatcher writes: On Dec 7, 2005, at 2:38 AM, Alan Chandler wrote: Worse than that, when I attempt to access Hits.doc(0) I am getting an immediate IOException with the message "Bad file descriptor". I think ... You must keep your IndexSearcher instance alive and well when working wi

Re: Lucene performance bottlenecks

2005-12-07 Thread Andrzej Bialecki
Paul Elschot wrote: On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote: Paul Elschot wrote: In somewhat more readable layout: +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:

Re: Similarity scores for all docs

2005-12-07 Thread Eugene Ezekiel
Yes, but what I wanna be able to do is something like, fill an array of say size 100 such that: array[0] = similarity value of query and doc(0) array[1] = similarity value of query and doc(1) Any idea how to fill this array? Thanks. -- Regards, Eugene Koji Sekiguchi wrote: You can get sco

RE: Similarity scores for all docs

2005-12-07 Thread Koji Sekiguchi
You can get scores by calling Hits.score(). So you should search at first to get Hits object. regards, Koji > -Original Message- > From: Eugene Ezekiel [mailto:[EMAIL PROTECTED] > Sent: Wednesday, December 07, 2005 6:03 PM > To: java-user@lucene.apache.org > Subject: Similarity scores fo

Similarity scores for all docs

2005-12-07 Thread Eugene Ezekiel
Hi, Is there any way to get the similarity scores for each document in the index? I can iterate thru each doc in the index using the IndexReader but not sure how to get the similarity score for that doc. Thanks. -- Regards, Eugene ---

repeating fields

2005-12-07 Thread Reza Ghaffaripour
hi all, im new to lucene. i have an xml with repeating tags.something like : x xx xxx I add the "p" field as follows: myDocument.add(Field.Text("p", "x")); myDocument.add(Field.Text("p", "xx")); but when i search for "x" it returns the first hit only. what should i do ? i want to search fo

New Lucene-based application.

2005-12-07 Thread victorn
kbforge.com is pleased to announce the first public release of "kbforge", a new, completely free, desktop search application specifically designed for software developers. What differentiates kbforge from other desktop search programs, is its ability to assist the user in categorising the info

Re: Confused about boolean query and how an IndexReader is associated with Hits

2005-12-07 Thread Erik Hatcher
On Dec 7, 2005, at 2:38 AM, Alan Chandler wrote: Worse than that, when I attempt to access Hits.doc(0) I am getting an immediate IOException with the message "Bad file descriptor". I think this must be because by that time I have closed the indexSearcher (and therefore the Reader that sat b