index concurrency & result order

2006-01-27 Thread kate
hi list, i'm trying to use Lucene (1.4.3) to replace an existing MySQL search system. so far, this is working great, but i have a couple of questions. firstly, when my index updater is (re)indexing a lot of documents at once, i often get errors like "FileNotFoundException: /usr/local/searchin

Re: How does the lucene normalize the score?

2006-01-27 Thread Chris Hostetter
: ..but this means, that the scores are not comparable across queries, because a hit with the score '0.7' from one query mustn't be as 'good' as a '0.7' from another query...and this is only the case, whether the original, unnormalized top score value was less than 1.0.
Scores are not compa

RE: Help with indexing and query strategy

2006-01-27 Thread Colin Young
1) Yes. One location per document.
2) Using the SimpleAnalyzer (for now). I have city, state and country as separate fields, so I could tokenize each as a single token if that would work better. I think that avoids the need for a delimiter at index time.
3) I am not making any assumptions now at

RE: problem updating a document: no segments file?

2006-01-27 Thread John Powers
this code works in a couple other boxes as is. that deleting code removes the active index after this one builds in a different location. then the searcher is told to make this newest one the current and the old one is deleted. it affects directories and their entire contents. it would

RE: problem updating a document: no segments file?

2006-01-27 Thread Chris Hostetter
: It's still not keeping the segments file around. Is that necessary? You seem to have some code at the end that (i'm guessing) is supposed to remove older copies of the index. Are you sure that code does what you think it does? Have you tried commenting it out and seeing if that fixes your pro

Re: How to find "function()" - ?

2006-01-27 Thread Michael D. Curtin
Dmitry Goldenberg wrote: Hi, I'm trying to figure out a way to locate tokens which include special characters. The actual text in the file being indexed is something like "function() { statement1; statement2; }" The query I'm using is "function\()" since I want to locate precisely "function

How to find "function()" - ?

2006-01-27 Thread Dmitry Goldenberg
Hi, I'm trying to figure out a way to locate tokens which include special characters. The actual text in the file being indexed is something like "function() { statement1; statement2; }" The query I'm using is "function\()" since I want to locate precisely "function()" - the query succeeds

Re: Help with indexing and query strategy

2006-01-27 Thread Rajesh Munavalli
Few questions.
(1) Does each document contain only one geographical location?
(2) Given a document, how are you tokenizing it into city, state and country? I am assuming "," as the delimiter here. Otherwise determining the boundary for names like "St. Louis du Ha Ha" would be difficult.
(3) Are t

RE: Help with indexing and query strategy

2006-01-27 Thread Colin Young
The reason I only want 2 hits is because [2] is more "specific" in my domain -- I could also have Toronto, Ontario; Kingston, Ontario etc. which would take the hits up to 5 now. What I'm really after is finding a way to index and search that would make [2] an invalid retrieval. My latest attempt

Re: Help with indexing and query strategy

2006-01-27 Thread Rajesh Munavalli
Hi Colin, Even assuming you came up with a good way of indexing, the example query "Ontario, CA" should yield 3 hits. All 2, 3 and 4 are valid retrievals. Could you please justify which 2 hits you want and why? Thanks, Rajesh Munavalli Colin Young wrote: I'm having some trouble comi

Re: Help with indexing and query strategy

2006-01-27 Thread Rajesh Munavalli
Hi Colin, Even assuming you came up with a good way of indexing, the example query "Ontario, CA" should yield 3 hits. All 2, 3 and 4 are valid retrievals. Could you please justify which 2 hits you want and why? Thanks, Rajesh Munavalli On 1/27/06, Colin Young <[EMAIL PROTECTED]> wrote: >

Re: [SPAM] - Re: Performance tips? - Sending mail server found on bl.spamcop.net

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find a class called MapDirectory or MMapDirectory. It is post-1.4. You can download a nightly build of the current trunk at: http://cvs.apache.org/dist/lucene/java/nightly/ Doug ---

Help with indexing and query strategy

2006-01-27 Thread Colin Young
I'm having some trouble coming up with a good search strategy for geographical data. e.g., given:
[1] city: London, United Kingdom
[2] city: London, Ontario, Canada
[3] city: Ontario, California, United States
[4] state: Ontario, Canada
[5] city: Vancouver, Washington, United States
[6] city: Va

RE: problem updating a document: no segments file?

2006-01-27 Thread John Powers
The lucene info is:
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.6.1
Created-By: Apache Jakarta
Name: org/apache/lucene
Specification-Title: Lucene Search Engine
Specification-Version: 1.4.3
Specification-Vendor: Lucene
Implementation-Title: org.apache.lucene
Implementation-Version: build 2004-

Re: How does the lucene normalize the score?

2006-01-27 Thread Yonik Seeley
On 1/27/06, Chris Lamprecht <[EMAIL PROTECTED]> wrote: > Actually, I just looked at the code, and it actually does this by > taking 1/maxScore and then multiplying this by each score (equivalent > results in the end, maybe more efficient(?)). Very much so... fdiv commonly takes 20 to 40 clock cycl

Re: How does the lucene normalize the score?

2006-01-27 Thread duiduder
..but this means, that the scores are not comparable across queries, because a hit with the score '0.7' from one query mustn't be as 'good' as a '0.7' from another query...and this is only the case, whether the original, unnormalized top score value was less than 1.0. Looks this really like a fea

RE: [SPAM] - Re: Performance tips? - Sending mail server found on bl.spamcop.net

2006-01-27 Thread Daniel Pfeifer
Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find a class called MapDirectory or MMapDirectory. /Daniel -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: den 27 januari 2006 11:43 To: java-user@lucene.apache.org Subject: [SPAM] - Re: Performance

Re: encoding

2006-01-27 Thread John Haxby
petite_abeille wrote: I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :)) [snip] [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/u

Re: Performance tips?

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: We are sporting Solaris 10 on a Sun Fire-machine with four cores and 12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching to FSDirectory and hope for the best. Or, since you're on a 64-bit platform, try MMapDirectory, which supports greater parallelism

Re: How does the lucene normalize the score?

2006-01-27 Thread xing jiang
hi, thank you for your help. On 1/27/06, Chris Lamprecht <[EMAIL PROTECTED]> wrote: > > It takes the highest scoring document, if greater than 1.0, and > divides every hit's score by this number, leaving them all <= 1.0. > Actually, I just looked at the code, and it actually does this by > takin

Re: Getting the document number (with IndexReader)

2006-01-27 Thread Paul Elschot
On Friday 27 January 2006 02:36, Chun Wei Ho wrote: > Thanks for the info :) One last related question. > > If I delete documents using a IndexReader(), can I assume that the > internal document numbers of other undeleted documents (obtained using > the same IndexReader instance) will not change u

RE: Performance tips?

2006-01-27 Thread Daniel Pfeifer
Well, We are sporting Solaris 10 on a Sun Fire-machine with four cores and 12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching to FSDirectory and hope for the best. -Original Message- From: Chris Lamprecht [mailto:[EMAIL PROTECTED] Sent: den 27 januari 2006 08:50 To:

Re: How does the lucene normalize the score?

2006-01-27 Thread Chris Lamprecht
It takes the highest scoring document, if greater than 1.0, and divides every hit's score by this number, leaving them all <= 1.0. Actually, I just looked at the code, and it actually does this by taking 1/maxScore and then multiplying this by each score (equivalent results in the end, maybe more
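
The normalization described in the message above can be sketched roughly as follows. This is an illustration of the idea only, not Lucene's actual source: the class name ScoreNormalizer and the method normalize are hypothetical, and the raw scores are assumed to arrive as a plain float array. It computes 1/maxScore once and multiplies each score by it, and it leaves the scores untouched when the top score is not greater than 1.0.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not a class from the Lucene distribution.
final class ScoreNormalizer {

    /**
     * Returns a copy of rawScores in which every score is scaled by
     * 1/maxScore, but only when the highest raw score exceeds 1.0.
     * Otherwise the scores are returned unchanged.
     */
    static float[] normalize(float[] rawScores) {
        // Find the highest raw score.
        float maxScore = 0.0f;
        for (float score : rawScores) {
            maxScore = Math.max(maxScore, score);
        }

        float[] normalized = Arrays.copyOf(rawScores, rawScores.length);
        if (maxScore > 1.0f) {
            // One division up front, then one multiplication per hit.
            float scale = 1.0f / maxScore;
            for (int i = 0; i < normalized.length; i++) {
                normalized[i] *= scale;
            }
        }
        return normalized;
    }
}
```

With raw scores of 4.0, 2.0 and 1.0 the top hit is scaled down to 1.0 and the others keep their relative proportions. With raw scores of 0.7 and 0.35 nothing changes, which is why a normalized 0.7 from one query says little about a 0.7 from another.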