Re: ORs and Ranks

2009-01-06 Thread Walt Stoneburner
Erick, Thanks for taking a moment to address my question. I suspect the confusion expressed in the answer was from a slight transcription error that added additional punctuation. In your reply, the query was expressed using fields (note the extra use of colons that changes the query m

ORs and Ranks

2009-01-05 Thread Walt Stoneburner
Got an interesting question about Lucene's behavior, as recently I was handed something that looks like this: ( +MEDICAL CAT^2 ) OR ( +ANIMAL CAT^-2 ) The intention of the query is to say "if medical is found, then rank cat [scans] high, but if animal is found then rank cat [a feline] low." Pr

Re: Rank based on lists.

2007-08-14 Thread Walt Stoneburner
up in all three queries, this has substantial meaning to me, and it now becomes the most important document. I'd ideally like to have the union of all the query results returned, but with my document ranked at the top. I'm getting this sinking feeling of post-processing a return

Rank based on lists.

2007-08-13 Thread Walt Stoneburner
a match of X Y, all of list two, would rank less than a match of A X, which had hits in both. Is this even possible, or does Lucene not have facilities for lists, sets, groups, or whatever makes sense to call them? -Walt Stoneburner, [EMAIL PROTECTED] http://www.wwc

Standard Analyzer Escapes

2007-07-13 Thread Walt Stoneburner
whoa, not expecting that. AAA&&BBB&&CCC&&DDD becomes aaa bbb ccc ddd ...if && means AND, ok... AAA\&&BBB\&&CCC\&&DDD no change aaa bbb ccc ddd AAA\&\&BBB\&\&

Auto Slop

2007-07-03 Thread Walt Stoneburner
the sequence of tokens. Hope this helps someone else in the future, -Walt Stoneburner, http://www.wwco.com/~wls/blog/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Reusing Document Objects (was Auto Slop)

2007-07-02 Thread Walt Stoneburner
If I create a Document object, can I pass it to multiple index writers without harm? Or, does the process of being handed to an Index Writer somehow mutate the state of the Document object, say during tokenizing, that would cause its re-use with a totally separate index to cause problems ...such

Auto Slop

2007-07-02 Thread Walt Stoneburner
I just ran into an interesting problem today, and wanted to know if it was my understanding or Lucene that was out of whack -- right now I'm leaning toward a fault between the chair and the keyboard. I attempted to do a simple phrase query using the StandardAnalyzer: "United States" Against my c

RE: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Walt Stoneburner
Antoine Baudoux writes: I want to be able to give a score to each collection. Keep in mind, Lucene is computing a score based on quite a number of things from how often a term is used in a document, how often it appears in the collection of documents, how long the query is, etc. If your concep

The values which compute scores. - Part II

2007-06-01 Thread Walt Stoneburner
I've managed to build my own Similarity class, plug it in, and use Explain to convince myself that I am, indeed, getting the correct weightings that I desire. My test case documents are yielding precisely the intermediate values needed for alternate scoring. There's just one thing... When I do

Re: The values which compute scores.

2007-05-31 Thread Walt Stoneburner
Grant writes: One question that comes to mind, is what are you looking to do? What I'm trying to do is prevent Lucene from providing better ranking for documents that use a term multiple times than those that have more term hits. I've got some huge queries with quite a number of unique terms.

The values which compute scores.

2007-05-30 Thread Walt Stoneburner
Hopefully I'm not opening myself up to public ridicule with what may be a very stupid question, but... At the moment, I'm trying to wrap my head around some of the math that happens when Lucene does scoring. Let's put aside the big equation for a moment and focus on a simple method, such as tf()

Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

2007-05-25 Thread Walt Stoneburner
In reading the math for scoring at the bottom of: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html It appears that if I can make tf() and idf(), term frequency and inverse document frequency respectively, both return 1, then coord(), w
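The idea in this snippet (flatten tf() and idf() so that the coordination factor decides the ranking) can be tried out with a toy model. The following is only a rough sketch in Python, not Lucene code: the `score` function, the `flat_tf` and `flat_idf` helpers, and the sample documents are all hypothetical, and the formula deliberately leaves out the field norms, query norm, and boosts that the real Similarity math includes. It assumes the classic tf of sqrt(frequency) and a coord of matched terms divided by query terms.

```python
import math

def score(query_terms, doc_tokens, tf, idf):
    """Toy score: coord * sum over matched terms of tf(freq) * idf(term)^2.

    A simplification for illustration; norms and boosts are ignored.
    """
    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)  # fraction of query terms hit
    return coord * sum(tf(doc_tokens.count(t)) * idf(t) ** 2 for t in matched)

# Assumed "classic" term frequency: square root of the raw count.
classic_tf = math.sqrt

# Hypothetical flattened functions: every hit counts once, every term weighs 1.
def flat_tf(freq):
    return 1.0 if freq > 0 else 0.0

def flat_idf(term):
    return 1.0

query = ["bird", "cat", "dog"]
repeats = ["bird"] * 100      # one query term, used many times
variety = ["bird", "cat"]     # two query terms, used once each

# With sqrt-based tf the repetitive document wins; with flat tf the
# document that hits more unique terms wins.
classic_scores = (score(query, repeats, classic_tf, flat_idf),
                  score(query, variety, classic_tf, flat_idf))
flat_scores = (score(query, repeats, flat_tf, flat_idf),
               score(query, variety, flat_tf, flat_idf))
```

In this toy model, once tf and idf are constant, the score depends only on how many distinct query terms matched, which is the behavior the thread is after.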

Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

2007-05-25 Thread Walt Stoneburner
Grant writes: Have a look at the DisjunctionMaxQuery, I think it might help, although I am not sure it will fully cover your case. The definition for DisjunctionMaxQuery is provided at this URL: http://incubator.apache.org/lucene.net/docs/2.1/Lucene.Net.Search.DisjunctionMaxQuery.html, Grossly

Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

2007-05-24 Thread Walt Stoneburner
Hi, I'm trying to figure out what I need to do with Lucene to score a document higher when it has a larger number of unique search terms that are hit, rather than term frequency counts. A quick example. If I'm searching for "BIRD CAT DOG" (all should clauses), then I want ...a document with B

Re: Mixing Case and Case-Insensitive Searching

2007-05-12 Thread Walt Stoneburner
modifications to the parser anyhow. I also wasn't too hot on further overloading the meaning of a symbol. Thanks for the feedback. I hope that at least knowing it IS possible to do helps some poor soul. -Walt Stoneburner, http://www.wwco.com/~wls/blog/

Mixing Case and Case-Insensitive Searching

2007-05-11 Thread Walt Stoneburner
sample documents, ingesting, and inspecting them with Luke, while stepping through source looking at generated queries, it all seems to be working perfectly. I'd love to see a formal syntax like this officially enter the Lucene standard query language someday. If someone can point me at how to do this without twiddling Lucene's code directly, I'd be happy to contribute the modification. -Walt Stoneburner http://www.wwco.com/~wls/blog/

Proximity searching with subexpressions

2007-05-09 Thread Walt Stoneburner
Surprisingly, I didn't see a difference between a simple case like "A B" and "A B"~10 in the explanation. Is this a problem with Luke, or did I possibly miss something trivial? Anyhow, the more important question is: I

Mixing Case and Case-Insensitive Searching

2007-04-17 Thread Walt Stoneburner
do things like: "LET organization" (where LET is case sensitive, but part of the phrase) "company LET"~10 (again, where LET is case sensitive, near the term company which is case insensitive) Would love to get some thoughts on how t

Re: Standard Parser Behavior

2007-04-11 Thread Walt Stoneburner
Mike Klaas elaborates on syntax: +(-A +B) -> must match (-A +B) -> must contain B and must not contain A -(-A +B) -> must not match (-A +B) -> must not (match B and not contain A) Ok, the take-away from this I'm getting is that these clauses read very much like English and behave just the same.
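The readings above can be checked against a small toy evaluator. This is a sketch in Python, not Lucene's implementation: the `matches` function and the nested-list query representation are hypothetical, scoring is ignored entirely, and it assumes two rules discussed in these threads, namely that prohibited clauses only remove documents and that a group with nothing but prohibited clauses selects no documents.

```python
def matches(clause, doc_terms):
    """Return True if the document (a set of terms) satisfies the clause.

    A clause is either a term string or a list of (occur, clause) pairs,
    where occur is '+' (must), '-' (must not), or '' (should).
    """
    if isinstance(clause, str):
        return clause in doc_terms
    must = [c for occur, c in clause if occur == "+"]
    must_not = [c for occur, c in clause if occur == "-"]
    should = [c for occur, c in clause if occur == ""]
    # Any prohibited clause that matches rules the document out.
    if any(matches(c, doc_terms) for c in must_not):
        return False
    # Required clauses, when present, must all match.
    if must:
        return all(matches(c, doc_terms) for c in must)
    # Otherwise at least one optional clause must match; a group made up
    # only of prohibited clauses therefore matches nothing.
    return any(matches(c, doc_terms) for c in should)

inner = [("-", "A"), ("+", "B")]             # (-A +B)
required_group = [("+", inner)]              # +(-A +B)
excluded_group = [("+", "C"), ("-", inner)]  # +C -(-A +B)
only_negatives = [("+", [("-", "A"), ("-", "B")])]  # +(-A -B)

results = {
    "required, doc has B": matches(required_group, {"B"}),
    "required, doc has A and B": matches(required_group, {"A", "B"}),
    "excluded, doc has C and B": matches(excluded_group, {"C", "B"}),
    "excluded, doc has A, B, C": matches(excluded_group, {"A", "B", "C"}),
    "only negatives, doc has C": matches(only_negatives, {"C"}),
}
```

Under these assumptions, a group built only from prohibited clauses never selects anything on its own, so in this toy model it has to be paired with at least one positive clause to have any effect.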

Re: Standard Parser Behavior

2007-04-10 Thread Walt Stoneburner
Steven Parkes points out: Lucene doesn't use a pure Boolean algebra, so things don't always do what one might expect and things like De Morgan's law don't hold. You're exactly on to what I was pondering about. With boolean logic, I understand the operators inside and out, so something like De

Re: Standard Parser Behavior

2007-04-09 Thread Walt Stoneburner
Otis Gospodnetic <[EMAIL PROTECTED]> responded to Walt Stoneburner: Purely negative queries don't work. Example: -A will not find all documents that do not have "A". What I'm trying to do is augment an existing query by appending qualifiers. If I search for +HORSE

Standard Parser Behavior

2007-04-08 Thread Walt Stoneburner
logic at all, but set operations. A co-worker of mine came up with an interesting syntax, and I had no idea what it meant either: +( -A -B ) ...which to him meant "must have no A and no B". Can anyone clarify how + and - work on groups, and if the above has any coherent meaning? -Walt Stoneburner

Search vs. Rank

2007-03-14 Thread Walt Stoneburner
Most search engine technologies return result sets based on some weighted frequency of the search terms found. I've got a new problem: I want to rank by different criteria than I searched for. For example, I might want to return as my result set all documents that contain the word pizza, but rank t

Finding matched terms

2007-03-13 Thread Walt Stoneburner
When performing a query and getting a result set back, if one wants to know which terms from the query actually matched, is Highlighter still the best way to go with the latest Lucene, or should I start looking at query term frequency vectors? Just trying to find a non-expensive way of doing this

Index a source, but not store it... can it be done?

2007-03-08 Thread Walt Stoneburner
Have an interesting scenario I'd like to get your take on with respect to Lucene: A data provider (e.g. someone with a private website or corporately shared directory of proprietary documents) has requested their content be indexed with Lucene so employees can be redirected to it, but provisional

Re: Soliciting Design Thoughts on Date Searching

2007-03-05 Thread Walt Stoneburner
Erick / Steve, Thank you both (as well as everyone else who weighed in) for helping get to a far more optimal solution well before any code was ever slung. Since we all know that someone else is going to find this in the archives some day, I'd like to unveil the rest of my ignorance and misconc

Re: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread Walt Stoneburner
Thank you all for the suggestions steering me down the right path. As an aside, the easy part, at least for me, is extracting the dates -- Peter was dead on about how to do that: heuristics, multiple regular expressions, and data structures. As Steve pointed out, this isn't as trivial as it soun

Re: Soliciting Design Thoughts on Date Searching

2007-02-28 Thread Walt Stoneburner
Been searching http://www.gossamer-threads.com/lists/lucene/java-user/ as Erick suggested; man, is there a wealth of information in the Lucene archives. I have found many examples of how to convert text to dates and back, how to search Date fields for various ranges, and so forth -- but I don't t

Soliciting Design Thoughts on Date Searching

2007-02-27 Thread Walt Stoneburner
't paint myself into a corner later? Thanks, -Walt Stoneburner

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
On 1/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: I'm guessing there is supposed to be some sort of table structure to the mail you sent ... it doesn't work in plain text mail readers so I'm not sure what you were trying to say. My bad, I was using GMail, and it was trying to produce a ver

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
* A B 1 2 2 1 0 0 0 1* 0* Non-zero results are returned to the user. *Walt Stoneburner <[EMAIL PROTECTED]> 10-Jan-2007 v1.0* -wls

Re: Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
On 1/10/07, Mark Miller <[EMAIL PROTECTED]> wrote: The subtle part is that a scoring system is being used that operates in something of a boolean fashion, but that has subtle differences. Mark, -thank you-. This explains it beautifully. So, if I understand you right, a simple query of NOT OR

Getting a Better Understanding of Lucene's Search Operators

2007-01-10 Thread Walt Stoneburner
what's the difference in behavior? The fact that the documentation calls out these operators separately, gives them their own unique names, and describes them in different terms is enough to make me think something very important or very subtle is going on. If anyone could

Understanding Lucene Slop

2006-07-20 Thread Walt Stoneburner
he word "ships" and "fate" must be next to each other, but the order is unimportant, and that a slop of 7 or more would identify the first sentence (and also the second). Is there a way to direct Lucene that the order is, or is not, important? (e.g.