Erick,
Thanks for taking a moment to address my question. I suspect the
confusion expressed in the answer was from a slight transcription error that
added additional punctuation.
In your reply, the query was expressed using fields (note the extra
colons, which change the query m
Got an interesting question about Lucene's behavior, as recently I was
handed something that looks like this:
( +MEDICAL CAT^2 ) OR ( +ANIMAL CAT^-2 )
The intention of the query is to say "if medical is found, then rank cat
[scans] high, but if animal is found then rank cat [a feline] low."
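The stated intent can be sketched in plain Python, outside Lucene. The function name `cat_contribution` and the weights +2 and -2 are illustrative assumptions borrowed from the boosts in the query above; this is not a claim about how Lucene's query parser reads that string.

```python
def cat_contribution(doc_terms):
    """Contribution of 'cat' to a document's score, depending on context terms."""
    terms = {t.lower() for t in doc_terms}
    if "cat" not in terms:
        return 0.0
    score = 0.0
    if "medical" in terms:
        score += 2.0  # assumed weight: cat [scan] ranks high in a medical context
    if "animal" in terms:
        score -= 2.0  # assumed weight: cat [feline] ranks low in an animal context
    return score
```

A document containing both context terms nets out to zero in this sketch, which may or may not be the desired behavior.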
Pr
up in all three queries,
this has substantial meaning to me, and it now becomes the most
important document.
I'd ideally like to have the union of all the query results
returned, but with my document ranked at the top.
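One way to picture that post-processing step, as a minimal sketch in plain Python: take the union of several result lists and rank each document by how many of the queries it matched. The function name `merge_results` and the use of string document ids are assumptions for illustration only.

```python
from collections import Counter

def merge_results(result_lists):
    """Union several result lists; documents hit by more queries rank first."""
    hits = Counter()
    for results in result_lists:
        for doc_id in set(results):  # count each query at most once per document
            hits[doc_id] += 1
    # Most queries matched first; ties broken by document id for a stable order.
    return sorted(hits, key=lambda d: (-hits[d], d))
```

A document that shows up in all three lists sorts ahead of everything else, while the full union is still returned.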
I'm getting this sinking feeling of post-processing a return
a match of X Y, all of list two, would rank less
than a match of A X, which had hits in both.
Is this even possible, or does Lucene not have facilities for lists,
sets, groups, or whatever makes sense to call them?
-Walt Stoneburner, [EMAIL PROTECTED]
http://www.wwc
whoa, not expecting that.
AAA&&BBB&&CCC&&DDD becomes aaa bbb ccc ddd ...if && means AND, ok...
AAA\&&BBB\&&CCC\&&DDD no change aaa bbb ccc ddd
AAA\&\&BBB\&\&am
the sequence
of tokens.
Hope this helps someone else in the future,
-Walt Stoneburner,
http://www.wwco.com/~wls/blog/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
If I create a Document object, can I pass it to multiple index writers
without harm?
Or, does the process of being handed to an Index Writer somehow mutate the
state of the Document object, say during tokenizing, that would cause its
re-use with a totally separate index to cause problems ...such
I just ran into an interesting problem today, and wanted to know if it
was my understanding or Lucene that was out of whack -- right now I'm
leaning toward a fault between the chair and the keyboard.
I attempted to do a simple phrase query using the StandardAnalyzer:
"United States"
Against my c
Antoine Baudoux writes:
I want to be able to give a score to each collection.
Keep in mind, Lucene is computing a score based on quite a number of
things from how often a term is used in a document, how often it
appears in the collection of documents, how long the query is, etc.
If your concep
I've managed to build my own Similarity class, plug it in, and use
Explain to convince myself that I am, indeed, getting the correct
weightings that I desire. My test case documents are yielding
precisely the intermediate values needed for alternate scoring.
There's just one thing...
When I do
Grant writes:
One question that comes to mind, is what are you looking to do?
What I'm trying to do is prevent Lucene from providing better ranking
for documents that use a term multiple times than those that have more
term hits.
I've got some huge queries with quite a number of unique terms.
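The ranking being asked for can be sketched in plain Python as a count of distinct query terms present, with term frequency ignored entirely. The name `unique_term_score` is hypothetical, and this only illustrates the goal; it is not Lucene's scoring formula.

```python
def unique_term_score(query_terms, doc_tokens):
    """Count distinct query terms present in the document, ignoring repeats."""
    query = {t.lower() for t in query_terms}
    doc = {t.lower() for t in doc_tokens}
    return len(query & doc)
```

With the query terms BIRD, CAT and DOG, a document that repeats "bird" many times scores 1, while a document mentioning "bird" and "cat" once each scores 2.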
Hopefully I'm not opening myself up to public ridicule with what may
be a very stupid question, but...
At the moment, I'm trying to wrap my head around some of the math that
happens when Lucene does scoring. Let's put aside the big equation
for a moment and focus on a simple method, such as tf()
In reading the math for scoring at the bottom of:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html
It appears that if I can make tf() and idf(), term frequency and
inverse document frequency respectively, both return 1, then coord(),
w
Grant writes:
Have a look at the DisjunctionMaxQuery, I think it might help,
although I am not sure it will fully cover your case.
The definition for DisjunctionMaxQuery is provided at this URL:
http://incubator.apache.org/lucene.net/docs/2.1/Lucene.Net.Search.DisjunctionMaxQuery.html,
Grossly
Hi,
I'm trying to figure out what I need to do with Lucene to score a
document higher when it has a larger number of unique search terms
that are hit, rather than term frequency counts.
A quick example.
If I'm searching for "BIRD CAT DOG" (all should clauses), then I want
...a document with B
modifications to the parser anyhow. I also wasn't too hot
on further overloading the meaning of a symbol.
Thanks for the feedback. I hope that at least knowing it IS possible
to do helps some poor soul.
-Walt Stoneburner,
http://www.wwco.com/~wls/blog/
sample documents,
ingesting, and inspecting them with Luke, while stepping through
source looking at generated queries, it all seems to be working
perfectly.
I'd love to see a formal syntax like this officially enter the Lucene
standard query language someday.
If someone can point me at how to do this without twiddling
Lucene's code directly, I'd be happy to contribute the modification.
-Walt Stoneburner
http://www.wwco.com/~wls/blog/
Surprisingly, I didn't see a difference between a simple case like "A
B" and "A B"~10 in the explanation.
Is this a problem with Luke, or did I possibly miss something trivial?
Anyhow, the more important question is: I
do things like:
"LET organization" (where LET is case sensitive, but part of the phrase)
"company LET"~10 (again, where LET is case sensitive, near the term
company which is case insensitive)
Would love to get some thoughts on how t
Mike Klaas elaborates on syntax:
+(-A +B) -> must match (-A +B) -> must contain B and must not contain A
-(-A +B) -> must not match (-A +B) -> must not (match B and not contain A)
Ok, the take-away from this I'm getting is that these clauses read very much
like English and behave just the same.
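The two readings above can be checked with a small sketch in plain Python that treats a document as a set of terms. The helper names are hypothetical, and the sketch models each group purely as a yes/no filter; it says nothing about scoring, or about whether a query made up only of a prohibited group returns anything by itself.

```python
def matches(doc_terms, must=(), must_not=()):
    """True if the doc contains every 'must' term and no 'must_not' term."""
    terms = set(doc_terms)
    return all(t in terms for t in must) and not any(t in terms for t in must_not)

def inner(doc_terms):
    # (-A +B): must contain B and must not contain A
    return matches(doc_terms, must=["B"], must_not=["A"])

def plus_group(doc_terms):
    # +(-A +B): the document must match the inner group
    return inner(doc_terms)

def minus_group(doc_terms):
    # -(-A +B): the document must not match the inner group
    return not inner(doc_terms)
```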
Steven Parkes points out:
Lucene doesn't use a pure Boolean algebra, so things don't always do
what one might expect and things like De Morgan's law don't hold.
You're exactly on to what I was pondering about. With boolean logic, I
understand the operators inside and out, so something like De
Otis Gospodnetic <[EMAIL PROTECTED]> responded to Walt Stoneburner:
Purely negative queries don't work. Example: -A will not find all
documents that do not have "A".
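A toy model in plain Python shows why this happens: prohibited terms only subtract from documents that some positive clause has already selected, so with no positive clause there is nothing to subtract from. The `search` function, its `match_all` flag, and the tiny index are assumptions made for illustration; in Lucene itself the usual analogue of `match_all` would be pairing the negative clause with a `MatchAllDocsQuery`.

```python
def search(index, must=(), must_not=(), match_all=False):
    """index maps doc id -> set of terms.

    Prohibited terms only filter documents selected by a positive clause
    (or by match_all); alone they select nothing.
    """
    if not must and not match_all:
        return set()  # purely negative query: no candidates to filter
    return {
        doc_id
        for doc_id, terms in index.items()
        if all(t in terms for t in must) and not any(t in terms for t in must_not)
    }
```

For example, with an index of three documents, `search(index, must_not=["A"])` returns an empty set, while adding `match_all=True` returns every document that lacks "A".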
What I'm trying to do is augment an existing query by appending qualifiers.
If I search for +HORSE
logic at all, but set operations.
A co-worker of mine came up with an interesting syntax, and I had no idea
what it meant either: +( -A -B ) ...which to him meant "must have no A
and no B".
Can anyone clarify how + and - work on groups, and if the above has any
coherent meaning?
-Walt Stoneburner
Most search engine technologies return result sets based on some weighted
frequency of the search terms found. I've got a new problem: I want to rank
by different criteria than I searched for.
For example, I might want to return as my result set all documents that
contain the word pizza, but rank t
When performing a query and getting a result set back, if one wants to
know which terms from the query actually matched, is Highlighter still
the best way to go with the latest Lucene, or should I start looking
at query term frequency vectors?
Just trying to find a non-expensive way of doing this
Have an interesting scenario I'd like to get your take on with respect
to Lucene:
A data provider (e.g. someone with a private website or corporately
shared directory of proprietary documents) has requested their content
be indexed with Lucene so employees can be redirected to it, but
provisional
Erick / Steve,
Thank you both (as well as everyone else who weighed in) for helping get to
a far more optimal solution well before any code was ever slung.
Since we all know that someone else is going to find this in the archives
some day, I'd like to unveil the rest of my ignorance and misconc
Thank you all for the suggestions steering me down the right path.
As an aside, the easy part, at least for me, is extracting the dates
-- Peter was dead on about how to do that: heuristics, multiple
regular expressions, and data structures. As Steve pointed out, this
isn't as trivial as it soun
Been searching http://www.gossamer-threads.com/lists/lucene/java-user/
as Erick suggested; man, is there a wealth of information in the
Lucene archives.
I have found many examples of how to convert text to dates and back,
how to search Date fields for various ranges, and so forth -- but I
don't t
't paint myself into a corner later?
Thanks,
-Walt Stoneburner
On 1/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
I'm guessing there is supposed to be some sort of table structure to the
mail you sent ... it doesn't work in plain text mail readers so I'm not
sure what you were trying to say.
My bad, I was using GMail, and it was trying to produce a ver
*
A B
1
2
2
1
0
0
0
1* 0*
Non-zero results are returned to the user. *Walt Stoneburner
<[EMAIL PROTECTED]> 10-Jan-2007 v1.0*
-wls
On 1/10/07, Mark Miller <[EMAIL PROTECTED]> wrote:
The subtle part is that a scoring system is being used that operates in
something of a boolean fashion, but that has subtle differences.
Mark, -thank you-. This explains it beautifully.
So, if I understand you right, a simple query of NOT OR
what's the difference in behavior?
The fact that the documentation calls out these operators separately,
gives them their own unique names, and describes them in different
terms is enough to make me think something very important or very
subtle is going on.
If anyone could
he word
"ships" and "fate" must be next to each other, but the order is
unimportant, and that a slop of 7 or more would identify the first
sentence (and also the second).
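To make the ordered versus unordered distinction concrete, here is a simplified sketch in plain Python. It checks whether two terms occur with at most `slop` tokens between them, optionally requiring the first term to come before the second. The function name, the `in_order` flag, and the sample sentence are assumptions for illustration. This is not Lucene's slop computation: for a sloppy PhraseQuery the slop is an edit distance over term positions, so matching a reversed pair costs extra slop (the PhraseQuery documentation says at least two). The nearest Lucene analogue of the flag is the `inOrder` argument of SpanNearQuery.

```python
def within_window(tokens, a, b, slop, in_order=False):
    """True if terms a and b occur with at most `slop` tokens between them.

    With in_order=True, a must appear before b.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            if i == j:
                continue  # same token position (only possible when a == b)
            if in_order and j < i:
                continue  # b precedes a, not allowed in ordered mode
            if abs(j - i) - 1 <= slop:
                return True
    return False
```

For the tokens of "the fate of the ships was sealed", the unordered check for "ships" and "fate" with a slop of 2 succeeds, while the ordered check for "ships" before "fate" fails.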
Is there a way to direct Lucene that the order is, or is not,
important? (e.g.