On Tue, Mar 6, 2012 at 5:57 PM, Paul Taylor wrote:
>> Hello,
>>
>> what was previously Similarity in older releases has moved to
>> TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
>> with the same formulas in the javadocs:
>>
>> https://builds.apache.org/view/G-L/view/Lucene/j
On 05/03/2012 23:24, Robert Muir wrote:
On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill wrote:
I would definitely not suggest using SSS for fields like legal brief text or
emails where there is huge
variability in the length of the content -- i can't think of any context where a
"short" email is
de
On 05/03/2012 19:26, Chris Hostetter wrote:
: very small to occasionally very large. It also might be the case that
: cover letters and e-mails while short might not be really something to
: heavily discount. The lower discount range can be ignored by setting
: the min of any sweet spot to 1.
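The effect of setting the sweet spot min to 1 can be checked with a small standalone sketch. It assumes the length-norm formula given in the SweetSpotSimilarity javadocs, 1/sqrt(steepness * (abs(x-min) + abs(x-max) - (max-min)) + 1). The class name, the max of 1000 and the steepness of 0.5 are illustrative assumptions, and this is not the Lucene class itself.

```java
public class SweetSpotLengthNormSketch {

    // Length norm as described in the SweetSpotSimilarity javadocs (assumed here):
    // 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1)
    static double lengthNorm(int numTerms, int min, int max, double steepness) {
        // This term is 0 for any length inside [min, max] and grows linearly outside it.
        double outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * outside + 1.0);
    }

    public static void main(String[] args) {
        int min = 1;            // sweet spot min of 1: short fields are never discounted
        int max = 1000;         // hypothetical upper bound, for illustration only
        double steepness = 0.5; // hypothetical steepness, for illustration only

        for (int numTerms : new int[] {1, 10, 500, 1000, 1002, 5000}) {
            System.out.printf("numTerms=%d norm=%.4f%n",
                    numTerms, lengthNorm(numTerms, min, max, steepness));
        }
    }
}
```

With min set to 1, every field length up to max gets a norm of 1.0, so only lengths above the sweet spot are discounted.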
On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill wrote:
>> I would definitely not suggest using SSS for fields like legal brief text or
>> emails where there is huge
>> variability in the length of the content -- i can't think of any context
>> where a "short" email is
>> definitively better/worse than
> -Original Message-
> My only thought is that the new stuff seems to be at the expense of the
> formulas listed in the old
> class overview for Similarity.
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/searc
> h/Similarity.html
Oops, my bad
> I would definitely not suggest using SSS for fields like legal brief text or
> emails where there is huge
> variability in the length of the content -- i can't think of any context
> where a "short" email is
> definitively better/worse than a "long" email. more traditional TF/IDF seems
> like
: very small to occasionally very large. It also might be the case that
: cover letters and e-mails while short might not be really something to
: heavily discount. The lower discount range can be ignored by setting
: the min of any sweet spot to 1. Then one starts to wonder if there is
: r
It is hard
: i'll try to get some graphs committed and linked to from the javadocs that
: make it more clear how tweaking the settings affects the formula
http://svn.apache.org/viewvc?rev=1294920&view=rev
-Hoss
: A picture -- or more precisely a graph -- would be worth a 1000 words.
fair enough. I think the reason i never committed one initially was
because the formula in the javadocs was trivial to plot in gnuplot...
gnuplot> min=0
gnuplot> max=2
gnuplot> base=1.3
gnuplot> xoffset=10
gnuplot> set
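For readers without gnuplot, the same curve can be tabulated with a small standalone sketch. It uses the parameter values from the gnuplot session above (min=0, max=2, base=1.3, xoffset=10) and assumes the hyperbolic formula given in the SweetSpotSimilarity javadocs, min + (max-min)/2 * (((base^(x-xoffset) - base^-(x-xoffset)) / (base^(x-xoffset) + base^-(x-xoffset))) + 1). The class name is hypothetical and this is not the Lucene implementation.

```java
public class HyperbolicTfSketch {

    // Parameter values taken from the gnuplot session in the thread.
    static final double MIN = 0.0;
    static final double MAX = 2.0;
    static final double BASE = 1.3;
    static final double X_OFFSET = 10.0;

    // The fraction (b^d - b^-d) / (b^d + b^-d) equals tanh(d * ln b), so
    // Math.tanh is used here to avoid overflow for very large frequencies.
    static double hyperbolicTf(double freq) {
        double d = freq - X_OFFSET;
        double tanh = Math.tanh(d * Math.log(BASE));
        return MIN + (MAX - MIN) / 2.0 * (tanh + 1.0);
    }

    public static void main(String[] args) {
        // The score rises with freq but never exceeds MAX.
        for (int freq : new int[] {1, 5, 10, 15, 20, 100, 100000}) {
            System.out.printf("freq=%d tf=%.4f%n", freq, hyperbolicTf(freq));
        }
    }
}
```

The curve passes through the midpoint between min and max at freq = xoffset and flattens out toward max, which is the hard cap on the tf score.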
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> As for what hyperbolicTf is trying to do ... it creates a hyperbolic function
> letting you specify a hard max
> no matter how many terms there are.
A picture -- or more precisely a graph -- would be worth a
: sloppyFreq(distance). hyperbolicTf() only comes into play if you
: override the tf method in your own subclass to call it instead of the
: baselineTf which it normally calls. I also didn't get what it was
: trying to do.
Correct, as documented...
http://lucene.apache.org/core/old_versioned
I'd love to hear what you find out. I have been working with this also.
I only changed the sweet spot to a slightly larger range than the one in the
original paper (but kept the same steepness) and I tweaked the sloppy freq to
not score multiple occurrences of a phrase as strongly as they are i
Have you tried query time boosting of title queries? title:lucene^4
content:lucene. Might be easier than fiddling with sweetspot
arguments, although I see from the javadocs that "A per field min/max
can be specified if different fields have different sweet spots". Not
sure if that is relevant to