Re: SweetSpotSimilarity

2012-03-06 Thread Robert Muir
On Tue, Mar 6, 2012 at 5:57 PM, Paul Taylor wrote: >> Hello, >> what was previously Similarity in older releases has moved to TFIDFSimilarity: it extends Similarity and exposes a vector-space API, with the same formulas in its javadocs: >> https://builds.apache.org/view/G-L/view/Lucene/j

Re: SweetSpotSimilarity

2012-03-06 Thread Paul Taylor
On 05/03/2012 23:24, Robert Muir wrote: On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill wrote: I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge variability in the length of the content -- I can't think of any context where a "short" email is de

Re: SweetSpotSimilarity

2012-03-06 Thread Paul Taylor
On 05/03/2012 19:26, Chris Hostetter wrote: : very small to occasionally very large. It also might be the case that cover letters and e-mails, while short, might not really be something to heavily discount. The lower discount range can be ignored by setting the min of any sweet spot to 1.

Re: SweetSpotSimilarity

2012-03-05 Thread Robert Muir
On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill wrote: >> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge variability in the length of the content -- I can't think of any context where a "short" email is definitively better/worse than

RE: SweetSpotSimilarity

2012-03-05 Thread Paul Hill
> My only thought is that the new stuff seems to be at the expense of the formulas listed in the old class overview for Similarity. > http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/Similarity.html Oops, my bad

RE: SweetSpotSimilarity

2012-03-05 Thread Paul Hill
> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge variability in the length of the content -- I can't think of any context where a "short" email is definitively better/worse than a "long" email. More traditional TF/IDF seems like

RE: SweetSpotSimilarity

2012-03-05 Thread Chris Hostetter
: very small to occasionally very large. It also might be the case that cover letters and e-mails, while short, might not really be something to heavily discount. The lower discount range can be ignored by setting the min of any sweet spot to 1. Then one starts to wonder if there is r
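To see what setting the min of a sweet spot to 1 does, here is a minimal Python sketch. It assumes the length norm has the form 1 / sqrt(steepness * (|L - min| + |L - max| - (max - min)) + 1), which is my reading of the SweetSpotSimilarity javadocs and should be checked against them. The function name `sweet_spot_length_norm` and the example values (max of 100, steepness of 0.5) are made up for illustration and are not Lucene's API or defaults.

```python
import math


def sweet_spot_length_norm(num_terms, ln_min, ln_max, steepness):
    """Hypothetical helper: length norm with a flat plateau on [ln_min, ln_max].

    Assumed form: 1 / sqrt(steepness * (|L - min| + |L - max| - (max - min)) + 1).
    """
    # The distance term is 0 anywhere inside the sweet spot, so the norm is 1.0 there.
    penalty = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * penalty + 1.0)


if __name__ == "__main__":
    for length in (1, 5, 20, 50, 100, 200, 1000):
        with_floor = sweet_spot_length_norm(length, 20, 100, 0.5)
        min_of_one = sweet_spot_length_norm(length, 1, 100, 0.5)
        print(length, round(with_floor, 4), round(min_of_one, 4))
```

Under these assumptions, a 5-term field scores 0.25 with a sweet spot of 20 to 100 but 1.0 with a sweet spot of 1 to 100, so only fields longer than the max are discounted.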

RE: SweetSpotSimilarity

2012-03-01 Thread Paul Hill
[truncated URL query string, apparently the tail of a link to an online plot] It is hard

RE: SweetSpotSimilarity

2012-02-28 Thread Chris Hostetter
: I'll try to get some graphs committed and linked to from the javadocs that make it more clear how tweaking the settings affects the formula http://svn.apache.org/viewvc?rev=1294920&view=rev -Hoss

RE: SweetSpotSimilarity

2012-02-28 Thread Chris Hostetter
: A picture -- or more precisely a graph -- would be worth a 1000 words.
Fair enough. I think the reason I never committed one initially was because the formula in the javadocs was trivial to plot in gnuplot...
  gnuplot> min=0
  gnuplot> max=2
  gnuplot> base=1.3
  gnuplot> xoffset=10
  gnuplot> set
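For readers without gnuplot, the same curve can be sketched in Python. This is a minimal sketch, assuming the javadoc formula is tf(x) = min + (max - min)/2 * ((base^(x - xoffset) - base^-(x - xoffset)) / (base^(x - xoffset) + base^-(x - xoffset)) + 1). The function name `hyperbolic_tf` is made up for illustration and is not Lucene's API; the default parameter values are the ones from the gnuplot session quoted in this message.

```python
import math


def hyperbolic_tf(freq, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """Hypothetical helper: hyperbolic tf curve, assumed from the javadoc formula."""
    x = freq - xoffset
    # (base**x - base**-x) / (base**x + base**-x) is tanh(x * ln(base));
    # using tanh avoids float overflow for very large term frequencies.
    s = math.tanh(x * math.log(base))
    return tf_min + (tf_max - tf_min) / 2.0 * (s + 1.0)


if __name__ == "__main__":
    for freq in (1, 5, 10, 15, 20, 100):
        print(freq, round(hyperbolic_tf(freq), 4))
```

Under this assumption the curve passes through the midpoint of min and max when freq equals xoffset and flattens toward max as freq grows, which is the hard upper bound on tf discussed elsewhere in this thread.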

RE: SweetSpotSimilarity

2012-02-17 Thread Paul Allan Hill
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org] > As for what hyperbolicTf is trying to do ... it creates a hyperbolic function letting you specify a hard max no matter how many terms there are. A picture -- or more precisely a graph -- would be worth a

RE: SweetSpotSimilarity

2012-02-15 Thread Chris Hostetter
: sloppyFreq(distance). hyperbolicTf() only comes into play if you override the tf method in your own subclass to call it instead of the baselineTf which it normally calls. I also didn't get what it was trying to do. Correct, as documented... http://lucene.apache.org/core/old_versioned

RE: SweetSpotSimilarity

2012-02-15 Thread Paul Allan Hill
I'd love to hear what you find out. I have been working with this also. I only changed the sweet spot to a slightly larger range than the one in the original paper (but kept the same steepness) and I tweaked the sloppy freq to not score multiple occurrences of a phrase as strongly as they are i

Re: SweetSpotSimilarity

2011-07-21 Thread Ian Lea
Have you tried query time boosting of title queries? title:lucene^4 content:lucene. Might be easier than fiddling with sweetspot arguments, although I see from the javadocs that "A per field min/max can be specified if different fields have different sweet spots". Not sure if that is relevant to