Greetings,
It's time for another awesome Seattle Hadoop/Lucene/Scalability/NoSQL Meetup!
As always, it's at the University of Washington, Allen Computer
Science building, Room 303 at 6:45pm. You can find a map here:
http://www.washington.edu/home/maps/southcentral.html?cse
Last month, we had a g
The API is definitely confusing.
setBottom is called by Lucene to notify your FieldComparator which
slot holds the "weakest" entry. At that point you can cache that
entry (e.g., IntComparator stores the bottom int value at that point),
or simply store that bottom slot in an instance variable.
The
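To make the bottom-slot idea described above concrete, here is a minimal, self-contained sketch. It is not the Lucene FieldComparator API: the class BottomSlotSketch and its collect, findWeakestSlot and topValues methods are hypothetical, and only the names setBottom and compareBottom are borrowed from the thread. It assumes an ascending sort over int values, so the "weakest" entry is the largest value currently kept.

```java
// Hypothetical illustration only; this is not Lucene's FieldComparator.
final class BottomSlotSketch {
    private final int[] slots;   // values of the hits currently kept, one per slot
    private int filled = 0;      // number of slots in use
    private int bottomSlot = -1; // slot that holds the weakest entry
    private int bottom;          // cached value of the weakest entry

    BottomSlotSketch(int numHits) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        slots = new int[numHits];
    }

    // Records which slot holds the weakest entry and caches its value.
    void setBottom(int slot) {
        bottomSlot = slot;
        bottom = slots[slot];
    }

    // > 0: candidate sorts before the bottom entry (competitive),
    // < 0: candidate sorts after it, 0: equal.
    int compareBottom(int candidate) {
        return Integer.compare(bottom, candidate);
    }

    // Offers one value; it is kept only if it beats the cached bottom.
    void collect(int value) {
        if (filled < slots.length) {
            slots[filled++] = value;
            if (filled == slots.length) {
                setBottom(findWeakestSlot());
            }
        } else if (compareBottom(value) > 0) {
            slots[bottomSlot] = value; // overwrite the weakest entry
            setBottom(findWeakestSlot());
        }
    }

    // Linear scan for the largest (weakest) kept value.
    private int findWeakestSlot() {
        int weakest = 0;
        for (int i = 1; i < filled; i++) {
            if (slots[i] > slots[weakest]) {
                weakest = i;
            }
        }
        return weakest;
    }

    // Sorted copy of the kept values.
    int[] topValues() {
        int[] copy = java.util.Arrays.copyOf(slots, filled);
        java.util.Arrays.sort(copy);
        return copy;
    }
}
```

Because the bottom value is cached, most non-competitive candidates are rejected with a single comparison. The linear scan in findWeakestSlot is a simplification chosen to keep the sketch short.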
Hi Renaud,
You should be able to do your own merging by overriding the merge
method of Fields/Terms/PostingsConsumer classes, in your codec. Each
of these classes has a default impl for merge, which just does the
normal postings merging (fields/terms are merged,
docs/positions/payloads are concat
Joaquin, my usual methodology is not to optimize any scoring
params: be it BM25 params (I stick with your defaults), or lnb.ltc params (I
stick with the default slope). When doing query expansion I don't modify the
defaults for MoreLikeThis either.
I've found that changing these params can
On 16 Feb 2010, at 17:40, luciusvorenus wrote:
How can I build a web interface for my application? I read
something about HTML tables and PHP, but I have no idea how to do it.
Can anybody help me?
Lucius,
try Solr.
paul
Ok,
I'm not advocating the BM25 patch either; unfortunately BM25 was not my
idea :-))), and I'm sure that the implementation can be improved.
When you use the BM25 implementation, are you optimising the parameters
specifically per collection? (It is a key factor for improving BM25
performance).
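As a rough illustration of why per-collection tuning matters, the sketch below computes the term-frequency part of the textbook BM25 formula. This is an assumption-laden example, not code from the patch: the class Bm25TfSketch, the method tfWeight, and the parameter names k1, b, docLen and avgDocLen follow the usual BM25 notation and may not match the patch's own names or defaults.

```java
// Hypothetical helper based on the textbook BM25 formula, not the patch's code.
final class Bm25TfSketch {
    // tf: term frequency in the document
    // docLen / avgDocLen: document length relative to the collection average
    // k1: controls how quickly repeated terms saturate
    // b: controls how strongly long documents are penalised (0 = not at all)
    static double tfWeight(double tf, double docLen, double avgDocLen,
                           double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }
}
```

With b set to 0 the document length has no effect, while a larger b lowers the weight of a term in a document that is longer than average. How much length normalisation helps depends on how document lengths vary in a given collection, which is why these parameters are usually tuned per collection.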
Joaquin, I don't see this as a flame war? First of all I'd like to
personally thank you for your excellent BM25 implementation!
I think the selection of a retrieval model depends highly on the
language/indexing approach, i.e. if we were talking East Asian languages I
think we want a probabilistic
Just some final comments (as I said I'm not interested in flame wars),
If I obtain better results, there is no problem with pooling; otherwise it
is biased.
The only important thing (in my opinion) is that it cannot be said that
BM25 is a myth.
Yes, you are right, there is not just one ranking model
No, I mean that, gathering your previous emails, you have supplied these MAP
improvements:
SweetSpot: 15%
lnb.ltc: 24%
bm25: 21%
these are close enough that given the bias from a pooled collection (
http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf) I
wouldn't want to say for sure
By the end of the week, I will publish the results once we run the experiments
on a full collection. Are you talking about the bias caused by using a
sub-collection?
Thanks,
Ivan
--- On Tue, 2/16/10, Robert Muir wrote:
> From: Robert Muir
> Subject: Re: BM25 Scoring Patch
> To: java-user@l
Ivan, OK. It would be cool if you could list the MAP and bpref for the
different approaches you try (default Lucene, lnb.ltc, BM25), with or
without stemming.
As you reported previously, you got a 24% improvement with lnb.ltc (right?). I
am guessing that we won't be able to draw many conclusions at al
I don't think it's really a competition; I think preferably we should have
the flexibility to change the scoring model in Lucene, actually?
I have found lots of cases where VSM improves on BM25, but then again I
don't work with TREC stuff, as I work with non-English collections.
It doesn't contradi
Hi,
What exactly is the purpose of the compareBottom and setBottom functions? I
am using a higher numHits to create TopScoreDocCollectors and
TopFieldCollectors because I don't properly understand these functions.
I think they act as a filter that sends fewer documents to the comparators for sorting, but
I don't
Robert, Joaquin,
Sorry, I made an error reporting the results. The preliminary improvement is
around 21% (it's a reduced collection). I will have to run another test to get
the final numbers on the complete collection.
We are also planning to apply stemming. Right now we are trying to
By the way,
I don't want to start a VSM vs BM25 flame war, but I really believe that I
have to express my opinion as Robert has done. In my experience, I have
never found a case where VSM significantly improves on BM25. Maybe you can
find some cases under some very specific collection characteristics
Ivan just a little more food for thought to help you with this:
I'm glad you got improved results, yet I stand by my original statement of
'be careful' interpreting too much from one collection.
e.g., had you chosen TREC-4 instead of TREC-3, you would see different
results, as vector-space with non
That sounds reasonable. Patch?
On Feb 15, 2010, at 10:29 AM, Peter Keegan wrote:
> The 'explain' method in PayloadNearSpanScorer assumes the
> AveragePayloadFunction was used. I don't see an easy way to override this
> because 'payloadsSeen' and 'payloadScore' are private/protected. It seems
> l
Hi Ivan,
the problem is that, unfortunately, BM25 cannot be implemented by
overriding the Similarity interface. Therefore BM25Similarity
only computes the classic probabilistic IDF (which is
relevant only at search time).
If you set BM25Similarity at indexing time,
some basic stats are not stored
cor
Cool! I saw you were using StandardAnalyzer too; maybe you want to try using
stemming also (as this analyzer does not do stemming)... it usually helps.
On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov wrote:
> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set simil
Joaquin, Robert,
I followed Joaquin's recommendation and removed the call to set similarity to
BM25 explicitly (indexer, searcher). The results showed a 55% improvement for
the MAP score (0.141->0.219) over default similarity.
Joaquin, how would setting the similarity to BM25 explicitly make t
Hello,
How can I build a web interface for my application? I read something about
HTML tables and PHP, but I have no idea how to do it.
Can anybody help me?
Thank you,
Lucius
--
View this message in context:
http://old.nabble.com/lucene-webinterface-tp27611202p27611202.html
Sent from the Lucene - Java Users mail
Yes Ivan, if possible please report back any findings you can on the
experiments you are doing!
On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <
joaquin.pe...@lsi.uned.es> wrote:
> Hi Ivan,
>
> You shouldn't set the BM25Similarity for indexing or searching.
> Please try removing the lin
Hi Ivan,
You shouldn't set the BM25Similarity for indexing or searching.
Please try removing the lines:
writer.setSimilarity(new BM25Similarity());
searcher.setSimilarity(sim);
Please let us/me know if you improve your results with these changes.
Robert Muir wrote:
Hi Ivan, I've seen
Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
default Similarity. Perhaps this is just another one?
Again, while I have not worked with this particular collection, I looked at
the statistics and noted that it's composed of several 'sub-collections': for
example the PAT docume
I applied the Lucene patch mentioned in
https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on
the TREC-3 collection using topics 151-200. I am now getting worse results
compared to Lucene's DefaultSimilarity. I suspect I am not using it correctly.
I have single field docum
Hi,
I would like to start creating a codec with my own set of index files
(instead of using the ones from the Standard codec). I have multiple
questions (I haven't yet found answers by myself) like:
how to specify how these files should be merged? Is it automatically
done by the Codec interfa
Hi,
Previously I was using 2.9 (upgraded from 2.4, but I did not fix the warnings
etc.). Now I have upgraded to 3.0, so I had to fix all the deprecated
methods etc. My question is about the Version parameter in some
Token* classes.
Some of our customers have our product with Lucene 2.4 (some upgraded
from 2.
The problem is the following:
The docFreq of the term "lucéne" is 2, all other terms have 1 (because
StandardAnalyzer lowercases everything). What happens is that terms with lower
docFreq get a higher score in TermQuery. This score outweighs the boosting
done by FuzzyQuery (because you index i
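The docFreq effect described above can be sketched with a small calculation. This assumes the classic idf formula used by Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)); the class IdfSketch and the numDocs value of 10 used below are illustrative assumptions, not taken from the thread.

```java
// Hypothetical helper; assumes the DefaultSimilarity idf formula.
final class IdfSketch {
    // idf = 1 + ln(numDocs / (docFreq + 1)): rarer terms get a larger value.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10; // assumed index size, for illustration only
        System.out.println("docFreq=1: " + idf(1, numDocs));
        System.out.println("docFreq=2: " + idf(2, numDocs));
    }
}
```

Under these assumptions the term with docFreq 1 gets a larger idf than the term with docFreq 2, so a rarer but worse-matching term can end up with the higher score.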
Thanks a lot,
But I still don't understand why raising the min similarity a little
changes the ordering...
markharw00d wrote:
>
> This could be down to IDF ie "Lucane" is ranked higher because it is rarer
> despite having worse edit distance.
> This is arguably a bug.
> See http://issues.apa