Seattle Hadoop/Lucene/NoSQL Meetup; Wed Feb 24th, Feat. MongoDB

2010-02-16 Thread Bradford Stephens
Greetings, It's time for another awesome Seattle Hadoop/Lucene/Scalability/NoSQL Meetup! As always, it's at the University of Washington, Allen Computer Science building, Room 303 at 6:45pm. You can find a map here: http://www.washington.edu/home/maps/southcentral.html?cse Last month, we had a g

Re: CompareBottom in FieldComparator

2010-02-16 Thread Michael McCandless
The API is definitely confusing. setBottom is called by Lucene to notify your FieldComparator which slot holds the "weakest" entry. You can at that point cache that entry (eg IntComparator stores the bottom int value at that point), or, simply store that bottom slot in an instance variable. The
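To illustrate the mechanism described in this reply, here is a minimal sketch in plain Java. It does not use any Lucene classes: `BottomSlotSketch`, its fields, and the `copy` helper are hypothetical names chosen for illustration, and the ascending int sort is an assumption. Only the method names setBottom and compareBottom come from the thread.

```java
// Standalone illustration of the "bottom slot" idea; NOT Lucene's FieldComparator.
final class BottomSlotSketch {
    private final int[] slotValues; // sort value currently held in each queue slot
    private final int[] docValues;  // hypothetical per-document sort values
    private int bottom;             // cached value of the weakest slot

    BottomSlotSketch(int numHits, int[] docValues) {
        this.slotValues = new int[numHits];
        this.docValues = docValues;
    }

    // Store a document's sort value in a queue slot (hypothetical helper).
    void copy(int slot, int doc) {
        slotValues[slot] = docValues[doc];
    }

    // The collector announces which slot is currently the weakest;
    // cache its value so later comparisons need no slot lookup.
    void setBottom(int slot) {
        bottom = slotValues[slot];
    }

    // Compare the cached bottom value against a candidate document.
    // With an ascending sort, a positive result means the candidate
    // sorts before the bottom entry and is therefore competitive.
    int compareBottom(int doc) {
        return Integer.compare(bottom, docValues[doc]);
    }

    public static void main(String[] args) {
        int[] docValues = {50, 10, 30, 5, 70};
        BottomSlotSketch cmp = new BottomSlotSketch(2, docValues);
        cmp.copy(0, 0); // slot 0 holds 50
        cmp.copy(1, 1); // slot 1 holds 10
        cmp.setBottom(0); // ascending sort: 50 is the weakest entry
        System.out.println("doc 2 competitive: " + (cmp.compareBottom(2) > 0));
        System.out.println("doc 4 competitive: " + (cmp.compareBottom(4) > 0));
    }
}
```

In this sketch a candidate is checked against a single cached value, so a document that cannot beat the weakest entry is rejected without touching the rest of the queue.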

Re: Flex & Segment Merging

2010-02-16 Thread Michael McCandless
Hi Renaud, You should be able to do your own merging by overriding the merge method of Fields/Terms/PostingsConsumer classes, in your codec. Each of these classes has a default impl for merge, which just does the normal postings merging (fields/terms are merged, docs/positions/payloads are concat

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Joaquin, I have a typical methodology where I don't optimize any scoring params: be it BM25 params (I stick with your defaults), or lnb.ltc params (I stick with the default slope). When doing query expansion I don't modify the defaults for MoreLikeThis either. I've found that changing these params can

Re: lucene webinterface

2010-02-16 Thread Paul Libbrecht
On 16 Feb 2010, at 17:40, luciusvorenus wrote: > How can I build a web interface for my application? I read something about HTML tables and PHP, but I have no idea how. Can anybody help me? Lucius, try Solr. paul

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
Ok, I'm not advocating the BM25 patch either; unfortunately BM25 was not my idea :-))), and I'm sure that the implementation can be improved. When you use the BM25 implementation, are you optimising the parameters specifically per collection? (It is a key factor for improving BM25 performance).

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Joaquin, I don't see this as a flame war? First of all I'd like to personally thank you for your excellent BM25 implementation! I think the selection of a retrieval model depends highly on the language/indexing approach, i.e. if we were talking East Asian languages I think we want a probabilistic

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
Just some final comments (as I said, I'm not interested in flame wars). If I obtain better results there is no problem with pooling; otherwise it is biased. The only important thing (in my opinion) is that it cannot be said that BM25 is a myth. Yes, you are right, there is no single ranking model

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
No, I mean that, gathering your previous emails, you have supplied these MAP improvements: SweetSpot: 15%, lnb.ltc: 24%, bm25: 21%. These are close enough that, given the bias from a pooled collection (http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf), I wouldn't want to say for sure

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
By the end of the week, I will publish the results once we run the experiments on a full collection. Are you talking about the bias caused by using a sub-collection? Thanks, Ivan --- On Tue, 2/16/10, Robert Muir wrote: > From: Robert Muir > Subject: Re: BM25 Scoring Patch > To: java-user@l

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Ivan, OK. It would be cool if you could list the MAP and bpref for the different approaches you try (default Lucene, lnb.ltc, BM25), with or without stemming. As you reported previously, you got a 24% improvement with lnb.ltc (right?). I am guessing that we won't be able to draw many conclusions at al

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
I don't think it's really a competition; I think preferably we should have the flexibility to change the scoring model in Lucene, actually. I have found lots of cases where VSM improves on BM25, but then again I don't work with TREC stuff, as I work with non-English collections. It doesn't contradi

CompareBottom in FieldComparator

2010-02-16 Thread Raimon Bosch
Hi, What exactly is the purpose of the compareBottom and setBottom functions? I am using a higher numHits to create TopScoreDocCollectors and TopFieldCollectors because I don't properly understand this function. I think it is a filter that sends fewer documents to the comparators for sorting, but I don't

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
Robert, Joaquin, Sorry, I made an error reporting the results. The preliminary improvement is around 21% (it's a reduced collection). I will have to run another test to get the final numbers on the complete collection. We are planning to also apply the stemming. Right now we are trying to

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
By the way, I don't want to start a flame war VSM vs BM25, but I really believe that I have to express my opinion as Robert has done. In my experience, I have never found a case where VSM improves significantly on BM25. Maybe you can find some cases under some very specific collection characteristics

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Ivan just a little more food for thought to help you with this: I'm glad you got improved results, yet I stand by my original statement of 'be careful' interpreting too much from one collection. eg. had you chosen TREC-4 instead of TREC-3, you would see different results, as vector-space with non

Re: PayloadNearSpanScorer explain method

2010-02-16 Thread Grant Ingersoll
That sounds reasonable. Patch? On Feb 15, 2010, at 10:29 AM, Peter Keegan wrote: > The 'explain' method in PayloadNearSpanScorer assumes the > AveragePayloadFunction was used. I don't see an easy way to override this > because 'payloadsSeen' and 'payloadScore' are private/protected. It seems > l

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
Hi Ivan, the problem is that unfortunately BM25 cannot be implemented by overriding the Similarity interface. Therefore BM25Similarity only computes the classic probabilistic IDF (which is of interest only at search time). If you set BM25Similarity at indexing time some basic stats are not stored cor

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Cool! I saw you were using StandardAnalyzer too; maybe you want to try using stemming also (as this analyzer does not do stemming)... it usually helps. On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov wrote: > Joaquin, Robert, > > I followed Joaquin's recommendation and removed the call to set simil

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
Joaquin, Robert, I followed Joaquin's recommendation and removed the call to set similarity to BM25 explicitly (indexer, searcher). The results showed a 55% improvement for the MAP score (0.141->0.219) over the default similarity. Joaquin, how would setting the similarity to BM25 explicitly make t
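The 55% figure in this message is a relative improvement over the baseline MAP. A minimal sketch of the arithmetic in plain Java follows; the class and method names are hypothetical, and only the two MAP values come from the message.

```java
// Relative improvement of one MAP score over a baseline (hypothetical helper).
final class MapImprovementSketch {
    // Returns (improved - baseline) / baseline as a fraction, e.g. 0.5 for +50%.
    static double relativeImprovement(double baseline, double improved) {
        return (improved - baseline) / baseline;
    }

    public static void main(String[] args) {
        double baselineMap = 0.141; // default similarity, as reported in the message
        double bm25Map = 0.219;     // BM25 run, as reported in the message
        double gain = relativeImprovement(baselineMap, bm25Map);
        System.out.printf("relative MAP improvement: %.1f%%%n", gain * 100.0);
    }
}
```

The absolute gain is 0.078 MAP points; dividing by the 0.141 baseline gives roughly 55%, which matches the figure quoted in the message.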

lucene webinterface

2010-02-16 Thread luciusvorenus
Hello, how can I build a web interface for my application? I read something about HTML tables and PHP, but I have no idea how. Can anybody help me? Thank you, Lucius

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Yes, Ivan, if possible please report back any findings you can on the experiments you are doing! On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias < joaquin.pe...@lsi.uned.es> wrote: > Hi Ivan, > > You shouldn't set the BM25Similarity for indexing or searching. > Please try removing the lin

Re: BM25 Scoring Patch

2010-02-16 Thread Joaquin Perez Iglesias
Hi Ivan, You shouldn't set the BM25Similarity for indexing or searching. Please try removing the lines: writer.setSimilarity(new BM25Similarity()); searcher.setSimilarity(sim); Please let us/me know if you improve your results with these changes. Robert Muir wrote: Hi Ivan, I've seen

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's default Similarity. Perhaps this is just another one? Again while I have not worked with this particular collection, I looked at the statistics and noted that it's composed of several 'sub-collections': for example the PAT docume

BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
I applied the Lucene patch mentioned in https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on the TREC-3 collection using topics 151-200. I am now getting worse results compared to Lucene DefaultSimilarity. I suspect I am not using it correctly. I have single field docum

Flex & Segment Merging

2010-02-16 Thread Renaud Delbru
Hi, I would like to start creating a codec with my own set of index files (instead of using the ones from the Standard codec). I have multiple questions (I haven't yet found answers by myself) like: how to specify how these files should be merged? Is it automatically done by the Codec interfa

questions on upgrading to 3.0: Version.LUCENE_* and Field.setOmitNorms()

2010-02-16 Thread jm
Hi, previously I was using 2.9 (upgraded from 2.4 but did not fix warnings etc). Now I have upgraded to 3.0, so I had to fix all deprecated methods etc. My question is with Version type parameter in some Token* classes. Some of our customers have our product with lucene 2.4 (some upgraded from 2.

RE: Strange Fuzzyquery results scoring when using a low minimal distance

2010-02-16 Thread Uwe Schindler
The problem is the following: The docFreq of the term "lucéne" is 2, all other terms have 1 (because StandardAnalyzer lowercases everything). What happens is that terms with lower docFreq get a higher score in TermQuery. This score outweighs the boosting done by FuzzyQuery (because you index i
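The effect described in this reply can be sketched numerically. The example below assumes the inverse document frequency formula used by Lucene's DefaultSimilarity at the time, idf = ln(numDocs / (docFreq + 1)) + 1, reimplemented here in plain Java rather than called from Lucene. The class name and the index size of 10 documents are hypothetical; only the docFreq values 1 and 2 come from the message.

```java
// Plain-Java reimplementation of an IDF formula; no Lucene classes are used.
final class IdfSketch {
    // Assumed formula: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 10; // hypothetical index size, not taken from the thread
        double idfRare = idf(1, numDocs);   // a term matching one document
        double idfCommon = idf(2, numDocs); // a term matching two documents
        System.out.println("idf(docFreq=1) = " + idfRare);
        System.out.println("idf(docFreq=2) = " + idfCommon);
        System.out.println("rarer term scores higher: " + (idfRare > idfCommon));
    }
}
```

Under this assumption the term with docFreq 1 always receives the larger IDF, so a rarer fuzzy match can outrank a closer one unless the edit-distance boost is large enough to compensate.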

Re: Strange Fuzzyquery results scoring when using a low minimal distance

2010-02-16 Thread stefcl
Thanks a lot, but I still don't understand why raising the min similarity a little changes the ordering... markharw00d wrote: > > This could be down to IDF ie "Lucane" is ranked higher because it is rarer > despite having worse edit distance. > This is arguably a bug. > See http://issues.apa