Re: Please help to interpret Lucene Boost results

Erick Erickson Mon, 29 Sep 2008 09:11:40 -0700

OK, you're officially beyond where I can help. But the
rewritten query is your problem, and I'm going to appeal to
people who understand things waaay better than I do to answer
it. Can you recognize it when I run away <G>...


You might consider posting that question over on the Nutch
users' list if you don't get anything here, but I'm pretty sure
some folks who are very knowledgeable about Nutch will
see it here.

Best
Erick

On Mon, Sep 29, 2008 at 11:07 AM, student_t <[EMAIL PROTECTED]> wrote:

>
> Thank you Erick!
>
> I got Luke and it's a great tool! I verified in Luke that the queries I
> originally posted work as expected (i.e., "Canadian pepsi" produced fewer
> results than "pepsi" alone).
>
> Based on your suggestion, I found out the program re-wrote the query before
> it was sent to Nutch as the following:
>
> (content:"+(content:pepsi) +((host:ca^10.0)^10.0)")
>
> I think this is messing up the query results. Literally, this is what the
> code was doing:
>
> Query nutchQuery = new Query();
> nutchQuery.addRequiredTerm("+(content:pepsi) +((host:ca^10.0)^10.0)",
> "content");
> nutchBean.search(nutchQuery, maxHits, maxHitsPerSite);
>
> The above code must be using a default analyzer, which I have not identified
> yet. But after looking at the Luke results and this code, I think my problem
> is in the query construction. Could you shed some light on how the new query
> should be constructed from the following string?
>
> +(content:pepsi) +((host:ca^10.0)^10.0)
>
> Thanks again!
>
>
> Erick Erickson wrote:
> >
> > For Luke, see http://www.getopt.org/luke/ or just google lucene luke
> >
> > About using the same analyzer at index and search time: no, it's not
> > *required*, but unless and until you understand what analyzers do, I
> > would *strongly* recommend that you use the same one. See
> > PerFieldAnalyzerWrapper for using different analyzers for different
> > fields....
> >
> > Conceptually, think of an analyzer as giving you the "smallest meaningful
> > token" from your stream that your program then does something with,
> > either indexing it or searching on it.
> >
> > Consider the line "this is the 2nd Line in Erick's program". What are the
> > tokens?
> > Well, depending on the analyzer you could get
> > nd, line, erick, program (assuming your analyzer lowercases, strips
> > numbers and punctuation, and removes stopwords)
> >
> > You could get
> > "this is the 2nd Line in Erick's program" (assuming your analyzer did
> > nothing, just returned the entire input stream as a token, which would
> > then NOT get a hit searching for, say, 'Line')
> >
> > You could get
> > this, is, the, 2nd, line, in, erick's, program (assuming your analyzer
> > just lowercased and split on whitespace)
> >
> > You could get
> > nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
> > analyzer that always returned the word "nonsense")
> >
> > Really, any transformation you want to perform can be done in an
> > analyzer. So using different analyzers during indexing and searching
> > will very often produce "surprising" results. Don't do it unless you
> > know exactly what's being done to your input streams. Or be prepared to
> > spend a lot of hours tracking down "what went wrong" <G>.
> >
> > All that said, it doesn't explain the behaviour you're seeing because
> > you're right, adding successively more restrictions should produce
> > smaller result counts. I suspect that if you see what actual queries
> > you're generating (and when you get Luke, be sure to find the drop-down
> > for "which analyzer" to use) you'll find something surprising.
> >
> > Best
> > Erick
> >
> >
> > On Fri, Sep 26, 2008 at 6:57 PM, student_t <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Hi Erick,
> >>
> >> Thanks a bunch for your pointers. I will need to find out which analyzers
> >> are used at index and query time. But is it critical to use the same
> >> analyzer at both times?
> >>
> >> I tested with lucli against some of my local segment data and the queries
> >> appeared to work fine (i.e., their result sets are reasonable).
> >>
> >> Is Luke part of Lucene contrib? I recall there is a GUI that lets you
> >> view the indices. Would you please elaborate?
> >>
> >> Thanks again!
> >> student_t
> >>
> >>
> >> Erick Erickson wrote:
> >> >
> >> > That certainly doesn't look right. What analyzers are you using at
> >> > index and query time?
> >> >
> >> > Two things that will help track down what's really happening:
> >> >
> >> > 1> query.toString() is your friend.
> >> > 2> get a copy of the excellent Luke tool and have it do its explain
> >> > magic on your query. Watch that the analyzer you choose when querying
> >> > is what you expect....
> >> >
> >> > If neither of those things sheds any light on the problem, let us know
> >> > what you find....
> >> >
> >> > Best
> >> > Erick
> >> >
> >> > On Fri, Sep 26, 2008 at 3:55 PM, student_t <[EMAIL PROTECTED]> wrote:
> >> >
> >> >>
> >> >> I am baffled by the results of the following queries. Can it be
> >> >> something to do with the boosting factor? All of these queries are
> >> >> performed in the same environment with the same crawled index/data.
> >> >>
> >> >> A. query1 = +(content:(Pepsi))                    resulted in 228 hits.
> >> >> B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 )  resulted in 398 hits.
> >> >> C. query3 = +(host:(ca)^10 )                      resulted in 212 hits.
> >> >>
> >> >> Two questions (strictly just one):
> >> >> 1. query1 (any content that contains Pepsi) yielded 228 hits, so how
> >> >> could the more restrictive query2 (give me all docs that contain Pepsi
> >> >> and have a domain of ca) yield more hits (398)?
> >> >> 2. Since there are only 212 hits for Canadian domains, how can query2
> >> >> return 398 hits?
> >> >>
> >> >> Thanks for any pointers!
> >> >> Cheers,
> >> >> student_t
> >> >
> >> >
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19725730.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
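
The analyzer examples quoted above can be sketched in plain Java. This is
only an illustrative sketch and does not use Lucene's Analyzer API: the class
ToyAnalyzers, its method names, and the small stopword list are hypothetical,
made up for this example. It tokenizes the sample sentence three ways and
shows why indexing with one tokenization and searching with another can miss
a term such as 'Line'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy "analyzers"; NOT Lucene classes, just plain string handling.
final class ToyAnalyzers {
    // Assumed stopword list for this sketch only ("s" covers the leftover of "erick's").
    private static final Set<String> STOPWORDS = Set.of("this", "is", "the", "in", "s");

    // "Did nothing": the entire input becomes a single token.
    static List<String> keyword(String text) {
        return List.of(text);
    }

    // Lowercases and splits on whitespace.
    static List<String> whitespaceLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Lowercases, treats digits and punctuation as separators, drops stopwords.
    static List<String> lettersOnlyNoStopwords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // A query "hits" only if every analyzed query token is among the indexed tokens.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        return indexedTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        String doc = "this is the 2nd Line in Erick's program";

        System.out.println(lettersOnlyNoStopwords(doc)); // [nd, line, erick, program]
        System.out.println(whitespaceLowercase(doc));
        System.out.println(keyword(doc));

        // Same analyzer on both sides: 'Line' is found.
        System.out.println(matches(whitespaceLowercase(doc), whitespaceLowercase("Line")));
        // Indexed as one big token, searched as 'line': no hit.
        System.out.println(matches(keyword(doc), whitespaceLowercase("Line")));
        // Indexed lowercased, searched without lowercasing: no hit either.
        System.out.println(matches(whitespaceLowercase(doc), keyword("Line")));
    }
}
```

The last two calls are the mismatched cases: the indexed tokens and the query
tokens were produced by different rules, so a term that is plainly in the
text is never found.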
