Re: Weird behaviour

prashant ullegaddi Sun, 02 Aug 2009 10:14:44 -0700

Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and there is
a document relevant to this query as well.
The following query and its results proves it:


Enter query:
Searching for: +title:"rahul dravid" +url:wiki
4 total matching documents
   trec-id: clueweb09-enwp02-13-14368, URL:
http://en.wikipedia.org/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-83-11378, URL:
http://en.wikipedia.org/wiki/Rahul_S_Dravid
   trec-id: clueweb09-en0011-08-22737, URL:
http://www.reference.com/browse/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-69-13556, URL:
http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
Press (q)uit or enter number to jump to a page.

But see following query:

Enter query:
+title:"rahul dravid" +url:"wikipedia"
Searching for: +title:"rahul dravid" +url:wikipedia
0 total matching documents
Press (q)uit or enter number to jump to a page.

Isn't it weird?

-- Prashant.

On Sun, Aug 2, 2009 at 9:13 PM, Shai Erera <ser...@gmail.com> wrote:

> How do you parse/convert the page to a Document object? Are you sure the
> title "Rahul Dravid" is extracted properly and put in the "title" field?
>
> You can read about Luke here: http://www.getopt.org/luke/.
>
> Can you do System.out.println(document.toString()) before you add it to the
> index, and paste the output here?
>
> Shai
>
> On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi <
> prashullega...@gmail.com
> > wrote:
>
> > Firstly, I'm indexing the string in url field only.
> >
> > I've never used Luke, I don't know how to use.
> >
> > What I'm trying to do is search for those documents which are from
> > some particular site, and have a given title.
> >
> >
> > On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera <ser...@gmail.com> wrote:
> >
> > > You write that you index the string under the "url" field. Do you also
> > > index
> > > it under "title"? If not, that can explain why title:"Rahul Dravid"
> does
> > > not
> > > work for you.
> > >
> > > Also, did you try to look at the index w/ Luke? It will show you what
> are
> > > the terms in the index.
> > >
> > > Another thing which is always good to debug such things is to create a
> > > StandardAnalyzer, then request a tokenStream() from it, passing a
> > > StringReader w/ the text you want to parse. Then just print the tokens
> > > returned.
> > >
> > > I've done that, using the version from trunk, w/ Version.2_4, and the
> > > tokens
> > > that are extracted are:
> > > (http,0,4,type=<ALPHANUM>)
> > > (en.wikipedia.org,7,23,type=<HOST>)
> > > (wiki,24,28,type=<ALPHANUM>)
> > > (rahul,29,34,type=<ALPHANUM>)
> > > (dravid,35,41,type=<ALPHANUM>)
> > >
> > > So:
> > > 1) You don't get results for title:"Rahul Dravid" since you index it
> > under
> > > "url" and not "title".
> > > 2) url:"wiki/Rahul_Dravid" works, since it looks for a phrase that
> exists
> > > in
> > > the index (look at the last 3 tokens produced by the Analyzer, in the
> > > output
> > > above).
> > > 3) ur:"<entire string" also works, since you index all of it under the
> > > "url"
> > > field.
> > >
> > > Does this explain the behavior you see?
> > >
> > > Shai
> > >
> > > On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi <
> > > prashullega...@gmail.com
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I've indexed some 50million documents. I've indexed the target URL of
> > > each
> > > > document as "url" field by using
> > > > StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia
> > page
> > > > with title:"Rahul Dravid" and
> > > > url: http://en.wikipedia.org/wiki/Rahul_Dravid.
> > > >
> > > > But when I search for +title:"Rahul Dravid" +url:"Wikipedia", I'm
> > getting
> > > > no
> > > > results. I get the document(s) when
> > > > I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url:"
> > > > en.wikipedia.org/wiki/Rahul_Dravid". I get
> > > > results even when I search for url:"wiki/Rahul_Dravid".
> > > >
> > > > It'd be helpful if somebody can throw some light on this.
> > > >
> > > > -- Prashant.
> > > >
> > >
> >
>

Re: Weird behaviour

Reply via email to