Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it:
Enter query: Searching for: +title:"rahul dravid" +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:"rahul dravid" +url:"wikipedia" Searching for: +title:"rahul dravid" +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. On Sun, Aug 2, 2009 at 9:13 PM, Shai Erera <ser...@gmail.com> wrote: > How do you parse/convert the page to a Document object? Are you sure the > title "Rahul Dravid" is extracted properly and put in the "title" field? > > You can read about Luke here: http://www.getopt.org/luke/. > > Can you do System.out.println(document.toString()) before you add it to the > index, and paste the output here? > > Shai > > On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi < > prashullega...@gmail.com > > wrote: > > > Firstly, I'm indexing the string in url field only. > > > > I've never used Luke, I don't know how to use. > > > > What I'm trying to do is search for those documents which are from > > some particular site, and have a given title. > > > > > > On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera <ser...@gmail.com> wrote: > > > > > You write that you index the string under the "url" field. Do you also > > > index > > > it under "title"? If not, that can explain why title:"Rahul Dravid" > does > > > not > > > work for you. > > > > > > Also, did you try to look at the index w/ Luke? It will show you what > are > > > the terms in the index. > > > > > > Another thing which is always good to debug such things is to create a > > > StandardAnalyzer, then request a tokenStream() from it, passing a > > > StringReader w/ the text you want to parse. Then just print the tokens > > > returned. > > > > > > I've done that, using the version from trunk, w/ Version.2_4, and the > > > tokens > > > that are extracted are: > > > (http,0,4,type=<ALPHANUM>) > > > (en.wikipedia.org,7,23,type=<HOST>) > > > (wiki,24,28,type=<ALPHANUM>) > > > (rahul,29,34,type=<ALPHANUM>) > > > (dravid,35,41,type=<ALPHANUM>) > > > > > > So: > > > 1) You don't get results for title:"Rahul Dravid" since you index it > > under > > > "url" and not "title". > > > 2) url:"wiki/Rahul_Dravid" works, since it looks for a phrase that > exists > > > in > > > the index (look at the last 3 tokens produced by the Analyzer, in the > > > output > > > above). > > > 3) ur:"<entire string" also works, since you index all of it under the > > > "url" > > > field. > > > > > > Does this explain the behavior you see? > > > > > > Shai > > > > > > On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi < > > > prashullega...@gmail.com > > > > wrote: > > > > > > > Hi, > > > > > > > > I've indexed some 50million documents. I've indexed the target URL of > > > each > > > > document as "url" field by using > > > > StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia > > page > > > > with title:"Rahul Dravid" and > > > > url: http://en.wikipedia.org/wiki/Rahul_Dravid. > > > > > > > > But when I search for +title:"Rahul Dravid" +url:"Wikipedia", I'm > > getting > > > > no > > > > results. I get the document(s) when > > > > I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url:" > > > > en.wikipedia.org/wiki/Rahul_Dravid". I get > > > > results even when I search for url:"wiki/Rahul_Dravid". > > > > > > > > It'd be helpful if somebody can throw some light on this. > > > > > > > > -- Prashant. > > > > > > > > > >