Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Thank you Phil and Shai. I will write a different Analyzer. On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera wrote: > You can always create your own Analyzer which creates a TokenStream just > like StandardAnalyzer, but instead of using StandardFilter, write another > TokenFilter which receives the

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract "en", "wikipedia" and "org"). You can also return
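
For illustration, the splitting step described above could be sketched as follows. This is plain Java and deliberately does not use the Lucene API: the class name HostSplitter and the method splitHost are hypothetical, and the logic would still need to be wrapped in a custom TokenFilter to take part in analysis. It only shows how a host such as "en.wikipedia.org" might be broken into its components.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits a host name into its parts.
public class HostSplitter {

    // Returns the dot-separated components of a host, lowercased,
    // skipping empty pieces such as those produced by a leading dot.
    public static List<String> splitHost(String host) {
        List<String> parts = new ArrayList<>();
        for (String part : host.split("\\.")) {
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }
}
```

Calling HostSplitter.splitHost("en.wikipedia.org") returns a list holding "en", "wikipedia" and "org", which a filter could then emit as separate terms.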

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant. On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan wrote: > Hi Prashant, > > I agree with Shai, that using Luke and printing out what the Document > looks like before it

Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant, I agree with Shai that using Luke and printing out what the Document looks like before it goes into the index are going to be your best bets for debugging this problem. The problem you're having is that StandardAnalyzer does not break up the hostname into separate terms, as it has a

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and there is a document relevant to this query as well. The following query and its results prove it: Enter query: Searching for: +title:"rahul dravid" +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: h

Re: Weird behaviour

2009-08-02 Thread Shai Erera
How do you parse/convert the page to a Document object? Are you sure the title "Rahul Dravid" is extracted properly and put in the "title" field? You can read about Luke here: http://www.getopt.org/luke/. Can you do System.out.println(document.toString()) before you add it to the index, and paste

Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Firstly, I'm indexing the string in the url field only. I've never used Luke, and I don't know how to use it. What I'm trying to do is search for those documents which are from some particular site, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera wrote: > You write that you index the

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You write that you index the string under the "url" field. Do you also index it under "title"? If not, that can explain why title:"Rahul Dravid" does not work for you. Also, did you try to look at the index with Luke? It will show you which terms are in the index. Another thing which is always g