Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Chris May Wed, 27 Jul 2005 13:56:19 -0700

Always domain + part of a path e.g.

url:http://blogs.warwick.ac.uk/chrismay/*

or

url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*


or

url:http://www2.warwick.ac.uk/services/its/*

... and so on. Part of the problem is that we may need to go an arbitrary number of levels down the path to get an acceptably small set of documents to start from - we couldn't impose a rule that said something like 'specify the first 2 directories on the path' (cf. my second example). We wouldn't need to query for the same path over different domains though (e.g. url:*.warwick.ac.uk/about/* )
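
Because every prefix starts at the domain and ends on a path boundary, one possible approach is to expand each URL into all of its directory-level prefixes at index time and store each as an untokenized term in a separate field. A search restricted to a path prefix then becomes an exact match on one term, so nothing is expanded into one clause per matching URL. The sketch below shows only the expansion step in plain Java. The class name UrlPrefixes, the method expand, and the field name "urlPrefix" are assumptions made for illustration; none of them come from Lucene or from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class UrlPrefixes {

    // Returns the host-level prefix followed by every directory-level
    // prefix of the URL, each ending in '/'. The final path segment
    // (the page itself) is not included unless the URL ends in '/'.
    static List<String> expand(String url) {
        // Drop any query string or fragment so that a '/' inside it
        // is not mistaken for a path boundary.
        int cut = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c);
            if (i >= 0 && i < cut) {
                cut = i;
            }
        }
        String base = url.substring(0, cut);

        // Skip past "scheme://" so its slashes are not treated as boundaries.
        int schemeEnd = base.indexOf("://");
        int from = (schemeEnd < 0) ? 0 : schemeEnd + 3;

        List<String> prefixes = new ArrayList<>();
        int slash = base.indexOf('/', from);
        if (slash < 0) {
            // Host only, no path: the single prefix is the host plus '/'.
            prefixes.add(base + "/");
            return prefixes;
        }
        while (slash >= 0) {
            prefixes.add(base.substring(0, slash + 1));
            slash = base.indexOf('/', slash + 1);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Each printed value would be added to the document as one
        // untokenized term in the assumed "urlPrefix" field.
        for (String p : expand("http://www2.warwick.ac.uk/services/its/index.html")) {
            System.out.println(p);
        }
    }
}
```

At search time the requested prefix (normalised to end in '/') would be looked up with a single TermQuery on the assumed "urlPrefix" field, either combined with the text query or wrapped in a filter. The trade-off is a larger index: each page contributes one extra term per path level.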


thanks

Chris




On 27 Jul 2005, at 21:33, Erik Hatcher wrote:

Could you give some examples of the types of PrefixQueries you'd like to use? Is it always at a granularity of domain and path? Or are you wanting to do a prefix on pieces of the domain and path?
    Erik

On Jul 27, 2005, at 3:47 PM, Chris May wrote:
First, apologies for what seems to be something of an FAQ.
However, I've not been able to find an answer either in LIA or in the relevant section of the FAQ (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831)
My setup is as follows: I have an index of a few hundred thousand web pages. I'd like to be able to construct queries that search for some arbitrary text within a specified URL. Kind of like Google's syntax
searchterm +site:www.foo.com/some/section
So, I have the page title & content indexed, and the URL stored as a keyword field, and I imagined that I'd be able to construct a query something like this:
    String[] fields = new String[]{DocumentFields.TITLE, DocumentFields.CONTENT};
    Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
    PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
    hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
However, as soon as the set of documents returned by the PrefixQuery is more than a thousand or so, I get a TooManyClauses exception, as you might expect.
AFAICS the solutions suggested in the FAQ don't seem to apply here: I'm already using a Filter, and that's not helping (pace suggestion 1); I don't think I can reduce the number of terms in the index, else my URLs wouldn't be unique any more; and increasing the number of clauses seems like a poor choice from a scalability point of view - I anticipate queries that could filter perhaps a hundred thousand documents or so.
I'm guessing that it might be possible to do something smart by splitting the URL up into multiple fields - for example, one for the host and one for the path, or even one for the host and one for host+path together - but I'm not clear on exactly how I'd use the two fields, and how they'd help. Can someone enlighten me?
Thanks in advance

Chris





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]