Could you give some examples of the kinds of PrefixQuery you'd like
to use? Is it always at the granularity of domain and path? Or do
you want to match on prefixes of pieces of the domain and path?
Erik
On Jul 27, 2005, at 3:47 PM, Chris May wrote:
First, apologies for what seems to be something of an FAQ.
However, I've not been able to find an answer either in LIA or in
the relevant section of the FAQ
(http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
My setup is as follows: I have an index of a few hundred thousand
web pages. I'd like to be able to construct queries that search
for some arbitrary text within a specified URL, kind of like
Google's syntax:
searchterm +site:www.foo.com/some/section
So, I have the page title and content indexed, and the URL stored as
a keyword field, and I imagined that I'd be able to construct a
query something like this:
    String[] fields = new String[] {
        DocumentFields.TITLE, DocumentFields.CONTENT
    };
    Query searchTextQuery = MultiFieldQueryParser.parse(
        request.getSearchQuery(), fields, analyzer);
    PrefixQuery urlPrefix = new PrefixQuery(
        new Term(DocumentFields.URL, request.getUrlPrefix()));
    hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
However, as soon as the set of documents returned by the
prefix query is more than a thousand or so, I get a
BooleanQuery.TooManyClauses exception, as you might expect.
AFAICS the solutions suggested in the FAQ don't seem to apply here:
I'm already using a Filter, and that's not helping (pace suggestion
1), I don't think I can reduce the number of terms in the index,
else my URLs wouldn't be unique any more, and increasing the number
of clauses seems like a poor choice from a scalability point of
view - I anticipate queries that could filter perhaps a hundred
thousand documents or so.
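
A filter does not have to go through a query at all, which is one way to
sidestep the clause limit. The sketch below illustrates the idea in plain
Java rather than against the Lucene API: walk the sorted terms starting at
the prefix and set one bit per matching document. The TreeMap is only a
stand-in for the index's sorted term dictionary, and the names PrefixBits
and docsWithPrefix are hypothetical. In Lucene itself the same walk would
presumably use IndexReader.terms(Term) and IndexReader.termDocs(Term)
inside a custom Filter.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class PrefixBits {
    /**
     * Collects the documents of every term that starts with the prefix.
     * termToDocs is a stand-in for a sorted term dictionary (assumption).
     */
    static BitSet docsWithPrefix(SortedMap<String, int[]> termToDocs,
                                 String prefix, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // tailMap starts at the first term >= prefix; terms sharing the
        // prefix are contiguous in sorted order, so stop at the first miss.
        for (Map.Entry<String, int[]> e : termToDocs.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            for (int doc : e.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> terms = new TreeMap<String, int[]>();
        terms.put("www.bar.com/x", new int[] {0});
        terms.put("www.foo.com/some/a", new int[] {1});
        terms.put("www.foo.com/some/b", new int[] {2, 3});
        terms.put("www.foo.com/zzz", new int[] {4});
        System.out.println(docsWithPrefix(terms, "www.foo.com/some", 5));
    }
}
```

The cost is one pass over the matching terms, with no BooleanQuery built
along the way, so the number of matching URLs is bounded only by the size
of the bit set.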
I'm guessing that it might be possible to do something smart by
splitting the URL up into multiple fields - for example, one for
the host and one for the path, or even one for the host and one for
host+path together - but I'm not clear on exactly how I'd use the
two fields, and how they'd help. Can someone enlighten me?
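
One possible reading of the "split the URL up" idea, as a minimal sketch:
at index time, expand each URL into the host plus every path prefix, and
index each of those as its own untokenized term in an extra field. A
"site:" restriction then becomes a single exact term lookup instead of a
prefix expansion, provided queries are always at domain/path-segment
granularity. The class UrlPrefixes and the method prefixesOf are
hypothetical names, and the code deliberately ignores query strings and
repeated slashes.

```java
import java.util.ArrayList;
import java.util.List;

class UrlPrefixes {
    /** Returns host, host/seg1, host/seg1/seg2, ... up to the full URL. */
    static List<String> prefixesOf(String url) {
        List<String> prefixes = new ArrayList<String>();
        // Drop a scheme such as "http://" if one is present.
        int scheme = url.indexOf("://");
        String rest = scheme >= 0 ? url.substring(scheme + 3) : url;
        // Ignore trailing slashes so "host/a/" and "host/a" agree.
        while (rest.endsWith("/")) {
            rest = rest.substring(0, rest.length() - 1);
        }
        if (rest.isEmpty()) {
            return prefixes;
        }
        // Each '/' closes one prefix; the full string is the last one.
        for (int slash = rest.indexOf('/'); slash >= 0;
                 slash = rest.indexOf('/', slash + 1)) {
            prefixes.add(rest.substring(0, slash));
        }
        prefixes.add(rest);
        return prefixes;
    }

    public static void main(String[] args) {
        // prints [www.foo.com, www.foo.com/some, www.foo.com/some/section]
        System.out.println(prefixesOf("www.foo.com/some/section"));
    }
}
```

Each returned string would be added to the document as a separate value
of, say, a "urlPrefix" field (an assumed name), and the filter would wrap
a TermQuery on that field rather than a PrefixQuery on the URL field. The
trade-off is a larger index in exchange for a query that never expands
into more than one clause.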
Thanks in advance
Chris
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]