First, apologies for what seems to be something of an FAQ.
However, I've not been able to find an answer either in LIA or in the
relevant section of the FAQ
(http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
My setup is as follows: I have an index of a few hundred thousand web
pages. I'd like to be able to construct queries that search for some
arbitrary text within a specified URL prefix, much like Google's syntax:
searchterm +site:www.foo.com/some/section
So, I have the page title & content indexed, and the URL stored as a
keyword field, and I imagined that I'd be able to construct a query
something like this:
    String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
    Query searchTextQuery = MultiFieldQueryParser.parse(
        request.getSearchQuery(), fields, analyzer);
    PrefixQuery urlPrefix = new PrefixQuery(
        new Term(DocumentFields.URL, request.getUrlPrefix()));
    hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
However, as soon as the PrefixQuery matches more than a thousand or so
documents, I get a BooleanQuery.TooManyClauses exception, as you might
expect.
AFAICS the solutions suggested in the FAQ don't seem to apply here:
I'm already using a Filter, and that's not helping (pace suggestion
1), I don't think I can reduce the number of terms in the index, else
my URLs wouldn't be unique any more, and increasing the number of
clauses seems like a poor choice from a scalability point of view - I
anticipate queries that could filter perhaps a hundred thousand
documents or so.
I'm guessing that it might be possible to do something smart by
splitting the URL up into multiple fields - for example, one for the
host and one for the path, or even one for the host and one for host
+path together - but I'm not clear on exactly how I'd use the two
fields, and how they'd help. Can someone enlighten me?
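
To make the idea concrete, here is a rough sketch of the kind of splitting
I have in mind, in plain Java with no Lucene calls. The names UrlPrefixes
and prefixesOf are made up for illustration, and I'm assuming URLs with no
scheme and no query string, as in the example above. My guess is that each
returned prefix could be indexed as an untokenized term in an extra field,
so that the site restriction becomes one exact term match instead of a
prefix expansion, but that is only an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

public class UrlPrefixes {

    /**
     * Returns the host followed by every cumulative path prefix of a
     * scheme-less URL, ending with the full URL itself.
     */
    public static List<String> prefixesOf(String url) {
        List<String> prefixes = new ArrayList<>();
        // Drop one trailing slash so "www.foo.com/" and "www.foo.com" agree.
        String trimmed = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        // Emit everything before each slash: host, host/a, host/a/b, ...
        int slash = trimmed.indexOf('/');
        while (slash != -1) {
            prefixes.add(trimmed.substring(0, slash));
            slash = trimmed.indexOf('/', slash + 1);
        }
        // Finally the complete URL, unless the input was empty.
        if (!trimmed.isEmpty()) {
            prefixes.add(trimmed);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        for (String prefix : prefixesOf("www.foo.com/some/section/page.html")) {
            System.out.println(prefix);
        }
    }
}
```

The cost would be a few extra terms per document, one per path segment,
in exchange for never expanding a prefix at query time.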
Thanks in advance
Chris