Re: When to use HitCollector?

Michael J. Prichard Sun, 07 Jan 2007 15:19:14 -0800

Hey Erick (and List).

Yeah you have it pretty correct. I am totally kicking myself in the assbecause I wrote all the indexing stuff! Ugh...one of our indexes isabout 8GB so it will take a while to rebuild. I figure I can put atemporary fix out and then refactor the index with the right info andmake it more efficient.

I basically wrote the logic to search email with a FilteredQuery so itcan remove what I don't want and then a seperate document query. Thisis all in a BooleanQuery. I then wrote my own code to parse throughthe hits. It is not the best thing in the world but it seems to work ok.


I appreciate your help!

Thanks,
-Michael

Erick Erickson wrote:

I'm a little fuzzy on the structure of your index, but here's a stab....

First, let me see if I understand your problem...
For an e-mail, you have a body and an attachment that are indexed as
separate lucene documents.
For the body, you include from, to, cc (in other words, meta-data)
For the attachment, you do NOT include the meta-data.

For both the body and the attachment, you have an ID for the parente-mailthat is the same for a body and attachment if they are from the samee-mail

(otherwise I don't see how you "determine the email to display" for an
attachment).

You've got a couple of problems here. Anything you do to break up the

clauses into separate queries will do bad things for your relevancescoring.That is, one query on the body and one on the attachment will give youtwo

lists that you'll then have to manually reconcile if relevancy matters.

Depending upon how many emails and attachments you get hits for, youcould

do something like

1> search for the body elements with the to/from/cc. Use the return(perhaps

with a HitCollector (definitely NOT a Hits object)) to assemble a clause
like ID=52343 or ID=985 or ID=8910 .... and the re-submit some query like

"text contains "search data" and (ID=52343 or ID=985 or ID=8910 ....)"

BEWARE that, depending upon how many e-mails you get, you'll run afoul of

TooManyClauses exceptions. The default is 1,024 but you can make it asbig

as memory/time allows. And, as you say, this is temporary until you
reconstruct your index.

If this is totally irrelevant, perhaps you could add some more detail....

Best
Erick


On 1/7/07, Michael J. Prichard <[EMAIL PROTECTED]> wrote:


I have an index which has email and their attachments indexed.  This is
ok but the issue I am having it when I am trying to filter the
searches.  For example I can search the content of the email and the
document (i.e. the attachment) and return the right results.  Basically,
if it is a document I check the DB to see its parent and determine the
email to display.  The problem comes in when I try to use to, from
and/or cc in my searches.  It will only return emails since we did not
index those fields along with the attachments.  Ideally we would reindex
and add those but I need a temporary fix until we can do that.  SO...I
tried a few various things including a basic search and then filtering
on my own but that seriously slowed our interface since I had to check
each result, etc.  SO...I broke the query into two...search the docs and
emails seperately and only check the documents on return.  That is ok.

I was wondering...would HitCollector be something i should use.
Basically have the searcher check documents to make sure they are ok to
go (i.e. to, from. etc is correct)?

Make sense?

Thanks!
Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: When to use HitCollector?

Reply via email to