On Jun 20, 2005, at 9:38 AM, [EMAIL PROTECTED] wrote:

Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:

Filters reduce the search space to a subset of the documents in the
index.  Which document would you want returned when there are
multiple documents in the index with the same MD5?  Or do you want to
cluster them by MD5?


i think cluster by md5 is more appropriate.


Do you want to cluster them by MD5 perhaps, but still return multiple
documents back from a search?


i want to return just the 1st image (the more relevant one). no use to
show duplicates in an image search app.

Now you've just said the same conflicting thing a different way. You want to cluster but only return one. :)

If you only want one image returned, then it seems that only indexing the same image once is the way to go. When you find a duplicate MD5, don't index that as a second document. You will, instead, update the document by adding additional ALT text and perhaps the additional URL.

Is there a reason why indexing each unique image (by MD5) is not a good way to go in your case?

in sql this would be:
select distinct md5, url, alt from table group by md5 order by score asc;

This would give you multiple records for the same MD5. You said above you only want one per MD5.

    Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to