I think what they do at Google is just a heuristic -- as David Spencer mentioned, sub-URLs of a given page, identical snippets, or identical titles... My idea was more about providing a 'realistic overview' of the subjects in the results. You could pick, say, the first document from each cluster and show those to the user. Within each cluster, documents already have some mutual similarity (this could be calculated explicitly; the clustering algorithm doesn't compute it for all pairs of documents), but some have more and some have less. You could then hide nearly identical results from the user.
Anyway, I think the Google method is just a heuristic based on URLs, nothing fancier.
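For illustration, the kind of URL/snippet heuristic discussed above could be sketched roughly like this in Python. This is only a guess at the idea, not Google's actual method; the field names, the host-based grouping, and the exact keys are all my assumptions:

```python
from urllib.parse import urlsplit

def collapse_heuristic(results):
    """Rough sketch of a 'similar results' heuristic: collapse
    results that come from the same host (an approximation of
    'sub-URLs of a given page') or that repeat an already-shown
    title/snippet pair. Hidden results could sit behind a
    'more results from this site' link."""
    seen_hosts = set()
    seen_keys = set()
    shown, hidden = [], []
    for r in results:  # r: dict with 'url', 'title', 'snippet'
        host = urlsplit(r["url"]).netloc
        key = (r["title"], r["snippet"])
        if host in seen_hosts or key in seen_keys:
            hidden.append(r)
        else:
            seen_hosts.add(host)
            seen_keys.add(key)
            shown.append(r)
    return shown, hidden
```

A real implementation would presumably compare full URL prefixes rather than just hosts, and use fuzzy rather than exact snippet matching.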
D.
Miles Barr wrote:
Hi Dawid,
On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote:
I can imagine that if you apply clustering to search results anyway, the information about clusters can help you determine 'similar' results and reorder the output list.
That's an interesting idea. How easy is it to 'tighten' the clustering cones? Say we take a very narrow cone around each result; any other documents within that cone can be considered similar enough, and hence not displayed. Then we'd take the document closest to the centre of the cloud, make that the 'original' copy, and display it.
Or would that approach be too expensive to calculate for each search?
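The cone idea above could be sketched as follows. This is just a minimal illustration under my own assumptions: documents as term-weight dicts (e.g. tf-idf), a cone defined by a cosine-similarity threshold, a greedy single pass for grouping, and the group's centroid deciding which copy counts as the 'original':

```python
import math

def cosine(a, b):
    # a, b: term -> weight dicts (e.g. tf-idf vectors)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # component-wise average of the term-weight vectors
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in c.items()}

def dedupe(results, threshold=0.95):
    """Greedy near-duplicate suppression: any document falling
    inside the narrow cone cos(a, b) >= threshold around an
    earlier result joins its group; per group, only the document
    closest to the group centroid is kept (the 'original')."""
    groups = []
    for doc in results:
        for g in groups:
            if cosine(doc, g[0]) >= threshold:
                g.append(doc)
                break
        else:
            groups.append([doc])
    return [max(g, key=lambda d: cosine(d, centroid(g)))
            for g in groups]
```

On the cost question: the greedy pass compares each result against one representative per group, so for the first page of results (tens of documents) it is cheap; it only gets expensive if applied to the whole result set.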