Re: Duplicate content filter..

Lawrence D'Oliveiro Thu, 04 Oct 2007 21:51:38 -0700

In message <[EMAIL PROTECTED]>, Abandoned
wrote:

> I want an idea for how I can find duplicate pages quickly.


Compute a hash of a canonicalized version of the content? Disregard
whitespace, line wrapping, upper/lower case, and possibly even punctuation,
so that you get the same hash in spite of these differences.
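
A minimal sketch of that idea in Python, under some assumptions: the
function names (canonicalize, content_hash, find_duplicates) are made up
for illustration, SHA-256 is just one reasonable choice of hash, and
whitespace is collapsed to single spaces rather than removed outright.
Pages are assumed to be already available as strings.

```python
import hashlib
import string

# Translation table that deletes ASCII punctuation (an assumption: other
# scripts or Unicode punctuation would need a broader rule).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def canonicalize(text, strip_punctuation=True):
    """Reduce text to a form that ignores case, whitespace and punctuation."""
    text = text.lower()
    if strip_punctuation:
        text = text.translate(_PUNCT_TABLE)
    # split() with no arguments breaks on any run of spaces, tabs or
    # newlines, so line wrapping and extra spacing disappear here.
    return " ".join(text.split())


def content_hash(text, strip_punctuation=True):
    """Hash the canonical form; equal hashes mean equal canonical text."""
    canonical = canonicalize(text, strip_punctuation)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group page names by content hash.

    pages is a dict mapping a page name to its text. Returns a list of
    lists, one per group of two or more pages with the same hash.
    """
    groups = {}
    for name, text in pages.items():
        groups.setdefault(content_hash(text), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    pages = {
        "a.html": "Hello,   World!\n",
        "b.html": "hello\nworld",
        "c.html": "Something else entirely",
    }
    print(find_duplicates(pages))
```

This is one pass over the pages with a dictionary lookup per page, so it
stays fast for large collections. It only catches pages that are identical
after canonicalization; pages that differ by even one word get different
hashes.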
-- 
http://mail.python.org/mailman/listinfo/python-list
