In message <[EMAIL PROTECTED]>, Abandoned wrote:

> I want to a idea for how can i find duplicate pages quickly and fast ?
Compute a hash based on a canonicalized version of the content? Disregard whitespace, line wrapping, upper/lower case, and possibly even punctuation, so that two pages differing only in those respects get the same hash.
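A minimal sketch of that idea in Python, in case it helps. The function names (canonicalize, content_hash, find_duplicates) and the exact canonicalization rules are my own assumptions for illustration; adjust them to whatever "duplicate" means for your pages. SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import re
from collections import defaultdict


def canonicalize(text):
    """Reduce text to a form that ignores case, punctuation and spacing."""
    text = text.lower()
    # Drop everything that is not a word character or whitespace.
    # Note: the underscore counts as a word character and is kept.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse all runs of whitespace (spaces, tabs, newlines) to one space.
    return " ".join(text.split())


def content_hash(text):
    """Hash the canonical form, so equivalent pages share one digest."""
    return hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages: dict mapping a page name to its text (hypothetical input shape).

    Returns a list of name lists, one per group of pages with equal hashes.
    """
    groups = defaultdict(list)
    for name, text in pages.items():
        groups[content_hash(text)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Each page is hashed once and grouped through a dict, so the work grows linearly with the number of pages rather than comparing every pair. This only catches pages that are identical after canonicalization; pages that are merely similar would need a different technique.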