Hi, I'm working on a search engine project and I have a problem with duplicate content. I can compute the percentage of similarity between two pages, but my index holds 5 million pages, so finding a single duplicate means comparing against the contents of all 5 million pages. :(
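To make the cost concrete, here is a rough sketch of the kind of linear scan described above. The `similarity` function is only a hypothetical stand-in (Jaccard similarity over word sets), not necessarily the measure actually used, and `find_duplicates`, `index`, and `threshold` are names made up for illustration:

```python
from typing import List


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in: Jaccard similarity of word sets, as a percentage."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 100.0  # two empty pages are treated as identical
    return 100.0 * len(words_a & words_b) / len(words_a | words_b)


def find_duplicates(new_page: str, index: List[str], threshold: float = 90.0) -> List[int]:
    """Compare new_page against every indexed page: one comparison per page."""
    return [i for i, page in enumerate(index)
            if similarity(new_page, page) >= threshold]


# Tiny in-memory "index"; the real one would hold about 5 million pages.
index = ["the quick brown fox", "hello world", "the quick brown fox jumps"]
matches = find_duplicates("the quick brown fox", index, threshold=80.0)
print(matches)
```

Every lookup touches every page in the index, so the cost of checking one page grows linearly with the index size. That is the part I'd like to avoid.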
I'd like some ideas on how to find duplicate pages quickly. Please help, and sorry for my bad English. Kind regards.
--
http://mail.python.org/mailman/listinfo/python-list