Hi Gopalakrishnan,

You can use the following algorithm to cluster the news items:
1. Tokenize the headline.
2. Remove the stop words from the headline, i.e., words like "a", "an", "is", "the", etc.
3. Generate shingles from the remaining words. For example, 4-shingles for "watches" (here, the substrings of 2 to 4 characters starting at each position) are:
   ['wa', 'wat', 'watc']
   ['at', 'atc', 'atch']
   ['tc', 'tch', 'tche']
   ['ch', 'che', 'ches']
   ['he', 'hes']
   ['es']
4. Calculate the Jaccard similarity between each pair of headlines. This results in an n*n matrix for n news headlines.
   Jaccard similarity = (number of common shingles between HEAD_a and HEAD_b) / (number of unique shingles in HEAD_a and HEAD_b combined)
5. Cluster the headlines, constrained by a parameterized MIN_SIMILARITY_THRESHOLD.

Regards,
Devjyoti

_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers
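
P.S. Here is a minimal Python sketch of the five steps above, in case it helps. Treat it as one possible reading, not a definitive implementation. The function names, the tiny STOP_WORDS list, and the 0.5 value for MIN_SIMILARITY_THRESHOLD are all placeholders of mine. The steps do not say how to form the clusters, so the sketch assumes a simple single-link grouping: any two headlines whose similarity reaches the threshold end up in the same cluster.

```python
import re
from itertools import combinations

# Assumption: a tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "is", "the", "in", "of", "to", "for", "and", "on"}
MIN_SIMILARITY_THRESHOLD = 0.5  # assumed value; tune it for your data


def tokenize(headline):
    """Step 1: lowercase the headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", headline.lower())


def word_shingles(word, k=4):
    """Step 3: substrings of length 2..k starting at each position of a word."""
    return {
        word[i:i + n]
        for i in range(len(word))
        for n in range(2, k + 1)
        if i + n <= len(word)
    }


def headline_shingles(headline, k=4):
    """Steps 1-3: tokenize, drop stop words, collect shingles of the rest."""
    shingles = set()
    for token in tokenize(headline):
        if token not in STOP_WORDS:
            shingles |= word_shingles(token, k)
    return shingles


def jaccard(a, b):
    """Step 4: common shingles divided by unique shingles in both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(headlines, k=4):
    """Step 4: the n*n matrix of pairwise Jaccard similarities."""
    sets = [headline_shingles(h, k) for h in headlines]
    return [[jaccard(a, b) for b in sets] for a in sets]


def cluster(headlines, threshold=MIN_SIMILARITY_THRESHOLD, k=4):
    """Step 5 (assumed strategy): single-link grouping via union-find."""
    matrix = similarity_matrix(headlines, k)
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(headlines)), 2):
        if matrix[i][j] >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, headline in enumerate(headlines):
        groups.setdefault(find(i), []).append(headline)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["The watch is stolen", "A watch stolen", "Python release announced"]
    for group in cluster(sample):
        print(group)
```

Single-link grouping can chain loosely related headlines together through intermediate ones, so if the clusters come out too large, raise the threshold or switch to a stricter clustering method.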