Hi Gopalakrishnan,
You can use the following algorithm to cluster the news items:
1. Tokenize the headline.
2. Remove the stop words from the headlines, i.e., words like "a",
"an", "is", "the", etc.
3. Generate shingles from the remaining words.
e.g., the shingles of 2 to 4 characters for "watches" are the following
(one row per starting position):
['wa', 'wat', 'watc']
['at', 'atc', 'atch']
['tc', 'tch', 'tche']
['ch', 'che', 'ches']
['he', 'hes']
['es']
4. Calculate the Jaccard similarity between each pair of headlines. This
will result in an "n*n" matrix for "n" news headlines.
Jaccard similarity = (number of common shingles between HEAD_a and
HEAD_b) / (number of unique shingles in HEAD_a and HEAD_b combined)
5. Cluster the headlines, putting two headlines in the same cluster only
when their similarity is at least a configurable
MIN_SIMILARITY_THRESHOLD.
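Here is a rough Python sketch of the five steps above, in case it helps.
Several things in it are my assumptions rather than part of the algorithm:
the small STOP_WORDS set, the regex tokenizer, the function names, the
default threshold of 0.5, and the clustering rule itself (headlines are
merged into connected groups with union-find, i.e. single linkage). Treat
it as a starting point, not a finished implementation.

```python
import re
from itertools import combinations

# Illustrative stop-word list (an assumption); use a fuller list in practice.
STOP_WORDS = {"a", "an", "is", "the", "of", "in", "on", "to", "for", "and"}


def tokenize(headline):
    """Lowercase the headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", headline.lower())


def shingles(word, min_len=2, max_len=4):
    """Return the set of character substrings of length min_len..max_len."""
    return {
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_len, max_len + 1)
        if i + n <= len(word)
    }


def headline_shingles(headline):
    """Tokenize, drop stop words, and pool the shingles of remaining words."""
    result = set()
    for token in tokenize(headline):
        if token not in STOP_WORDS:
            result |= shingles(token)
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(headlines, min_similarity=0.5):
    """Group headlines whose pairwise similarity reaches the threshold.

    Uses union-find, so clusters are connected components (single linkage).
    """
    sets = [headline_shingles(h) for h in headlines]
    parent = list(range(len(headlines)))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Equivalent to scanning the upper triangle of the n*n similarity matrix.
    for i, j in combinations(range(len(headlines)), 2):
        if jaccard(sets[i], sets[j]) >= min_similarity:
            parent[find(j)] = find(i)

    groups = {}
    for i, headline in enumerate(headlines):
        groups.setdefault(find(i), []).append(headline)
    return list(groups.values())
```

Calling cluster(list_of_headlines, min_similarity=0.6) returns a list of
lists of headlines. With single linkage, a chain of pairwise-similar
headlines ends up in one cluster even if its two ends are dissimilar, so
you may want a stricter rule if the clusters come out too large.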
Regards,
Devjyoti
_______________________________________________
BangPypers mailing list
[email protected]
http://mail.python.org/mailman/listinfo/bangpypers