Hi Gopalakrishnan,

You can use the following algorithm to cluster the news items:

1. Tokenize the headline.
2. Remove the stop words from the headlines, i.e., words like "a",
"an", "is", "the", etc.
3. Generate shingles from the remaining words.
    e.g., the character shingles of up to 4 letters for "watches" are:
    ['wa', 'wat', 'watc']
    ['at', 'atc', 'atch']
    ['tc', 'tch', 'tche']
    ['ch', 'che', 'ches']
    ['he', 'hes']
    ['es']

4. Calculate the Jaccard similarity between each pair of headlines. This
will result in an "n*n" matrix for "n" news headlines.
    Jaccard similarity = (number of common shingles between HEAD_a and
HEAD_b) / (number of unique shingles in HEAD_a and HEAD_b combined)

5. Cluster the headlines, grouping together those whose similarity is
at least a configurable MIN_SIMILARITY_THRESHOLD.
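
Here is a rough Python sketch of the five steps, in case it helps. Treat
it as one possible reading rather than a finished implementation: the
stop word list, the 0.5 threshold value, the function names, and the
choice of single-link grouping in step 5 (two headlines share a cluster
if a chain of pairs above the threshold connects them) are all my
assumptions for illustration.

```python
import re
from itertools import combinations

# Assumed, deliberately tiny stop word list; use a fuller one in practice.
STOP_WORDS = {"a", "an", "is", "the", "of", "in", "to", "and"}
# Assumed value; tune it on your own headlines.
MIN_SIMILARITY_THRESHOLD = 0.5


def tokenize(headline):
    """Step 1: lowercase the headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", headline.lower())


def shingles(word, min_len=2, max_len=4):
    """Step 3: substrings of min_len..max_len characters from every start position."""
    result = set()
    for start in range(len(word)):
        for size in range(min_len, max_len + 1):
            if start + size <= len(word):
                result.add(word[start:start + size])
    return result


def headline_shingles(headline):
    """Steps 1-3: tokenize, drop stop words, and pool the shingles of the rest."""
    pooled = set()
    for token in tokenize(headline):
        if token not in STOP_WORDS:
            pooled |= shingles(token)
    return pooled


def jaccard(a, b):
    """Step 4: |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(headlines):
    """Step 4: the n*n matrix of pairwise Jaccard similarities."""
    sets = [headline_shingles(h) for h in headlines]
    return [[jaccard(x, y) for y in sets] for x in sets]


def cluster(headlines, threshold=MIN_SIMILARITY_THRESHOLD):
    """Step 5 (assumed single-link): merge pairs at or above the threshold."""
    sim = similarity_matrix(headlines)
    parent = list(range(len(headlines)))

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(headlines)), 2):
        if sim[i][j] >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    # Each cluster is a list of headline indices.
    return sorted(groups.values())
```

Calling cluster() on the three made-up headlines "Apple launches new
watches", "New watches launched by Apple" and "Monsoon rains flood the
city" groups the first two together and leaves the third on its own.
The n*n matrix is fine for a few thousand headlines; beyond that you
would probably want something like MinHash to avoid comparing every pair.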

Regards,
Devjyoti
_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers
