many repeated exact tweets, or very similar tweets, leading to long shared strings of 9, 10, or more consecutive words.

One approach that came to mind was the Gopher paper (https://arxiv.org/abs/2112.11446), where duplicate documents are removed if their 13-gram Jaccard similarity is over 0.8. (The 13-grams exclude spaces and punctuation.)
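A minimal sketch of that similarity check, assuming a simple word tokenization (the paper's exact tokenizer may differ):

```python
import re

def ngrams(text, n=13):
    # Lowercase and keep word characters only, so spaces and punctuation
    # are excluded from the n-gram units, as described above.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=13):
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def is_duplicate(a, b, n=13, threshold=0.8):
    # Flag a pair as duplicates when n-gram Jaccard similarity > threshold.
    return jaccard(a, b, n) > threshold
```

Doing this pairwise over a whole corpus is quadratic, so in practice you'd combine it with some blocking or hashing step.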

For tweets, if you are interested in overlaps of up to 10 words, you could extract the 11-grams and throw away any tweet that shares an identical 11-gram with an already-retained tweet.
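A sketch of that exact-11-gram filter over a stream of tweets; whitespace tokenization is my assumption here:

```python
def dedupe_by_ngram(tweets, n=11):
    """Keep a tweet only if none of its word n-grams was seen in an
    earlier retained tweet. Tweets shorter than n words have no n-grams
    and are always kept."""
    seen = set()
    kept = []
    for tweet in tweets:
        tokens = tweet.lower().split()
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        if grams & seen:
            continue  # shares an n-gram with a retained tweet: drop it
        seen |= grams
        kept.append(tweet)
    return kept
```

Note the result is order-dependent: of two near-duplicates, whichever comes first in the stream survives.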

If the data set is too large to hold all seen n-grams in memory while discovering and removing duplicate tweets, look into Bloom filters.

For a ready-made package, https://docs.dedupe.io/en/latest/ was the one that came up most often in a quick search just now. (I don't know how it scales, though.)

HTH,
Darren
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
