Looking at the contents of the gzip'ed archives, stripping out the headers does not look trivial, but it appears that it could be done in most cases. Quoted text is a whole other problem. Any preference on whether or not it should be included as well? If it is included, the word counts are not entirely accurate, because each quoted passage is counted again every time someone replies to it.
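To make the question concrete, here is a rough Python sketch of how a single message could be filtered. It is only an illustration under assumptions, not a tested tool for the real archives: the function names are made up, the attribution line is assumed to look like "On <date>, <name> wrote:" on one line, the list footer is assumed to start with a line of underscores, and multipart messages are simply skipped. Reading the gzip'ed file and splitting it into individual messages is left out.

```python
import email
import re
from collections import Counter

# Assumption: an attribution line looks like "On <date>, <name> wrote:".
ATTRIBUTION = re.compile(r"^On .*wrote:\s*$")
# Assumption: the list footer starts with a line of underscores.
FOOTER = re.compile(r"^_{5,}\s*$")


def content_lines(raw_message, keep_quotes=False):
    """Return the body lines of one raw message, without headers or footer."""
    msg = email.message_from_string(raw_message)  # splits headers from body
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart message: skipped in this sketch
        return []
    kept = []
    for line in body.splitlines():
        if FOOTER.match(line):
            break  # everything from here on is the list footer
        is_quote = line.lstrip().startswith(">") or ATTRIBUTION.match(line)
        if is_quote and not keep_quotes:
            continue
        kept.append(line)
    return kept


def word_counts(raw_messages, keep_quotes=False):
    """Count lowercase words over the filtered bodies of many messages."""
    counts = Counter()
    for raw in raw_messages:
        for line in content_lines(raw, keep_quotes):
            counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts
```

Calling word_counts(messages) and word_counts(messages, keep_quotes=True) on the same list of messages would show how much the quoted text inflates the totals.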
Peter

On Mon, Oct 4, 2010 at 2:13 PM, Federico Leva (Nemo) <nemow...@gmail.com> wrote:

> phoebe ayers, 04/10/2010 17:29:
> > This is fun! thanks for doing it. It would be interesting to see a
> > version with all of the headers stripped out (dates & email terms:
> > mailman/mimedel, etc.) so the content words would really show up.
>
> If someone tells me how to do what Werdna suggested (I'm not a
> programmer)... In the meanwhile, just add stopwords to the talk as phoebe
> did:
> http://commons.wikimedia.org/wiki/File_talk:Foundation-l_word_cloud.png
>
> > I
> > like that "community" is huge and "individual" is tiny :)
>
> With 1000 words there are lots of funny things to discover. :-D
>
> Nemo
>
> _______________________________________________
> foundation-l mailing list
> foundation-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l