[R] Cluster analysis using term frequencies

Sun Shine Tue, 24 Mar 2015 04:58:17 -0700

Hi list

I am using the 'tm' package to review meeting notes at a school to identify terms frequently associated with 'learning', 'sports', and 'extra-mural' activities, and then to sort any terms according to these three headers in a way that could be supported statistically (as opposed to, say, my own bias, etc.).


To accomplish this, I have done the following:

(1) After the usual pre-processing of the text data, loading it as a corpus and then converting it into a document term matrix (called 'allTerms'), I have identified the 20 most frequently occurring terms in the meeting notes and extracted these into a named vector called 'freqTerms'. Many of the terms returned have nothing to do with any of the three themes of 'learning', 'sports', or 'extra-mural'.

(2) Therefore, I have also manually generated a list of terms and synonyms for 'learning' and 'sports', etc. (e.g. 'football', 'soccer', 'drama', 'chess', etc.) and then tested for the occurrence of each of these terms in the corpus, e.g.:


> allTerms['soccer']

and have come up with a list of some 30 terms together with their frequencies. I manually sorted these according to three headers 'learning', 'sports', and 'extra-mural' and dropped these into a table in a word processing document. Some of these terms are also in the freqTerms vector.
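A minimal sketch of this lookup done in one pass rather than term by term, assuming the 'tm' package is installed. The notes, the term list 'dict', and the names 'dictTerms' and 'dictFreq' are hypothetical stand-ins, not the real data. It relies on the 'dictionary' control option of DocumentTermMatrix, which restricts counting to the listed terms.

```r
library(tm)

# Hypothetical meeting notes standing in for the real corpus
notes <- c(
  "The soccer team trains on Monday and the chess club meets on Friday",
  "Reading results improved and the drama group rehearsed",
  "Soccer fixtures were moved and the chess club needs new boards"
)
corpus <- VCorpus(VectorSource(notes))

# Hypothetical hand-picked terms (lower case, since tm lower-cases by default)
dict <- c("soccer", "chess", "drama", "reading")

# Count only the dictionary terms in each document
dictTerms <- DocumentTermMatrix(corpus, control = list(dictionary = dict))

# Total frequency of each term across all documents, as a named vector
dictFreq <- colSums(as.matrix(dictTerms))
dictFreq
```

The result is a named numeric vector, so it can be bound into a data frame alongside the manually assigned header for each term.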

What I want to do now is to use cluster analysis (hclust, from the 'stats' package) to plot a dendrogram of the terms I have manually checked and put into the table, in order to see how similar the terms are and whether they cluster in the same way as I manually sorted them under the table column headers of 'learning', 'sports', and 'extra-mural'.

To do this, I dropped these manually sorted terms into a data frame together with the associated values (which I called 'tes.df') and then tried plotting this as follows:


> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))

However, I get an error message when trying to plot this: "Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, : invalid dendrogram input".
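One likely cause, sketched below under assumptions: dist() computes distances between the rows of its input, so the tree has one leaf per row, whereas names(tes.df) returns the column names. If the two counts differ, the labels cannot be matched to the leaves. The data frame here is a hypothetical stand-in for 'tes.df', with the terms stored as row names and one numeric column per document.

```r
# Hypothetical stand-in for 'tes.df': one row per term, one column per document
tes.df <- data.frame(
  doc1 = c(4, 3, 0, 1, 0),
  doc2 = c(5, 2, 1, 0, 0),
  doc3 = c(0, 1, 4, 3, 2),
  row.names = c("soccer", "football", "chess", "drama", "reading")
)

dtes <- dist(tes.df, method = "euclidean")   # distances between rows (terms)
dtesFreq <- hclust(dtes, method = "ward.D")

# The label vector from names() follows the columns, the leaves follow the rows
length(names(tes.df))    # 3 columns
length(dtesFreq$order)   # 5 leaves

# Labels taken from the row names match the number of leaves
plot(dtesFreq, labels = rownames(tes.df))
```

Because hclust() keeps the row names it receives through dist(), calling plot(dtesFreq) without a 'labels' argument should give the same labelling in this sketch.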

I'm clearly screwing something up, either in my source data.frame or in how I set up hclust, but I don't know which, nor how.

More than just identifying the error, however, I am interested in finding a smart (efficient/elegant) way of checking the occurrence and frequency of the terms that may be associated with 'sports', 'learning', and 'extra-mural' and extracting these into a matrix or data frame, so that I can analyse and plot their clustering to see whether the way I associated these terms is actually supported statistically.
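A rough sketch of such a comparison in base R, using invented counts. The matrix 'termCounts' and the vector 'manual' are hypothetical; with a real document term matrix from 'tm', where terms are columns, something like t(as.matrix(...)) would be needed first so that the terms become rows. The tree is cut into three groups and cross-tabulated against the manual headers.

```r
# Hypothetical counts: rows are terms, columns are meetings
termCounts <- matrix(
  c(5, 4, 0, 0, 0, 0,
    4, 5, 0, 1, 0, 0,
    0, 0, 5, 4, 0, 0,
    0, 1, 4, 5, 0, 0,
    0, 0, 0, 0, 5, 4,
    0, 0, 0, 0, 4, 5),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("soccer", "football", "reading", "maths", "chess", "drama"),
    paste0("meeting", 1:6)
  )
)

# Hypothetical manual sorting of the same terms, in the same row order
manual <- c(soccer = "sports", football = "sports",
            reading = "learning", maths = "learning",
            chess = "extra-mural", drama = "extra-mural")

# Cluster the terms and cut the tree into three groups
hc <- hclust(dist(termCounts, method = "euclidean"), method = "ward.D")
groups <- cutree(hc, k = 3)

# Cross-tabulate the manual headers against the cluster membership
table(manual, groups)
```

If the manual sorting agrees with the clustering, each row of the table has its counts concentrated in a single column. Terms used in the same meetings end up close together here, so the result depends on how the counts are laid out across documents.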

I'm sure that there must be a way of doing this in R, but I'm obviously not going about it correctly. Can anyone shine a light please?


Thanks for any help/ guidance.

Regards,
Sun
