I am using the 'tm' package for text mining and am having trouble finding frequently occurring terms. From their definitions, findFreqTerms and the minDocFreq control option appear to be equivalent: both seem to identify terms that occur at least a specified number of times. However, I am getting drastically different results from the two. The results from both are given below.
findFreqTerms identifies 3140 terms that appear at least 5 times, but minDocFreq identifies only 659 terms. Can someone please explain the reason for the difference, or tell me whether I have misunderstood their definitions?

> tdm1 <- TermDocumentMatrix(tr1, control = list(weighting = weightBin))
> freq_terms <- findFreqTerms(tdm1, lowfreq = 5, highfreq = Inf)
> str(freq_terms)
 chr [1:3140] "abc" "abil" "abl" "abnorm" "abort" "absenc" ...

> tdm2 <- TermDocumentMatrix(tr1, control = list(minDocFreq = 5, minWordLength = 1))
> str(tdm2)
List of 6
 $ i       : int [1:4703] 173 616 624 241 350 534 563 609 129 333 ...
 $ j       : int [1:4703] 1 2 3 7 7 7 7 8 10 10 ...
 $ v       : num [1:4703] 7 5 6 9 5 7 5 5 5 7 ...
 $ nrow    : int 659
 $ ncol    : int 5677
 $ dimnames:List of 2
  ..$ Terms: chr [1:659] "\024" "\026" "ac" "access" ...
  ..$ Docs : chr [1:5677] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thank you.
Ravi