[R] findFreqTerms vs minDocFreq in Package 'tm'

vioravis Sun, 11 Sep 2011 23:45:19 -0700

I am using the 'tm' package for text mining and am having trouble finding
the frequently occurring terms. From their definitions, it appears that
findFreqTerms and minDocFreq are equivalent and that both try to identify
the terms that appear at least a specified number of times. However, I am
getting drastically different results from the two. The output from both
commands is given below:


findFreqTerms identifies 3140 terms that appear at least 5 times, but
minDocFreq identifies only 659 terms. Can someone please explain the reason
for the difference, or tell me whether I have misunderstood their definitions?


> tdm1 <- TermDocumentMatrix(tr1, control = list(weighting = weightBin))
> freq_terms <- findFreqTerms(tdm1, lowfreq = 5, highfreq = Inf)
> str(freq_terms)
 chr [1:3140] "abc" "abil" "abl" "abnorm" "abort" "absenc" ...


> tdm2 <- TermDocumentMatrix(tr1,control=list(minDocFreq=5,minWordLength=1))
> str(tdm2)
List of 6
 $ i       : int [1:4703] 173 616 624 241 350 534 563 609 129 333 ...
 $ j       : int [1:4703] 1 2 3 7 7 7 7 8 10 10 ...
 $ v       : num [1:4703] 7 5 6 9 5 7 5 5 5 7 ...
 $ nrow    : int 659
 $ ncol    : int 5677
 $ dimnames:List of 2
  ..$ Terms: chr [1:659] "\024" "\026" "ac" "access" ...
  ..$ Docs : chr [1:5677] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
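
One possible reading, which is an assumption and not confirmed here against
the 'tm' documentation: on a weightBin matrix, findFreqTerms counts the number
of documents that contain each term, whereas minDocFreq may act as a threshold
on the count of a term within a single document. The v values in the tdm2
output above are all 5 or larger, which would be consistent with that. The
base-R sketch below uses made-up counts (not the tr1 corpus, and no 'tm'
functions) only to show how these two readings can select different terms.

```r
# Toy term-document counts (rows = terms, columns = documents).
# The data are invented for illustration; they are not from the tr1 corpus.
counts <- matrix(
  c(1, 1, 1, 1, 1, 0,   # "alpha": once in each of 5 documents
    6, 0, 0, 0, 0, 0,   # "beta":  6 times in a single document
    2, 0, 1, 0, 0, 0),  # "gamma": rare everywhere
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("alpha", "beta", "gamma"),
                  Docs  = as.character(1:6))
)

# Reading 1 (assumed): keep terms that occur in at least 5 documents,
# i.e. row sums of a binary (present/absent) matrix.
doc_freq    <- rowSums(counts > 0)
by_doc_freq <- names(doc_freq[doc_freq >= 5])

# Reading 2 (assumed): drop every within-document count below 5, then
# keep the terms that still have at least one nonzero entry.
within_doc <- counts
within_doc[within_doc < 5] <- 0
by_within_doc <- rownames(within_doc)[rowSums(within_doc) > 0]

print(by_doc_freq)
print(by_within_doc)
```

With these toy counts the two readings pick disjoint sets of terms, so a large
gap such as 3140 versus 659 would not be surprising if the two thresholds
really are applied at different levels.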


Thank you.

Ravi



--
View this message in context: 
http://r.789695.n4.nabble.com/findFreqTerms-vs-minDocFreq-in-Package-tm-tp3806644p3806644.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
