On 07-Feb-10 01:06:40, Ben Bolker wrote:
> Jim Lemon <jim <at> bitwrit.com.au> writes:
>
>> On 02/06/2010 06:57 PM, Charlotte Maia wrote:
>> > Hey all,
>> >
>> > Does anyone know if there are any R packages with a copy of the KJV?
>> > I'm guessing the answer is no...
>> >
>> > So the next question, and the more important one is:
>> > Does anyone think it would be useful (e.g. for text-mining
>> > purposes)?
>> > I know almost nothing about theology,
>> > so I'm not sure what kind of questions theologists might have
>> > (that R could answer).
>> >
>> > An alternative, that would achieve a similar result (I think),
>> > would be an R interface to another open source system, such as
>> > Sword.
>> >
>> Hi Charlotte,
>> Try
>>
>> http://www.gutenberg.org/etext/10
>>
>> Jim
>>
>
> I couldn't help it:
>
> x <- url("http://www.gutenberg.org/dirs/etext90/kjv10.txt",open="r")
> X <- readLines(x,n=20000)
> z <- grep("First Book of Moses",X)
> X <- X[-(1:z)]
> X <- X[nchar(X)>0]
> length(X) ## 15058
> words <- tolower(unlist(strsplit(X,"[ .,:;()]")))
> words2 <- grep("[^0-9]",words,value=TRUE)
> tt <- rev(sort(table(words2)))
> barplot(rev(tt[1:100]),horiz=TRUE,las=1,cex.names=0.4,log="x")

Delightful! And fascinating in the detail too.

  length(tt)
  # [1] 5078

with slight changes like:

  barplot(rev(tt[1:50]),horiz=TRUE,las=1,cex.names=0.6,log="x")
  # ...
  barplot(rev(tt[101:150]),horiz=TRUE,las=1,cex.names=0.6,log="x")
  # ...

and see the likes of

  tt["lord"]
  # lord
  # 1939
  tt["god"]
  # god
  #  822
  tt["men"]
  # men
  # 204
  tt["women"]
  # women
  #    26

I'm now wondering how it matches up with Zipf's Law (or perhaps
Fisher's logarithmic ... )

Thanks, Ben!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Feb-10                                       Time: 08:28:30
------------------------------ XFMail ------------------------------
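
On the Zipf's Law question, a rough check is to regress log(frequency)
on log(rank) and see whether the slope comes out near -1. The sketch
below is only an illustration and was not part of the thread: zipf_fit
is a hypothetical helper name, and toy_words is a made-up stand-in so
that the code runs without downloading the text. With the table built
above, one would presumably call zipf_fit(tt) instead.

```r
# Hypothetical helper (not from the thread): fit log(freq) ~ log(rank).
# Under Zipf's Law the fitted slope should be roughly -1.
zipf_fit <- function(counts) {
  freq <- sort(as.numeric(counts), decreasing = TRUE)  # largest count first
  rank <- seq_along(freq)                              # rank 1 = most frequent
  fit  <- lm(log(freq) ~ log(rank))
  list(slope = unname(coef(fit)[2]), freq = freq, rank = rank, fit = fit)
}

# Toy stand-in for the real word list (an assumption, for illustration only)
toy_words <- c(rep("the", 12), rep("and", 6), rep("of", 4),
               rep("lord", 3), "god", "men")
toy_tt <- table(toy_words)

res <- zipf_fit(toy_tt)
res$slope   # slope of the log-log fit for the toy data

# Log-log plot with the fitted line; a straight line suggests a power law
plot(log(res$rank), log(res$freq),
     xlab = "log(rank)", ylab = "log(frequency)")
abline(res$fit)
```

A slope close to -1 on the real table would be consistent with Zipf's
Law, although a straight-line fit on log-log axes is a fairly crude
diagnostic, and the rare words in the tail tend to pull the estimate
around.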