From the CRAN task view on natural language processing:
(package) tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article Text Mining Infrastructure in R gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
Worth looking into?

-Don

At 12:59 PM -0700 7/2/09, Helter Two wrote:
WinXP, R-2.9.1

LS.,

I have been trying to solve a (for me) tricky issue. No matter what I've tried, I just can't find a way to do this. This is the issue: I have a text file (ansi text) "titles.txt" with lines of text; here is an example of such a file:

>>>>>
a brief history of polio vaccines
anti-vaccination movements and their interpretations
early warning in the light of theories of technological change
international mobility among nordic doctoral students
land of hope and glory: exploring cochlear implantation in the netherlands
making science - between nature and society
medical innovations in historical-perspective
photographing medicine - images and power in britain and america since 1840
shifts in global immunisation goals (1984-2004): unfinished agendas and mixed results
striking the mother lode in science - the importance of age, place, and time
technology assessment and the sociopolitics of health technologies
the policy of science and technology - evolution of research policy - france, the united-kingdom, the federal-republic-of-germany, japan, the united-states - french
vaccine independence, local competences and globalisation: lessons from the history of pertussis vaccines
external assessment and conditional financing of research in dutch universities
histories of cochlear implantation
lock in, the state and vaccine development: lessons from the history of the polio vaccines
peerless science - peer-review and united-states science policy
technology, science, and obstetric practice - the origins and transformation of cephalopelvimetry
the rhetoric and counter-rhetoric of a ''bionic'' technology
vaccine innovation and adoption: polio vaccines in the uk, the netherlands and west germany, 1955-1965
<<<<<

Some of the lines in such a file are very long (not in this example). The file contains titles and abstracts of scientific articles. In addition to this file, I also have a file "words.txt" that includes a set of words I want to analyze. Part of this file:

>>>>>
technology
technological
innovations
science
policy
society
history
<<<<<

What I want is to create a matrix in which cell [i,j] contains the number of times word i (i.e. the ith word from "words.txt") appears in line j of "titles.txt". So, for the data above this would yield (barring any typos on my side):

0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0

This is the precursor to co-word analysis and some basic statistics on these titles and abstracts. I have always had a hard time working with text in R and still have no idea how to achieve the results above. I am probably overlooking something pretty straightforward. But right now, I am completely in the dark.

Any help is very much appreciated,
Peter Verbeet

[[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
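
For the count matrix described in the quoted message, base R alone may be enough. What follows is only a minimal sketch, not something tested against the poster's actual files. The function name count_words is made up for illustration. It assumes that matching is case-insensitive, that only whole words count (so "technologies" is not a hit for "technology"), and that anything other than a letter separates words.

```r
# Hypothetical helper: rows are words, columns are lines, and cell [i, j]
# holds the number of times word i occurs as a whole word in line j.
count_words <- function(lines, words) {
  words <- unique(tolower(words))
  # Lower-case each line and split it on runs of non-letter characters
  tokens <- strsplit(tolower(lines), "[^[:alpha:]]+")
  counts <- vapply(
    tokens,
    # Tabulate the tokens against the fixed word list; tokens that are
    # not in the list become NA and are dropped by table()
    function(tok) as.integer(table(factor(tok, levels = words))),
    integer(length(words))
  )
  # vapply() returns a plain vector when there is a single word,
  # so force the result back into a words-by-lines matrix
  counts <- matrix(counts, nrow = length(words), ncol = length(lines))
  rownames(counts) <- words
  counts
}

# With the files from the post, the inputs would be read roughly like this:
#   titles <- readLines("titles.txt")
#   words  <- scan("words.txt", what = character())

# Small self-contained example using three titles from the post
titles <- c(
  "early warning in the light of theories of technological change",
  "medical innovations in historical-perspective",
  "technology assessment and the sociopolitics of health technologies"
)
words <- c("technology", "technological", "innovations")

m <- count_words(titles, words)
print(m)
```

Because each line is reduced to a vector of tokens before counting, very long lines (abstracts rather than titles) should not need special handling. If partial matches were wanted instead, the tokenising step would have to be replaced by a regular-expression search per word.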

