Thanks, the pointer to the tokenizer helped.

------------------------------------------------------------
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail

"The real problem is not whether machines think but whether men do."
-- B. F. Skinner
******************************************************************

On Thu, Aug 13, 2009 at 6:11 PM, Ingo Feinerer <feine...@logic.at> wrote:

> On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:
> > I am using the package "tm" for text-mining of abstracts and would like to use
> > it to find instances of gene names that may contain white space. For instance
> > "gene regulatory protein 1". The default behavior of tm is to parse this into 4
> > separate words, but I would like to use the class constructor "dictionary" to
> > define phrases such as just mentioned.
> >
> > Is this possible? If so, how?
>
> Yes.
>
> * In case you only need to find instances, you could use full text
>   search on your corpus, e.g.
>
>   R> tmIndex(yourCorpus, "gene regulatory protein 1")
>
>   would return the indices of all documents in your corpus containing
>   this phrase.
>
> * If you need tokens (in a term-document matrix) of length 4, you could
>   use an n-gram tokenizer (n = 4). See e.g.,
>   http://tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use
>   the dictionary argument to store only your selection of gene
>   names. I.e., something like
>
>   R> yourTokenizer <- function(x)
>        RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
>   R> TermDocumentMatrix(crude, control = list(tokenize = yourTokenizer,
>                                              dictionary = yourDictionary))
>
>   where yourDictionary contains the gene names (a character vector
>   suffices) to be included in the term-document matrix.
>
> * If you want to extract arbitrary patterns of different length that
>   could match some gene names (and build a dictionary from that), you
>   need some custom functionality. Regular expressions might be a good
>   starting point ...
>
> Best regards, Ingo
>
> --
> Ingo Feinerer
> Vienna University of Technology
> http://www.dbai.tuwien.ac.at/staff/feinerer

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
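
For the third suggestion in the quoted reply (regular expressions as a starting point), here is a minimal sketch in base R only, so it does not depend on tm or RWeka. The `abstracts` vector and the naming pattern "gene regulatory protein <number>" are made up for illustration; real gene names would need a pattern of their own.

```r
# Hypothetical example abstracts (assumption: not from any real corpus).
abstracts <- c(
  "We cloned gene regulatory protein 1 from rat brain.",
  "No candidate genes were identified in this screen.",
  "Levels of gene regulatory protein 2 and gene regulatory protein 1 fell."
)

# Fixed-phrase search: indices of documents containing the exact phrase.
phrase <- "gene regulatory protein 1"
hits <- which(grepl(phrase, abstracts, fixed = TRUE))

# Pattern-based extraction: every match of an assumed naming pattern,
# returned as one character vector per document.
pattern <- "gene regulatory protein [0-9]+"
matches <- regmatches(abstracts, gregexpr(pattern, abstracts))

# Candidate dictionary: the unique phrases found across all documents.
candidates <- sort(unique(unlist(matches)))

print(hits)
print(candidates)
```

The resulting `candidates` character vector could then serve as the `yourDictionary` argument mentioned in the quoted reply, assuming the tokenizer produces phrases of matching length.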