Thanks, the pointer to the tokenizer helped.
------------------------------------------------------------
Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail

"The real problem is not whether machines think but whether men do."
-- B. F. Skinner
******************************************************************


On Thu, Aug 13, 2009 at 6:11 PM, Ingo Feinerer <feine...@logic.at> wrote:

> On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:
> > I am using the package "tm" for text-mining of abstracts and would like to
> > use it to find instances of gene names that may contain white space. For
> > instance "gene regulatory protein 1". The default behavior of tm is to parse
> > this into 4 separate words, but I would like to use the class constructor
> > "dictionary" to define phrases such as just mentioned.
> >
> > Is this possible? If so, how?
>
> Yes.
>
> * In case you only need to find instances, you could use full text
>  search on your corpus, e.g.
>
>  R> tmIndex(yourCorpus, "gene regulatory protein 1")
>
>  would return the indices of all documents in your corpus containing
>  this phrase.
>
> * If you need tokens (in a term-document matrix) of length 4, you could
>  use an n-gram tokenizer (n = 4). See, e.g.,
>  http://tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use
>  the dictionary argument to store only your selection of gene
>  names. I.e., something like
>
>  R> yourTokenizer <- function(x)
>  +    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
>  R> TermDocumentMatrix(crude,
>  +    control = list(tokenize = yourTokenizer,
>  +                   dictionary = yourDictionary))
>
>  where yourDictionary contains the gene names (a character vector
>  suffices) to be included in the term-document matrix.
>
> * If you want to extract arbitrary patterns of different length that
>  could match some gene names (and build a dictionary from that), you
>  need some custom functionality. Regular expressions might be a good
>  starting point ...
>
> Best regards, Ingo
>
> --
> Ingo Feinerer
> Vienna University of Technology
> http://www.dbai.tuwien.ac.at/staff/feinerer
>
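
For the regular-expression route in the last point, here is a minimal
sketch in base R. It is only an illustration under stated assumptions:
the abstracts, the list of gene-family prefixes, and the "prefix followed
by a number" naming pattern are all made up for the example and are not
from the thread. It collects matching phrases of different lengths into a
character vector that could then serve as a dictionary.

```r
# Hypothetical example data (not from the original thread).
abstracts <- c(
  "Expression of gene regulatory protein 1 was reduced in treated rats.",
  "We found no change in alcohol dehydrogenase 4 or in controls.",
  "Gene regulatory protein 1 and heat shock protein 70 were both induced."
)

# Assumed gene-family prefixes; replace with the families you care about.
families <- c("gene regulatory protein", "heat shock protein",
              "alcohol dehydrogenase")

# Assumed naming pattern: a family prefix, a space, then a number.
pattern <- paste0("\\b(", paste(families, collapse = "|"), ") [0-9]+\\b")

# gregexpr() locates every match in each document and regmatches()
# extracts the matched text; matching is case-insensitive.
hits <- regmatches(abstracts,
                   gregexpr(pattern, abstracts,
                            perl = TRUE, ignore.case = TRUE))

# Unique lower-cased phrases: a candidate dictionary (character vector).
candidateDictionary <- sort(unique(tolower(unlist(hits))))

# Indices of the documents that contain one known phrase.
docsWithGene <- grep("gene regulatory protein 1", abstracts,
                     ignore.case = TRUE)
```

Because the matched phrases here have three or four words, a tokenizer
with a fixed n = 4 would miss the shorter ones; an n-gram range (for
example min = 3, max = 4) would be needed if such a vector is passed as
the dictionary.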


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
