I apologize for not being clear. d is a character vector of length 158908. Each element of the vector has been designated a sentence by sentDetect (package: openNLP). Some of these really are sentences; others are merely groups of meaningless characters separated by white space.

strapply is a function in the package gsubfn. It applies the regular expression (second argument) to each element of the first argument. Every match is then passed to the designated function (third argument; missing in my case, hence the identity function). Thus, with strapply I am simply performing a white-space tokenization of each sentence. I am doing this in the hope of being able to distinguish true sentences from false ones on the basis of mean token length, maximum token length, or similar.
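To make the goal concrete, here is a minimal sketch of the per-sentence statistics described above. It uses base R (regmatches with gregexpr) instead of strapply, which is a substitution for illustration and not necessarily a fix for the Calloc error. The three-element vector d below is a hypothetical stand-in for the real data, and the names tokens, token_lengths, mean_len and max_len are assumptions introduced for this example.

```r
# Hypothetical stand-in for the real vector d (158,908 elements).
d <- c("This is a real sentence.", "x7 q  zz k1 9", "")

# Extract every run of word characters from each element,
# using the same pattern as in the original call ("\\w+").
tokens <- regmatches(d, gregexpr("\\w+", d, perl = TRUE))

# Length in characters of every token, per element.
token_lengths <- lapply(tokens, nchar)

# Mean token length per element; NA where an element has no tokens.
mean_len <- vapply(
  token_lengths,
  function(n) if (length(n)) mean(n) else NA_real_,
  numeric(1)
)

# Maximum token length per element; 0 where an element has no tokens.
max_len <- vapply(
  token_lengths,
  function(n) if (length(n)) max(n) else 0L,
  integer(1)
)

# One row per candidate sentence, ready for thresholding or inspection.
stats <- data.frame(mean_len = mean_len, max_len = max_len)
```

On the real data, candidate sentences with an unusually small or large mean_len or max_len could then be inspected by hand before choosing any cut-off.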

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  richard....@pueo-owl.ch


On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:



richard....@pueo-owl.ch wrote:
I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this
is a Mac-specific problem.
I have a very large (158,908 possible sentences, ca. 58 MB) plain text
document d which I am
trying to tokenize:  t <- strapply(d, "\\w+", perl = T).  I am
encountering the following error:


What is strapply() and what is d?

Uwe Ligges




Error in base::gsub(pattern, rs, x, ...) :
 Calloc could not allocate (-1398215180 of 1) memory
This happens regardless of whether I run in 32- or 64-bit mode.  The
machine has 8 GB of RAM, so
I can hardly believe that RAM is a problem.
Thanks,
Richard
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
