Re: [R] Memory management in R

David Winsemius Sat, 09 Oct 2010 07:49:14 -0700


On Oct 9, 2010, at 9:45 AM, Lorenzo Isella wrote:

Hi David,
I am replying to you and to the other people who provided someinsight into my problems with grepl.
Well, at least we now know that the bug is reproducible.
Indeed it is a strange sequence the one I am postprocessing,probably pathological to some extent, nevertheless the problem isgiven by grepl crushing when a long (but not huge) chunk of repeateddata is loaded has to be acknowledged.Now, my problem is the following: given a potentially long string(or before that a sequence, where every element has been generatedvia the hash function, algo='crc32' of the digest package), how canI, starting from an arbitrary position i along the list, calculatethe shortest substring in the future of i (i.e. the interval i:endof the series) that has not occurred in the past of i (i.e. [1:i-1])?

Maybe you should work on a less convoluted explanation of the test? Orperhaps a couple of compact examples, preferably in R-copy-paste format?

Efficiency is not the main point here, I need to run this code onlyonce to get what I need, but it cannot crush on a 2000-entry string.

My suggestion is to explore other alternatives. (I will admit that Idon't yet fully understand the test that you are applying.) The twothat have occurred to me are Biostrings which I have already mentionedand rle() which I have illustrated the use of but not referenced as anavenue. The Biostrings package is part of bioConductor (part of the Runiverse) although you should be prepared for a coffee break when youinstall it if you haven't gotten at least bioClite already installed.When I installed it last night it had 54 other package dependents alsodownloaded and installed. It seems to me that taking advantage of thecoding resources in the molecular biology domain that are currentlydirected at decoding the information storage mechanism of life mightbe a smart strategy. You have not described the domain you are workingin but I would guess that the "digest" package might be biological inprimary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me but it might take a smarter coderthan I to fully implement it. (But maybe Holtman would be up to it.He's a _lot_ smarter than I.) In your example the long "x" string isfaithfully represented by two aligned vectors, each 197 characters inlength. The long repeat sequence that broke the grepl mechanism arejust one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 theoverall length (as happened in the "x" case) you could stop, since itcould not have "occurred before".


--
David.

Cheers

Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:

What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even withlonger
lists.


But maybe your loop terminated in them eaarlier/ Someplace between
11*225 and 11*240 the grepping machine gives up:

> eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution ofvalues.You have a very skewed distribution with the vast majority being inthe

same value as appeared in your error message :

> table(x)
x

12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5eabddf3d1

1419 299 1 1 1 3 1 1
ac76183b b955be36 c600173a e96f6bbd e9c56275
1 30 5 1 9

And you have 1159 of them in one clump (which would seem to besomewhat

improbably under a random null hypothesis:

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"

HTH (although I think it means you need to construct a different
implementation strategy);

David.

Many thanks

Lorenzo


David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Memory management in R

Reply via email to