On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:
My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)
Hi,
I am trying to partially implement the Lempel Ziv compression
algorithm.
The point is that compressibility and entropy of a time series are
related, hence my final goal is to evaluate the entropy of a time
series.
You can find more at
http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt
The two that
have occurred to me are Biostrings which I have already mentioned and
rle() which I have illustrated the use of but not referenced as an
avenue. The Biostrings package is part of bioConductor (part of the R
universe) although you should be prepared for a coffee break when you
install it if you haven't gotten at least bioClite already installed.
When I installed it last night it had 54 other package dependents
also
downloaded and installed. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life
might be
a smart strategy. You have not described the domain you are working
in
but I would guess that the "digest" package might be biological in
primary application? So forgive me if I am preaching to the choir.
The rle option also occurred to me but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it.
He's
a _lot_ smarter than I.) In your example the long "x" string is
faithfully represented by two aligned vectors, each 197 characters in
length. The long repeat sequence that broke the grepl mechanism are
just
one pair of values.
> rle(x)
Run Length Encoding
lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...
So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the "x" case) you could stop, since it
could not have "occurred before".
I doubt that rle() can be deployed to replace Lempel-Ziv (LZ)
algorithm in a trivial way. As a less convoluted example, consider
the series
x <- c("d","a","b","d","a","b","e","z")
If i=4 and therefore the i-th element is the second 'd' in the
series, the shortest series starting from i=4 that I do not see in
the past of 'd' is
"d","a","b","e", whose length is equal to 4 and that is the value
returned by the function below.
The frustrating thing is that I already have the tools I need, just
they crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that is really a
big help for me.
I already offered the Biostrings package. It provides more robust
methods for string matching than does grepl. Is there a reason that
you choose not to?
--
David.
Cheers
Lorenzo
###########################################################
entropy_lz <- function(x,i){
past <- x[1:i-1]
n <- length(x)
lp <- length(past)
future <- x[i:n]
go_on <- 1
count_len <- 0
past_string <- paste(past, collapse="#")
while (go_on>0){
new_seq <- x[i:(i+count_len)]
fut_string <- paste(new_seq, collapse="#")
count_len <- count_len+1
if (grepl(fut_string,past_string)!=1){
go_on <- -1
}
}
return(count_len)
}
x <- c("c","a","b","c","a","b","e","z")
S <- entropy_lz(x,4)
David Winsemius, MD
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.