Laurent Jégou wrote:
> Hello list members, I'd like to calculate Moran's I on a large dataset,
> an 8640x3432 grid of values.
>
> When I try to create the neighbours list with the cell2nb function at
> such a scale, R works for several hours and then seems to crash. I'm using
> the latest version (2.13), 64-bit, on a Mac Pro with 4 GB of memory.
>
> I think that I'm doing it wrong :-)
>
> I'd like to use a "moving window" weight matrix instead of a full-scale
> one, but I had no luck finding a solution in the docs.
I confronted this problem recently, and decided to roll my own. For a large
dataset, an exhaustive computation of Moran's I is very time-consuming, as you
say; but an estimate of it, based on a subset of pairs chosen randomly, seemed
(for my data, at least) to converge quite nicely.

Here's my code. It assumes a square grid, so you'll need to adapt it for your
non-square grid, but that should be trivial. To determine the right number of
pairs for a good estimate in my application, I called this function with
various numbers of pairs, plotted the estimate produced as a function of the
number of pairs used, and observed good convergence by 200,000 pairs. You will
probably want to do this yourself for your dataset. 200,000 pairs took just a
second or two to run on my machine, so this is much, much faster than an
exhaustive calculation. As a bonus, this code also gives you Geary's C.

Oh, and the code below assumes a wraparound lattice (i.e. a torus, i.e.
periodic boundaries), which may not be what you want; just get rid of the
d_wraparound stuff and the following pmax() call if you want non-wraparound
boundaries, and that should work fine, I think. Not sure what you mean by the
moving window thing, so I leave that as an exercise for the reader. :->
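To make that convergence check concrete, here's a sketch of how I'd do it
(this uses a stripped-down, non-wraparound estimator of the same form as the
full function below, with sample.int() for the index draws — my own
illustration, not a drop-in from the code that follows):

```r
# Stripped-down randomized Moran's I estimator (direct distances only), used
# here to show the convergence check: run it at several pair counts and see
# where the estimates settle down.
moran_est <- function(p, n_pairs) {
  v  <- as.vector(p) - mean(p)
  n  <- nrow(p)                    # assumes a square n x n grid
  i1 <- sample.int(length(p), n_pairs, replace = TRUE)
  i2 <- sample.int(length(p), n_pairs, replace = TRUE)
  x1 <- (i1 - 1) %/% n; y1 <- (i1 - 1) %% n
  x2 <- (i2 - 1) %/% n; y2 <- (i2 - 1) %% n
  w  <- 1 / sqrt((x1 - x2) ^ 2 + (y1 - y2) ^ 2)
  w[is.infinite(w)] <- 0           # self-pairs get zero weight
  (n_pairs / sum(w)) * sum(w * v[i1] * v[i2]) /
    ((sum(v[i1] ^ 2) + sum(v[i2] ^ 2)) / 2)
}

set.seed(1)
p  <- matrix(rnorm(512 * 512), nrow = 512)
ns <- c(1000, 5000, 20000, 100000, 200000)
est <- sapply(ns, function(n) moran_est(p, n))
plot(ns, est, type = "b", log = "x",
     xlab = "number of pairs", ylab = "Moran's I estimate")
```

For pure noise like this the estimates should wander less and less as the
pair count grows; pick the count where the curve flattens for your data.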
I've never made an R package, and I'm not sure there's much demand for this
code (is there?), so I'm presently unlikely to package it up. However, if
someone who owns a package related to this problem would like to adopt this
code, generalize it to non-square lattices, add a flag to choose periodic or
non-periodic boundaries, etc., feel free to do so; I hereby place it in the
public domain. I'd just like to hear about it if someone does this, and to
receive credit somewhere; that's all. I'm not super-good in R, either, so if
there's a better way to do some things, like the somewhat hacked random index
generation (trying to avoid floating point issues when generating a random
integer), please let me know. I'm always interested in learning how to do
things better.
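On the random index generation specifically: one idiomatic alternative (my
suggestion, not something from the code below) is sample.int(), which draws
uniform integers directly and sidesteps the floating-point concern entirely:

```r
# sample.int() returns 1-based integers in 1..p_length with no rounding at
# the boundaries, so the 0.001 fudge factor and the +1 adjustment both
# become unnecessary.
p_length <- 8640 * 3432
n_pairs  <- 200000
set.seed(42)
p1ix <- sample.int(p_length, n_pairs, replace = TRUE)
range(p1ix)  # always within 1..p_length
```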
And of course this code is provided without warranty, may have bugs, etc.
Enjoy!
Ben Haller
McGill University
correlation_stats <- function(p, n_pairs=200000)
{
    # Compute Moran's I and Geary's C for the landscape p. This is tricky
    # because the exact calculation would be far too time-consuming, as it
    # involves comparison of every point to every other point. So I am trying
    # to estimate it here with a small subset of all possible point comparisons.
    p_size <- NROW(p)      # e.g. 512
    p_length <- length(p)  # e.g. 262144
    mean_p <- mean(p)
    pv <- as.vector(p)

    # Select points and look up their values; for speed, points are selected
    # by their vector index.
    p1ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p1x <- (p1ix %/% p_size) + 1
    p1y <- (p1ix %% p_size) + 1
    p1ix <- p1ix + 1
    p2ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p2x <- (p2ix %/% p_size) + 1
    p2y <- (p2ix %% p_size) + 1
    p2ix <- p2ix + 1
    v1 <- pv[p1ix] - mean_p
    v2 <- pv[p2ix] - mean_p

    # Calculate distances between point pairs, both directly and wrapping
    # around. The average direct distance is much smaller than the average
    # wraparound distance, because points near the center vertically have a
    # maximum direct distance near 256, but a minimum wraparound distance near
    # 256. The Moran's I estimate from wraparound distances is different, as a
    # result. Rupert recommends taking the shorter of the two distances,
    # whichever it is, because that keeps us within the meaningful scale of our
    # space, before it just starts repeating due to periodicity.
    d_direct <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p1y - p2y) ^ 2))
    d_direct[is.infinite(d_direct)] <- 0
    d_wraparound <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p_size - abs(p1y - p2y)) ^ 2))
    d_wraparound[is.infinite(d_wraparound)] <- 0
    d <- pmax(d_direct, d_wraparound)  # max because we want the min distance, and these are 1/distance

    # Precalculate some shared terms.
    sum_d <- sum(d)
    sum_v1_sq <- sum(v1 ^ 2)

    # Moran's I: -1 is perfect dispersion, 1 is perfect correlation, 0 is a
    # random spatial pattern.
    M_I <- (n_pairs / sum_d) * (sum(d * v1 * v2) / ((sum_v1_sq + sum(v2 ^ 2)) / 2))
    # direct only:
    #   (n_pairs / sum(d_direct)) * (sum(d_direct * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))
    # wraparound only:
    #   (n_pairs / sum(d_wraparound)) * (sum(d_wraparound * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))

    # Geary's C: 0 is positive spatial autocorrelation, 2 is negative spatial
    # autocorrelation, 1 is a random spatial pattern.
    G_C <- ((n_pairs - 1) * sum(d * ((v1 - v2) ^ 2))) / (2 * sum_d * sum_v1_sq)

    return(list(Moran_I=M_I, Geary_C=G_C))
}
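If you want to sanity-check the randomized estimate, an exact all-pairs
version of the same statistic is easy to write for a small grid (this is my
own sketch, not part of the function above; it uses the non-wraparound
inverse-distance weights, i.e. the d_direct scheme only):

```r
# Exact Moran's I with inverse-Euclidean-distance weights over all cell
# pairs. Only feasible for small grids: the weight matrix has length(p)^2
# entries, which is exactly why the randomized estimate exists.
exact_moran <- function(p) {
  n  <- length(p)
  v  <- as.vector(p) - mean(p)
  # expand.grid varies 'row' fastest, matching R's column-major as.vector().
  xy <- expand.grid(row = seq_len(nrow(p)), col = seq_len(ncol(p)))
  d  <- as.matrix(dist(xy))        # all pairwise Euclidean distances
  w  <- ifelse(d > 0, 1 / d, 0)    # inverse-distance weights, zero diagonal
  (n / sum(w)) * (sum(w * outer(v, v)) / sum(v ^ 2))
}

gradient     <- matrix(rep(1:10, each = 10), nrow = 10)  # smooth trend
checkerboard <- outer(1:10, 1:10, function(i, j) (-1) ^ (i + j))
exact_moran(gradient)      # positive: strong spatial autocorrelation
exact_moran(checkerboard)  # negative: perfect dispersion
```

Running correlation_stats() on the same small grid with a large n_pairs
should land close to the exact value (up to the wraparound weighting).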
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.