On 04/12/2013 12:08 PM, Nicolas Servant wrote:
Thank you Martin, I will have a look to the SummarizedExperiment class. I'mj
not sure to really understand what you call "a reference class" ? does it
mean that there is no easy way to simply creater a pointer between A and B ?

Practically the answer is no, there is no easy way to create a pointer between A and B.

## Background

R as copy-on-change semantics so actually

    x <- seq(1, 10)
    y <- x

has associated the symbol `y` with the address pointed to by `x` without any memory copy (in this sense `y` is a pointer to `x`), but at the same time marked the memory location at `x` such that if either `x` or `y` is changed, a memory copy occurs. S4 is implemented in a way that makes it difficult to exploit this copy-on-change efficiency.

The 'classic' approach to creating reference semantics (where two symbols actually reference the same data) has been to assign a variable to an environment, and to pass the environment around. An R environment has reference-based semantics.

    e = new.env(parent=emptyenv())
    e[["x"]] = seq(1, 10)
    f = e ## changes to e[["x"]] are reflected in f[["x"]] and vice versa

This approach has been made more formal with so-called reference classes, `?ReferenceClasses`

   .R = setRefClass("R", fields=list(x="matrix"))
    r = .R$new(x=matrix(1:10, 5))
    s = r
    r$x[] = log(r$x[])  ## changes to r$x change s$x

The main draw-back to reference classes is that changing `r` surprises the R user with its 'action-at-a-distance' -- `s` also changes. The user really does not expect the value of one R variable to change when the value of another R variable is changed. Also while the `$` accessor and contained methods of reference classes are more familiar to non-R programmers, they are not necessarily the way R users are expecting to interact with objects (e.g., the S3 object `m = matrix()` is queried with `dim(m)`, not `m$dim()`).

## Implementing copy-on-change reference classes

A solution is to implement a reference class that nonetheless has copy-on-change semantics, so that one gets the benefit of pass-by-reference without the surprise of action-at-a-distance. The implementation can be done so that the user interacts with R objects in a way that is familiar to R users.

The design pattern is actually quite simple; the following uses the convention that `.` prefixes functions that are not meant to be called directly by the user. We start with a function that makes a _shallow_ copy of a reference class, then updates an arbitrary number of reference class fields.

    .clone <-
        function(x, ...)
    {
        args <- list(...)
        x <- x$copy()
        for (nm in names(args))
            x$field(nm, args[[nm]])
        x
    }

The `x$copy()` creates a new instance of the reference class, with references to the original fields. The `for` loop replaces the fields that we've provided, but only in our new copy.

We use `.clone` whenever we want to modify a field of `x`; only the modified fields are copied.

## Example

Here's a reference class representation of your HTCexp class as simplified in your email, with three fields.

    .HTCexp <- setRefClass("HTCexp",
        fields = list(data = "matrix", x_intervals = "GenomicRanges",
          y_intervals = "GenomicRanges"))

A relatively recent addition to R is that `setRefClass` (and `setClass`) return a generator function that can be used to create instances of the class. We use the generator in a function that is exposed to the user, taking the opportunity to provide some hints about expected arguments and to simplify object construction

    HTCexp <-
        function(data = matrix(0, 0, 0), x_intervals = GRanges(),
                 y_intervals = x_intervals, ...)
    {
        .HTCexp$new(data=data, x_intervals = x_intervals,
                    y_intervals = y_intervals, ...)
    }

This is enough to create an object

    htc0 <- HTCexp()
    htc <- HTCexp(matrix(1:25, 5), GRanges("A", IRanges(1:5, width=10)))

We then provide a familiar interface to the class, e.g., with some getters:

    setGeneric("htc_data", function(x, ...) standardGeneric("htc_data"))
    setGeneric("x_intervals", function(x, ...) standardGeneric("x_intervals"))
    setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))

    setMethod("htc_data", "HTCexp", function(x, ...) x$data)
    setMethod("x_intervals", "HTCexp", function(x, ...) x$x_intervals)
    setMethod("y_intervals", "HTCexp", function(x, ...) x$y_intervals)

The user accesses components of `HTCexp` with these S4 methods as `x_intervals(htc)`, but internally we access the data using reference class techniques, e.g, `x$x_intervals`.

If we write a method or function that modifies our object, we need to maintain the illusion of copy-on-change by creating a clone updated with the data that we modify

    setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
        .clone(x, data=htc_data(x)[i,j, ...], x_intervals=x_intervals(x)[i],
              y_intervals=y_intervals(x)[j])
    })

Neat, `htc[1:2, 3:4]`. If you think about it, we aren't necessarily reducing the amount of copying involved -- we modified all fields of the object. In practice, I believe that we _have_ substantially reduced copying, for complicated reasons.

Let's implement the `Math` group generic (see `?GroupGenericFunctions`), so that we can call functions like `log` on our `HTCexp` object (I have no idea whether this is sensible in your example...).

    setMethod("Math", "HTCexp", function(x) {
        data <- callGeneric(htc_data(x))
        .clone(x, data=data)
    })

When the user says, e.g., `log(htc)`, we are updating the `data` field of the object, the GRanges in `x_intervals` and `y_intervals` are not copied.

## Design decisions

The design I offered above made the `HTCexp` class itself a copy-on-change reference class. In contrast, `SummarizedExperiment` creates a `ShallowSimpleListAssays` copy-on-change reference class, and uses this as a slot in the plain S4 `SummarizedExperiment`. There is no technical problem with this, and the implementation seemed to be effective -- the 'big' data is in the `Assays` slot, and we've gained significant performance without adding too much complexity. Your own use case was motivated by identical `GRanges` x- and y-intervals. Perhaps the solution for you is to construct a `GRangesRef` class and use these as slots in your HTCexp class?

As an additional note, there are differences between reference and S4 classes. One thing I've stumbled over is that validity method (`setValidity`) are not automatically invoked for reference classes; one approach is to write such methods, and invoke them at appropriate locations (e.g., after object construction in the `HTCexp` function).


Hope this helps; I'd like to add this as a How To page to the Bioconductor web site http://bioconductor.org/developers/ so if there are comments, corrections, etc., please don't hesitate to let me know.

Martin

nicolas >
Le 12 avr. 2013 à 19:44, Martin Morgan <mtmor...@fhcrc.org> a écrit :

On 04/12/2013 10:02 AM, Servant Nicolas wrote:
Dear all,

I have a S4 object (HTCexp from HITC package), composed of one big matrix, and 
two genomicRanges objects, A and B which describe the matrix raws and columns.
I thinking about a way to decrease the memory size of this object.
I also have methods to  get/set the matrix and the two GRanges, namely 
intdata(), x_intervals(), y_intervals().

In case of symetric matrix, the two GRanges can be the same, so I was 
interested in simply creating in this case, a pointer from B to A. How can I do 
it in R please ??
Second, I'm wondering if it exists other matrix-like object optimized for big 
matrix (5000 x 5000). I quicky saw the Matrix object from the CRAN, useful for 
sparse matrix.
Any suggestion would be appreciated !

This is not a super-big object, so perhaps you're running in to problems with 
R's propensity to copy data? An easy solution might be to re-use the 
SummarizedExperiment class, which addresses this issue by placing the 'assays' 
data in a reference class.

    library(GenomicRanges)

    .HTCexp = setClass("HTCexp", contains="SummarizedExperiment",
      representation=representation(y_intervals="GenomicRanges"))

    HTCexp <-
        function(intdata = matrix(0, 0, 0), x_intervals=GRanges(),
                 y_intervals=GRanges(), ...)
    {
        .HTCexp(SummarizedExperiment(intdata, rowData=x_intervals),
                y_intervals=y_intervals, ...)
    }

which already gives

HTCexp()
class: HTCexp
dim: 0 0
exptData(0):
assays(1): ''
rownames: NULL
rowData metadata column names(0):
colnames: NULL
colData names(0):
m <- matrix(0, 5000, 5000,
+             dimnames=list(seq_len(5000), seq_len(5000)))
g <- GRanges("A", IRanges(1:5000, width=0))
HTCexp(m, g, g)
class: HTCexp
dim: 5000 5000
exptData(0):
assays(1): ''
rownames: NULL
rowData metadata column names(0):
colnames(5000): 1 2 ... 4999 5000
colData names(0):

I think you'd need to implement "[" and a 'y_intervals' accessors


    setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))

    setMethod("y_intervals", "HTCexp", function(x, ...) {
        x@y_intervals
    })

    setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
        ## not sure that this is complete...
        if (missing(i) && missing(j))
            x
        else {
            se <- as(x, "SummarizedExperiment")
            if (missing(i))
                initialize(x, se[,j], y_intervals=y_intervals(x)[j])
            else if (missing(j))
                initialize(x, se[i,])
            else
                initialize(x, se[i,j], y_intervals=y_intervals(x)[j])
        }
    })

Martin


Thank you
Regards
Nicolas

    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to