Re: [R] Very slow using S4 classes

Martin Morgan Mon, 12 Sep 2011 09:56:24 -0700

Hi André...

On 09/12/2011 07:20 AM, André Rossi wrote:

Dear Martin Morgan and Martin Maechler...


Here is an example of the computational time when a slot of an S4 class
is of another S4 class and when it is just one object. I'm sending you
the data file.

Thank you!

Best regards,

André Rossi

############################################################

setClass("SupervisedExample",
     representation(
         attr.value = "ANY",
         target.value = "ANY"
))

setClass("StreamBuffer",
     representation=representation(
         examples = "list", #SupervisedExample
         max.length = "integer"
     ),
     prototype=list(
             max.length = as.integer(10000)
     )
)
b <- new("StreamBuffer")

load("~/Dropbox/dataList2.RData")


For a reproducible example, I guess you have something like

  data <- replicate(10000, new("SupervisedExample"))

b@examples <- data #data is a list of SupervisedExample class.

 > system.time({for (i in 1:100) b@examples[[1]]@attr.value[1] = 2 })

Yes, this is slow. [[<-,S4 is not as clever as [[<-,list and performs extra duplication, including of the 10,000 S4 objects the list contains.
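One possible workaround, suggested by the fast timing for updating the plain list 'data' directly (quoted further down), is to pull the list out of the slot, update the local copy, and assign it back once. This is only a minimal sketch: the buffer size of 1000 and the local name 'ex' are assumptions for illustration, and the actual speed-up will depend on the R version.

```r
library(methods)

setClass("SupervisedExample",
         representation(attr.value = "ANY", target.value = "ANY"))
setClass("StreamBuffer", representation(examples = "list"))

## Assumed size, kept small so the sketch runs quickly
n <- 1000L
b <- new("StreamBuffer",
         examples = lapply(seq_len(n), function(i) new("SupervisedExample")))

## Work on a plain list so the loop never goes through the S4 slot
ex <- b@examples
for (i in 1:100) ex[[1]]@attr.value[1] <- 2

## Write the updated list back into the slot a single time
b@examples <- ex
```

The loop then only touches an ordinary list, and the slot of 'b' is replaced once rather than 100 times.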

As before, an improvement is to think in terms of vectors, maybe a 'SupervisedExamples' class to act as a collection of examples


setClass("SupervisedExamples",
         representation=representation(
           attr.value = "list",
           target.value = "list"))

setClass("StreamBuffer",
         representation=representation(
           examples="SupervisedExamples"))

SupervisedExamples <-
    function(attr.value=vector("list", n),
             target.value=vector("list", n), n, ...)
{
    new("SupervisedExamples", attr.value=attr.value,
        target.value=target.value, ...)
}

StreamBuffer <-
    function(examples, ...)
{
    new("StreamBuffer", examples=examples, ...)
}

data <- SupervisedExamples(n=100000)

b <- StreamBuffer(data)

I then have

> system.time({for (i in 1:100) data@attr.value[[1]] = 2 })
   user  system elapsed
  1.081   0.013   1.094
> system.time({for (i in 1:100) b@examples@attr.value[[1]] <- 2})
   user  system elapsed
  4.283   0.000   4.295

(note the 10x increase in size, from 10,000 to 100,000 elements); still slower, but this will be amortized when the updates are vectorized, e.g.,


> idx = sample(length(b@examples@attr.value), 100)
> system.time(b@examples@attr.value[idx] <- list(2))
   user  system elapsed
  0.013   0.000   0.014
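The same vectorized form also works when each selected element gets a different value. A minimal sketch, assuming a smaller buffer of 1000 elements and made-up indices and values ('idx' and 'new.values' here are hypothetical):

```r
library(methods)

setClass("SupervisedExamples",
         representation(attr.value = "list", target.value = "list"))
setClass("StreamBuffer", representation(examples = "SupervisedExamples"))

n <- 1000L  # assumed size for illustration
b <- new("StreamBuffer",
         examples = new("SupervisedExamples",
                        attr.value = vector("list", n),
                        target.value = vector("list", n)))

## Hypothetical positions and replacement values
idx <- c(3L, 10L, 500L)
new.values <- c(1.5, 2.5, 3.5)

## One replacement call updates all selected elements;
## as.list() turns the numeric vector into one list element per value
b@examples@attr.value[idx] <- as.list(new.values)
```

The nested slot is modified once, no matter how many elements are in 'idx'.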

A further change might be to recognize 'StreamBuffer' as an abstract class that SupervisedExamples extends


setClass("StreamBuffer",
         representation=representation(
           "VIRTUAL", max.len="integer"),
         prototype=prototype(max.len=100000L),
         validity=function(object) {
             if (object@max.len < length(object))
                 "too many elements"
             else TRUE
         })

setMethod(length, "StreamBuffer", function(x) {
    stop("'length' undefined on '", class(x), "'")
})

setClass("SupervisedExamples",
         representation=representation(
           attr.value = "list",
           target.value = "list"),
         contains="StreamBuffer")

setMethod(length, "SupervisedExamples", function(x) {
    length(x@attr.value)
})

SupervisedExamples <-
    function(attr.value=vector("list", n),
             target.value=vector("list", n), n, ...)
{
    new("SupervisedExamples", attr.value=attr.value,
        target.value=target.value, ...)
}

data <- SupervisedExamples(n=100000)

> system.time({for (i in 1:100) data@attr.value[[1]] = 2 })
   user  system elapsed
  1.043   0.014   1.061
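To see the validity check of the virtual class at work, here is a small sketch that repeats the class definitions so it can run on its own. The sizes (5 elements, max.len of 10 and 3) are assumptions chosen only to trigger both outcomes.

```r
library(methods)

setClass("StreamBuffer",
         representation = representation("VIRTUAL", max.len = "integer"),
         prototype = prototype(max.len = 100000L),
         validity = function(object) {
             if (object@max.len < length(object))
                 "too many elements"
             else TRUE
         })

setClass("SupervisedExamples",
         representation = representation(attr.value = "list",
                                         target.value = "list"),
         contains = "StreamBuffer")

setMethod("length", "SupervisedExamples", function(x) length(x@attr.value))

## 5 elements with max.len = 10: passes the inherited validity check
ok <- new("SupervisedExamples", attr.value = vector("list", 5),
          target.value = vector("list", 5), max.len = 10L)

## 5 elements with max.len = 3: new() signals an error, captured here
msg <- tryCatch(
    new("SupervisedExamples", attr.value = vector("list", 5),
        target.value = vector("list", 5), max.len = 3L),
    error = function(e) conditionMessage(e))
```

The validity function is written once on 'StreamBuffer' and applies to any class that extends it and defines 'length'.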

Martin Morgan

    user  system elapsed
  16.837   0.108  18.244

 > system.time({for (i in 1:100) data[[1]]@attr.value[1] = 2 })
    user  system elapsed
   0.024   0.000   0.026

############################################################


2011/9/10 Martin Morgan <mtmor...@fhcrc.org>

    On 09/10/2011 08:08 AM, André Rossi wrote:

        Hi everybody!

        I'm creating an object of an S4 class that has two slots:
        ListExamples, which
        is a list, and idx, which is an integer (as the code below).

        Then, I read a data.frame file with 10000 (ten thousand)
        lines and 10
        columns, do some pre-processing and, basically, I store each
        line as an
        element of a list in the slot ListExamples of the S4 object.
        However, many
        operations after this take a considerable time.

        Can anyone explain to me why this happens? Is it possible to
        speed up a script that deals with a large amount of data (it
        might be a data.frame or a list)?

        Thank you,

        André Rossi

        setClass("Buffer",
             representation=representation(
                 Listexamples = "list",
                 idx = "integer"
             )
        )


    Hi André,

    Can you provide a simpler and more reproducible example, for instance

     > setClass("Buf", representation=representation(lst="list"))
    [1] "Buf"
     > b=new("Buf", lst=replicate(10000, list(10), simplify=FALSE))
     > system.time({ b@lst[[1]][[1]] = 2 })
       user  system elapsed
      0.005   0.000   0.005

    Generally it sounds like you're modeling the rows as elements of
    Listexamples, but you're better served by modeling the columns
    (lst = replicate(10, integer(10000)), if all of your 10 columns were
    integer-valued, for instance). Also, S4 is providing some measure of
    type safety, and you're undermining that by having your class
    contain a 'list'. I'd go after

    setClass("Buffer",
             representation=representation(
               col1="integer",
               col2="character",
               col3="numeric"
               ## etc.
               ),
             validity=function(object) {
                 nms <- slotNames(object)
                 len <- sapply(nms, function(nm) length(slot(object, nm)))
                 if (1L != length(unique(len)))
                     "slots must all be of same length"
                 else TRUE
             })

    Buffer <-
        function(col1, col2, col3, ...)
    {
        new("Buffer", col1=col1, col2=col2, col3=col3, ...)
    }
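A minimal sketch of how the column-oriented class might be filled from a data.frame and updated in a vectorized way. The data.frame 'df' and its values are made up for illustration, and vapply() is used in place of sapply() only to pin the result type.

```r
library(methods)

setClass("Buffer",
         representation = representation(col1 = "integer",
                                         col2 = "character",
                                         col3 = "numeric"),
         validity = function(object) {
             len <- vapply(slotNames(object),
                           function(nm) length(slot(object, nm)), integer(1))
             if (1L != length(unique(len)))
                 "slots must all be of same length"
             else TRUE
         })

## Hypothetical data standing in for the file that was read
df <- data.frame(col1 = 1:5, col2 = letters[1:5],
                 col3 = seq(0.5, 2.5, by = 0.5),
                 stringsAsFactors = FALSE)

## One slot per column instead of one list element per row
buf <- new("Buffer", col1 = df$col1, col2 = df$col2, col3 = df$col3)

## Vectorized update of several 'rows' in one column
buf@col3[c(2, 4)] <- 0

## Columns of unequal length are rejected by the validity function
msg <- tryCatch(new("Buffer", col1 = 1:3, col2 = "a", col3 = numeric(0)),
                error = function(e) conditionMessage(e))
```

Each slot is a typed atomic vector here, so both the type checks and the updates operate on whole columns.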

    Let's see where the inefficiencies are before deciding that this is
    an S4 issue.

    Martin



        ________________________________________________
        R-help@r-project.org mailing list
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide
        http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible code.



    --
    Computational Biology
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

    Location: M1-B861
    Telephone: 206 667-2793




