Re: [R] Very slow using S4 classes

André Rossi Mon, 12 Sep 2011 18:36:55 -0700

Thank you very much, Morgan.  Your suggestion helped me speed up my code, but
I still believe that the inefficiency is an S4 issue.


Best regards,

André Rossi


2011/9/12 Martin Morgan <mtmor...@fhcrc.org>

> Hi André...
>
>
> On 09/12/2011 07:20 AM, André Rossi wrote:
>
>> Dear Martin Morgan and Martin Maechler...
>>
>> Here is an example of the computational time when a slot of an S4 class
>> holds objects of another S4 class and when the same data is just a
>> standalone object. I'm sending you the data file.
>>
>> Thank you!
>>
>> Best regards,
>>
>> André Rossi
>>
>> ############################################################
>>
>> setClass("SupervisedExample",
>>     representation(
>>         attr.value = "ANY",
>>         target.value = "ANY"
>> ))
>>
>> setClass("StreamBuffer",
>>     representation=representation(
>>         examples = "list", #SupervisedExample
>>         max.length = "integer"
>>     ),
>>     prototype=list(
>>             max.length = as.integer(10000)
>>     )
>> )
>> b <- new("StreamBuffer")
>>
>> load("~/Dropbox/dataList2.RData")
>>
>
> For a reproducible example, I guess you have something like
>
>  data <- replicate(10000, new("SupervisedExample"))
>
>
>> b@examples <- data #data is a list of SupervisedExample objects.
>>
>>  > system.time({for (i in 1:100) b@examples[[1]]@attr.value[1] = 2 })
>>
>
> Yes, this is slow. [[<-,S4 is not as clever as [[<-,list and performs extra
> duplication, including those 10,000 S4 objects it contains.
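A minimal sketch of one way to sidestep that duplication, assuming the class definitions quoted above: copy the list out of the slot, update it as a plain list, and assign it back once. The variable name `ex` and the reduced size of 1000 are illustrative assumptions, not part of the thread.

```r
library(methods)

setClass("SupervisedExample",
         representation(attr.value = "ANY", target.value = "ANY"))
setClass("StreamBuffer",
         representation(examples = "list"))

# 1000 examples rather than 10000, only to keep the sketch quick (assumption)
b <- new("StreamBuffer",
         examples = replicate(1000, new("SupervisedExample"),
                              simplify = FALSE))

# Work on a plain list outside the S4 object ...
ex <- b@examples
for (i in 1:100) ex[[1]]@attr.value[1] <- 2

# ... and write it back into the slot a single time
b@examples <- ex
```

The loop here is the same kind of update as the `data[[1]]@attr.value[1] = 2` timing in the original message; only the final slot assignment touches the S4 object.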
>
> As before, an improvement is to think in terms of vectors, maybe a
> 'SupervisedExamples' class to act as a collection of examples
>
> setClass("SupervisedExamples",
>         representation=representation(
>           attr.value = "list",
>           target.value = "list"))
>
> setClass("StreamBuffer",
>         representation=representation(
>           examples="SupervisedExamples"))
>
> SupervisedExamples <-
>    function(attr.value=vector("list", n),
>             target.value=vector("list", n), n, ...)
> {
>    new("SupervisedExamples", attr.value=attr.value,
>        target.value=target.value, ...)
> }
>
> StreamBuffer <-
>    function(examples, ...)
> {
>    new("StreamBuffer", examples=examples, ...)
> }
>
> data <- SupervisedExamples(n=100000)
>
> b <- StreamBuffer(data)
>
> I then have
>
> > system.time({for (i in 1:100) data@attr.value[[1]] = 2 })
>   user  system elapsed
>  1.081   0.013   1.094
> > system.time({for (i in 1:100) b@examples@attr.value[[1]] <- 2})
>   user  system elapsed
>  4.283   0.000   4.295
>
> (note the 10x increase in size); still slower, but this will be amortized
> when the updates are vectorized, e.g.,
>
> > idx = sample(length(b@examples@attr.value), 100)
> > system.time(b@examples@attr.value[idx] <- list(2))
>   user  system elapsed
>  0.013   0.000   0.014
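A small sketch to confirm that the vectorized assignment gives the same object as updating the elements one at a time, assuming the 'SupervisedExamples' collection class from above. The names `b.loop` and `b.vec`, the seed, and the size `n = 1000` are illustrative assumptions.

```r
library(methods)

setClass("SupervisedExamples",
         representation(attr.value = "list", target.value = "list"))
setClass("StreamBuffer",
         representation(examples = "SupervisedExamples"))

n <- 1000  # smaller than in the thread, to keep the sketch quick
b.loop <- new("StreamBuffer",
              examples = new("SupervisedExamples",
                             attr.value = vector("list", n),
                             target.value = vector("list", n)))
b.vec <- b.loop

set.seed(1)
idx <- sample(n, 100)

# Element-wise: one nested slot replacement per index
for (i in idx) b.loop@examples@attr.value[[i]] <- 2

# Vectorized: a single nested slot replacement for all indices
b.vec@examples@attr.value[idx] <- list(2)

same <- identical(b.loop, b.vec)
```

Both forms end in the same state; the vectorized one pays the cost of replacing the slot once instead of once per element.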
>
> A further change might be to recognize 'StreamBuffer' as an abstract class
> that SupervisedExamples extends
>
> setClass("StreamBuffer",
>         representation=representation(
>           "VIRTUAL", max.len="integer"),
>         prototype=prototype(max.len=100000L),
>         validity=function(object) {
>             if (object@max.len < length(object))
>                 "too many elements"
>             else TRUE
>         })
>
> setMethod(length, "StreamBuffer", function(x) {
>    stop("'length' undefined on '", class(x), "'")
> })
>
> setClass("SupervisedExamples",
>         representation=representation(
>           attr.value = "list",
>           target.value = "list"),
>         contains="StreamBuffer")
>
> setMethod(length, "SupervisedExamples", function(x) {
>    length(x@attr.value)
> })
>
> SupervisedExamples <-
>    function(attr.value=vector("list", n),
>             target.value=vector("list", n), n, ...)
> {
>    new("SupervisedExamples", attr.value=attr.value,
>        target.value=target.value, ...)
> }
>
> data <- SupervisedExamples(n=100000)
>
> > system.time({for (i in 1:100) data@attr.value[[1]] = 2 })
>   user  system elapsed
>  1.043   0.014   1.061
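A minimal sketch of how the virtual base class and its validity check behave, assuming the definitions above. The `max.len` prototype is lowered to 3L here purely so that the failure case is easy to trigger, and the names `ok`, `too.big` and `virt` are illustrative.

```r
library(methods)

setClass("StreamBuffer",
         representation("VIRTUAL", max.len = "integer"),
         prototype = prototype(max.len = 3L),  # small limit, for illustration only
         validity = function(object) {
             if (object@max.len < length(object))
                 "too many elements"
             else TRUE
         })

setClass("SupervisedExamples",
         representation(attr.value = "list", target.value = "list"),
         contains = "StreamBuffer")

setMethod("length", "SupervisedExamples", function(x) {
    length(x@attr.value)
})

# Within the limit: the object is created
ok <- new("SupervisedExamples",
          attr.value = vector("list", 2), target.value = vector("list", 2))

# Over the limit: the inherited validity method rejects it
too.big <- tryCatch(
    new("SupervisedExamples",
        attr.value = vector("list", 5), target.value = vector("list", 5)),
    error = function(e) conditionMessage(e))

# The virtual class itself cannot be instantiated
virt <- tryCatch(new("StreamBuffer"),
                 error = function(e) conditionMessage(e))
```

The validity function lives on the base class but relies on the `length` method of the concrete subclass, so every class extending 'StreamBuffer' gets the size check once it defines `length`.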
>
> Martin Morgan
>
>>     user  system elapsed
>>   16.837   0.108  18.244
>>
>>  > system.time({for (i in 1:100) data[[1]]@attr.value[1] = 2 })
>>    user  system elapsed
>>   0.024   0.000   0.026
>>
>> ############################################################
>>
>>
>> 2011/9/10 Martin Morgan <mtmor...@fhcrc.org>
>>
>>
>>    On 09/10/2011 08:08 AM, André Rossi wrote:
>>
>>        Hi everybody!
>>
>>        I'm creating an object of an S4 class that has two slots:
>>        Listexamples, which is a list, and idx, which is an integer
>>        (see the code below).
>>
>>        Then I read a data.frame with 10000 (ten thousand) rows and 10
>>        columns, do some pre-processing and, basically, store each row
>>        as an element of a list in the slot Listexamples of the S4
>>        object. However, many operations after this take a considerable
>>        time.
>>
>>        Can anyone explain to me why this happens? Is it possible to
>>        speed up a script that deals with a large amount of data (it
>>        might be a data.frame or a list)?
>>
>>        Thank you,
>>
>>        André Rossi
>>
>>        setClass("Buffer",
>>             representation=representation(
>>                 Listexamples = "list",
>>                 idx = "integer"
>>             )
>>        )
>>
>>
>>    Hi André,
>>
>>    Can you provide a simpler and more reproducible example, for instance
>>
>>     > setClass("Buf", representation=representation(lst="list"))
>>    [1] "Buf"
>>     > b=new("Buf", lst=replicate(10000, list(10), simplify=FALSE))
>>     > system.time({ b@lst[[1]][[1]] = 2 })
>>       user  system elapsed
>>      0.005   0.000   0.005
>>
>>    Generally it sounds like you're modeling the rows as elements of
>>    Listexamples, but you're better served by modeling the columns
>>    (lst = replicate(10, integer(10000), simplify=FALSE), if all of your
>>    10 columns were integer-valued, for instance). Also, S4 is providing
>>    some measure of type safety, and you're undermining that by having
>>    your class contain a 'list'. I'd go after
>>
>>    setClass("Buffer",
>>             representation=representation(
>>               col1="integer",
>>               col2="character",
>>               col3="numeric"
>>               ## etc.
>>               ),
>>             validity=function(object) {
>>                 nms <- slotNames(object)
>>                 len <- sapply(nms, function(nm) length(slot(object, nm)))
>>                 if (1L != length(unique(len)))
>>                     "slots must all be of same length"
>>                 else TRUE
>>             })
>>
>>    Buffer <-
>>        function(col1, col2, col3, ...)
>>    {
>>        new("Buffer", col1=col1, col2=col2, col3=col3, ...)
>>    }
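A short usage sketch of the column-oriented class, assuming exactly the three slots shown above (the "etc." columns are left out). It illustrates the two kinds of protection mentioned: the slot types and the equal-length validity check. The object names `ok`, `bad` and `wrong.type` are illustrative.

```r
library(methods)

setClass("Buffer",
         representation(col1 = "integer",
                        col2 = "character",
                        col3 = "numeric"),
         validity = function(object) {
             nms <- slotNames(object)
             len <- sapply(nms, function(nm) length(slot(object, nm)))
             if (1L != length(unique(len)))
                 "slots must all be of same length"
             else TRUE
         })

# Three columns of equal length: accepted
ok <- new("Buffer", col1 = 1:3, col2 = c("a", "b", "c"),
          col3 = c(0.5, 1.5, 2.5))

# Columns of different lengths: rejected by the validity function
bad <- tryCatch(new("Buffer", col1 = 1:3, col2 = "a", col3 = 1),
                error = function(e) conditionMessage(e))

# A double vector in an integer slot: rejected by the slot type check
wrong.type <- tryCatch(new("Buffer", col1 = c(1.5, 2.5),
                           col2 = c("a", "b"), col3 = c(1, 2)),
                       error = function(e) conditionMessage(e))
```

Neither check is available when the data sit in a single untyped 'list' slot.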
>>
>>    Let's see where the inefficiencies are before deciding that this is
>>    an S4 issue.
>>
>>    Martin
>>
>>
>>
>>
>>    --
>>    Computational Biology
>>    Fred Hutchinson Cancer Research Center
>>    1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>>    Location: M1-B861
>>    Telephone: 206 667-2793
>>
>>
>>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
