Re: [R-pkg-devel] Fast Matrix Serialization in R?

Henrik Bengtsson Thu, 09 May 2024 17:31:43 -0700

On Thu, May 9, 2024 at 3:46 PM Simon Urbanek
<simon.urba...@r-project.org> wrote:
>
>
>
> > On 9/05/2024, at 11:58 PM, Vladimir Dergachev <volo...@mindspring.com> wrote:
> >
> >
> >
> > On Thu, 9 May 2024, Sameh Abdulah wrote:
> >
> >> Hi,
> >>
> >> I need to serialize and save a 20K x 20K matrix as a binary file. This 
> >> process is significantly slower in R compared to Python (4X slower).
> >>
> >> I'm not sure about the best approach to optimize the below code. Is it 
> >> possible to parallelize the serialization function to enhance performance?
> >
> > Parallelization should not help - a single CPU thread should be able to 
> > saturate your disk or your network, assuming you have a typical computer.
> >
> > The problem is possibly the conversion to text; writing it as binary should
> > be much faster.
> >
>
>
> FWIW serialize() is binary so there is no conversion to text:
>
> > serialize(1:10+0L, NULL)
>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>
> It uses the native representation so it is actually not as bad as it sounds.
>
> One aspect I forgot to mention in the earlier thread is that if you don't 
> need to exchange the serialized objects between machines with different 
> endianness then avoiding the swap makes it faster. E.g., on Intel (which is
> little-endian and thus needs swapping):
>
> > a=1:1e8/2
> > system.time(serialize(a, NULL))
>    user  system elapsed
>   2.123   0.468   2.661
> > system.time(serialize(a, NULL, xdr=FALSE))
>    user  system elapsed
>   0.393   0.348   0.742


Would it be worth looking into making xdr=FALSE the default? From
help("serialize"):

xdr: a logical: if a binary representation is used, should a
big-endian one (XDR) be used?
...
As almost all systems in current use are little-endian, xdr = FALSE
can be used to avoid byte-shuffling at both ends when transferring
data from one little-endian machine to another (or between processes
on the same machine). Depending on the system, this can speed up
serialization and unserialization by a factor of up to 3x.
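
A quick way to see what the help text describes is to compare the two
formats directly. This is only a minimal sketch: the vector is much smaller
than the one in Simon's timing so that it runs quickly, and the variable
names (x, s_xdr, s_native) are made up for illustration.

```r
# Small double vector; Simon's timing above used 1e8 elements.
x <- (1:1e6) / 2

s_xdr    <- serialize(x, NULL)               # default: xdr = TRUE (big-endian)
s_native <- serialize(x, NULL, xdr = FALSE)  # native byte order, no swapping

# The first two bytes are the format tag: "X\n" for XDR, "B\n" for native binary.
rawToChar(s_xdr[1:2])
rawToChar(s_native[1:2])

# unserialize() reads the format from the header, so both round-trip unchanged.
identical(unserialize(s_xdr), x)
identical(unserialize(s_native), x)
```

Wrapping the two serialize() calls in system.time(), as in Simon's example,
should show the difference on a little-endian machine; how large it is will
depend on the system.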

This seems like low-hanging fruit that could spare the world from
wasting CPU cycles.
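
For the use case that started this thread (saving a large matrix to a binary
file), something along these lines might be a starting point. It is a sketch
under assumptions, not the original poster's code: the matrix is a small
stand-in for the 20K x 20K one, and the names m, m2, and path are
hypothetical.

```r
# Small stand-in for the 20K x 20K matrix so the sketch runs quickly.
m <- matrix(rnorm(1000 * 1000), nrow = 1000)

path <- tempfile(fileext = ".bin")  # hypothetical output location

# Serialize straight to a binary file connection in native byte order.
con <- file(path, "wb")
serialize(m, con, xdr = FALSE)
close(con)

# Read it back; unserialize() recognizes the format from the header.
con <- file(path, "rb")
m2 <- unserialize(con)
close(con)

identical(m, m2)
unlink(path)
```

The resulting file is only meant to be read back on a machine with the same
endianness, which is the trade-off the help text describes.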

/Henrik



>
> Cheers,
> Simon
>
> ______________________________________________
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

