On Thu, May 9, 2024 at 3:46 PM Simon Urbanek <simon.urba...@r-project.org> wrote:
>
> > On 9/05/2024, at 11:58 PM, Vladimir Dergachev <volo...@mindspring.com> wrote:
> >
> > On Thu, 9 May 2024, Sameh Abdulah wrote:
> >
> >> Hi,
> >>
> >> I need to serialize and save a 20K x 20K matrix as a binary file. This
> >> process is significantly slower in R compared to Python (4X slower).
> >>
> >> I'm not sure about the best approach to optimize the below code. Is it
> >> possible to parallelize the serialization function to enhance performance?
> >
> > Parallelization should not help - a single CPU thread should be able to
> > saturate your disk or your network, assuming you have a typical computer.
> >
> > The problem is possibly the conversion to text, writing it as binary should
> > be much faster.
> >
>
> FWIW serialize() is binary so there is no conversion to text:
>
> > serialize(1:10+0L, NULL)
>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>
> It uses the native representation so it is actually not as bad as it sounds.
>
> One aspect I forgot to mention in the earlier thread is that if you don't
> need to exchange the serialized objects between machines with different
> endianness then avoiding the swap makes it faster. E.g, on Intel (which is
> little-endian and thus needs swapping):
>
> > a=1:1e8/2
> > system.time(serialize(a, NULL))
>    user  system elapsed
>   2.123   0.468   2.661
> > system.time(serialize(a, NULL, xdr=FALSE))
>    user  system elapsed
>   0.393   0.348   0.742
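
A minimal sketch of how this could apply to the original use case of
writing a large numeric matrix to a binary file. The matrix size, the
tempfile() path and the variable names are placeholders rather than
anything from the original post, and xdr = FALSE assumes the file is
read back on a machine with the same endianness:

```r
# Smaller stand-in for the 20K x 20K matrix (hypothetical size)
n <- 2000L
m <- matrix(runif(n * n), nrow = n, ncol = n)

f <- tempfile(fileext = ".bin")  # placeholder path

# Serialize straight to a binary file connection, skipping the
# byte swap to big-endian (XDR) on little-endian machines
con <- file(f, "wb")
serialize(m, con, xdr = FALSE)
close(con)

# Read it back; unserialize() detects the format from the header
con <- file(f, "rb")
m2 <- unserialize(con)
close(con)

# The round trip should reproduce the matrix exactly
stopifnot(identical(m, m2))
```

Whether this closes the gap to Python will depend on the disk and the
rest of the original code, which is not shown here.
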

Would it be worth looking into making xdr=FALSE the default? From
help("serialize"):

    xdr: a logical: if a binary representation is used, should a
         big-endian one (XDR) be used?
    ...
    As almost all systems in current use are little-endian, xdr = FALSE
    can be used to avoid byte-shuffling at both ends when transferring
    data from one little-endian machine to another (or between processes
    on the same machine). Depending on the system, this can speed up
    serialization and unserialization by a factor of up to 3x.

This seems like a low-hanging fruit that could spare the world from
wasting unnecessary CPU cycles.

/Henrik

>
> Cheers,
> Simon
>
> ______________________________________________
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel