On 2019-06-28 15:26, Duncan Murdoch wrote:
On 28/06/2019 7:35 a.m., Göran Broström wrote:
Hello,

I have two large data frames, 'liss' (170 million obs, 8 variables) and
'fobb' (52 million obs, 8 variables, same as for 'liss'), and checking
their sizes I get

  > object.size(liss)
7477492552 bytes
  > object.size(fobb)
2494591736 bytes

Fair enough, but when I save them to disk (saveRDS), the size relation
is reversed: 'fobb.rds' takes up 273 MB while 'liss.rds' uses 146 MB!

I was puzzled by this and thought that I had made a mistake in creating
them, but the only explanation I can find for this is that 'liss'
contains a lot more missing values.

saveRDS() uses compression by default.  Compression works best if there are a lot of repetitive values; every NA is the same, so that would help compression.  Other values may also be repeated.

If you use saveRDS(compress=FALSE), you'll get much larger results, probably roughly proportional to the object.size() results.

Almost equal to the object.size() results: the differences are 2167 bytes and 2171 bytes, respectively (the files on disk are the smaller ones). Thanks for the explanation!
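
For anyone who wants to reproduce the effect on a small scale, here is a minimal sketch. The data are made up for illustration (one column that is mostly NA, one with all-distinct random values), and the helper name rds_size() is hypothetical, not part of R. It only assumes the documented 'compress' argument of saveRDS().

```r
# Toy data (an assumption for this sketch, not the 'liss'/'fobb' data):
# two numeric columns of equal length, one mostly NA, one all random.
set.seed(1)
n <- 1e5
mostly_na <- rep(NA_real_, n)
mostly_na[sample(n, 1000)] <- rnorm(1000)
all_random <- rnorm(n)

df_na   <- data.frame(x = mostly_na)
df_rand <- data.frame(x = all_random)

# Hypothetical helper: save to a temporary file and return its size in bytes.
rds_size <- function(obj, compress = TRUE) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))              # remove the temporary file afterwards
  saveRDS(obj, f, compress = compress)
  file.size(f)
}

sizes <- data.frame(
  object       = c("df_na", "df_rand"),
  in_memory    = c(as.numeric(object.size(df_na)),
                   as.numeric(object.size(df_rand))),
  compressed   = c(rds_size(df_na), rds_size(df_rand)),
  uncompressed = c(rds_size(df_na, compress = FALSE),
                   rds_size(df_rand, compress = FALSE))
)
print(sizes)
```

Both objects occupy the same amount of memory, but the mostly-NA one should give a much smaller compressed file, while the uncompressed files should come out close to the object.size() figures. Exact byte counts will vary with the R version and platform.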

Göran


Duncan Murdoch

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
