On 2019-06-28 15:26, Duncan Murdoch wrote:
On 28/06/2019 7:35 a.m., Göran Broström wrote:
Hello,
I have two large data frames, 'liss' (170 million obs, 8 variables) and
'fobb' (52 million obs, 8 variables, same as for 'liss'), and checking
their sizes I get
> object.size(liss)
7477492552 bytes
> object.size(fobb)
2494591736 bytes
Fair enough, but when I save them to disk (saveRDS), the size relation
is reversed: 'fobb.rds' takes up 273 MB while 'liss.rds' uses 146 MB!
I was puzzled by this and at first thought that I had made a mistake in
creating them, but the only explanation I can find is that 'liss'
contains many more missing values.
saveRDS() uses compression by default. Compression works best if there
are a lot of repetitive values; every NA is the same, so that would help
compression. Other values may also be repeated.
If you call saveRDS() with compress = FALSE, you'll get much larger files,
with sizes probably roughly proportional to the object.size() results.
Almost equal to the object.size() results: the uncompressed files are 2167
bytes and 2171 bytes smaller on disk, respectively. Thanks for the explanation!
Göran
Duncan Murdoch
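
The effect discussed in this thread can be reproduced on a small scale. The following is only a sketch on made-up data: the names 'big' and 'small', the row counts, and the share of NAs are assumptions chosen for illustration and are not taken from 'liss' or 'fobb'. It builds a larger data frame that is mostly NA and a smaller one filled with random values, then compares object.size() with the size of the files written by saveRDS(), with and without compression.

```r
# Toy reproduction on hypothetical data (far smaller than the data in the thread).
set.seed(1)

n_big   <- 200000
n_small <- 60000

# 'big' has more rows but is almost entirely NA;
# 'small' has fewer rows, all of them distinct random values.
big <- data.frame(x = rep(NA_real_, n_big), y = rep(NA_real_, n_big))
big$x[sample(n_big, 1000)] <- rnorm(1000)
small <- data.frame(x = rnorm(n_small), y = rnorm(n_small))

f_big   <- tempfile(fileext = ".rds")
f_small <- tempfile(fileext = ".rds")
f_raw   <- tempfile(fileext = ".rds")

saveRDS(big, f_big)                    # default: compress = TRUE
saveRDS(small, f_small)                # default: compress = TRUE
saveRDS(big, f_raw, compress = FALSE)  # no compression

# Share of missing values in each data frame.
na_share <- c(big = mean(is.na(big)), small = mean(is.na(small)))

# In-memory sizes and on-disk sizes, all in bytes.
sizes <- c(
  mem_big    = as.numeric(object.size(big)),
  mem_small  = as.numeric(object.size(small)),
  disk_big   = file.size(f_big),
  disk_small = file.size(f_small),
  raw_big    = file.size(f_raw)
)

print(na_share)
print(sizes)
```

With data like this, 'big' should be the larger object in memory but the smaller compressed file, while its uncompressed file should be close to its object.size(). The exact numbers depend on the data and the R version.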
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help