On 2025-08-14 7:27 a.m., Stefano Sofia via R-help wrote:
Dear R-list users,
let me ask you a very general question about the performance of large data frames.
I deal with half-hourly meteorological data from about 70 sensors over 28
winter seasons.
This means that for each sensor I have 48 readings per day and 181 days per
winter season (182 in a leap year): 48 * 181 * 28 = 243,264 readings per sensor,
and 243,264 * 70 = 17,028,480 readings in total.
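In R, as a quick check of these counts (ignoring the extra leap-year day):

    days_per_season <- 181
    per_sensor <- 48 * days_per_season * 28   # 243,264 readings per sensor
    per_sensor * 70                           # 17,028,480 readings over all sensors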
From a computational point of view, is it better to deal with (1) a single data
frame of approximately 17 M rows and 3 columns (one for the date, one for the
sensor code and one for the value), (2) a single data frame of approximately
243,000 rows and 141 columns, or (3) 70 separate data frames of approximately
243,000 rows and 3 columns each? Or does it make no difference?
I would personally prefer the first option, because it would be easier for me
to deal with a single data frame with only a few columns.
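To make the options concrete, here is a toy-sized sketch of the three layouts
(one day shown; the sensor codes, dates and values are invented just to
illustrate the shapes):

    sensors <- sprintf("S%02d", 1:70)
    dates   <- seq(as.POSIXct("1997-11-01 00:00"), by = "30 min", length.out = 48)

    # (1) one "long" data frame: date, sensor code, value
    long <- data.frame(date   = rep(dates, times = length(sensors)),
                       sensor = rep(sensors, each = length(dates)),
                       value  = rnorm(length(dates) * length(sensors)))

    # (2) one "wide" data frame: one row per timestamp, one column per sensor
    wide <- data.frame(date = dates,
                       matrix(rnorm(length(dates) * length(sensors)),
                              ncol = length(sensors),
                              dimnames = list(NULL, sensors)))

    # (3) a list of 70 per-sensor data frames
    by_sensor <- split(long[, c("date", "value")], long$sensor)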
It really depends on what computations you're doing. As a general rule,
column operations are faster than row operations. (Also as a general
rule, arrays are faster than data frames, but are much more limited in
what they can hold: all entries must be the same type, which probably
won't work for your data.)
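One way to see the effect (sizes picked arbitrarily; timings will differ on
your machine):

    set.seed(1)
    m  <- matrix(rnorm(1e7), ncol = 100)   # 100,000 x 100 numeric array
    df <- as.data.frame(m)
    system.time(colMeans(m))               # array, column-wise: fastest
    system.time(colMeans(df))              # data frame, column-wise
    system.time(apply(df, 1, mean))        # data frame, row by row: much slower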
So I'd guess your 3-column solution would likely be best.
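With that layout, per-sensor computations also stay one-liners, for instance
(using the toy "long" data frame sketched in the question above):

    # mean value per sensor on the long-format data frame
    tapply(long$value, long$sensor, mean)
    # the same, returned as a data frame
    aggregate(value ~ sensor, data = long, FUN = mean)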
Duncan Murdoch