On 2025-08-14 7:27 a.m., Stefano Sofia via R-help wrote:
Dear R-list users,

let me ask you a very general question about the performance of large data frames.

I deal with half-hourly meteorological data from about 70 sensors over 28
winter seasons.


This means that for each sensor I have 48 observations per day and 181 days per
winter season (182 in a leap year): 48 * 181 * 28 = 243,264

243,264 * 70 = 17,028,480
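
As a quick sanity check in R (ignoring the extra day in leap seasons):

    48 * 181            # 8688 half-hourly observations per sensor per season
    8688 * 28           # 243264 rows per sensor over 28 seasons
    243264 * 70         # 17028480 rows in the single long data frame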


From a computational point of view, is it better to deal with a single data
frame of approximately 17 M rows and 3 columns (one for the date, one for the
sensor code and one for the value); with a single data frame of approximately
243,000 rows and 141 columns; or with 70 separate data frames of approximately
243,000 rows and 3 columns each? Or does it make no difference?

Personally I would prefer the first choice, because it would be easier for me
to work with a single data frame and only a few columns.
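
For concreteness, here is a minimal sketch of that first layout in base R. The
column names (date, sensor, value), the sensor codes and the season start date
are illustrative, not from the original post:

    ## half-hourly timestamps for one winter season (start date is made up)
    timestamps <- seq(from = as.POSIXct("2024-11-01 00:00", tz = "UTC"),
                      by = "30 min", length.out = 48 * 181)
    sensor_codes <- sprintf("S%02d", 1:70)

    ## long format: one row per (timestamp, sensor) pair
    df_long <- data.frame(
      date   = rep(timestamps, times = length(sensor_codes)),
      sensor = rep(sensor_codes, each = length(timestamps)),
      value  = rnorm(48 * 181 * 70)   # placeholder measurements
    )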


It really depends on what computations you're doing. As a general rule, column operations are faster than row operations. (Also as a general rule, arrays are faster than data frames, but they are much more limited in what they can hold: all entries must be the same type, which probably won't work for your data.)

So I'd guess your 3-column solution would likely be best.
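
To illustrate the column-versus-row point, a small sketch reusing the
illustrative df_long from above (actual timings will depend on your data and
machine):

    ## grouped operation on the long format: one vectorised pass per group
    means_by_sensor <- tapply(df_long$value, df_long$sensor, mean)

    ## the same values as a wide numeric matrix (one column per sensor)
    m <- matrix(df_long$value, nrow = length(timestamps),
                dimnames = list(NULL, sensor_codes))
    colMeans(m)         # column-wise: fast, works on contiguous memory
    apply(m, 1, mean)   # row-wise: typically much slower on large data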

Duncan Murdoch
