Re: [R] About size of data frames

Rui Barradas Thu, 14 Aug 2025 10:54:57 -0700

On 8/14/2025 12:27 PM, Stefano Sofia via R-help wrote:

Dear R-list users,


let me ask you a very general question about performance of big data frames.

I work with half-hourly meteorological data from about 70 sensors over 28 
winter seasons.


It means that for each sensor I have 48 observations per day and 181 days per 
winter season (182 in leap years): 48 * 181 * 28 = 234,576

234,576 * 70 = 16420320


From the computational point of view, is it better to deal with a single data 
frame of approximately 16.5 M rows and 3 columns (one for date, one for sensor 
code and one for value), a single data frame of approximately 235,000 rows 
and 141 columns, or 70 different data frames of approximately 235,000 rows and 
3 columns each? Or doesn't it make any difference?

I personally would prefer the first choice, because it would be easier for me 
to deal with a single data frame and few columns.


Thank you for your usual help

Stefano


          (oo)
--oOO--( )--OOo--------------------------------------
Stefano Sofia MSc, PhD
Civil Protection Department - Marche Region - Italy
Meteo Section
Snow Section
Via Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo---------oO----------------------------------------

________________________________

IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages to clients of Regione Marche may contain information that is 
confidential and legally privileged. Please do not read, copy, forward, or 
store this message unless you are an intended recipient of it. If you have 
received this message in error, please forward it to the sender and delete it 
completely from your computer system.



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Hello,

First of all, 48 * 181 * 28 = 243,264, not 234,576.
And 243,264 * 70 = 17,028,480.
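
The corrected counts can be checked directly in R. This is only a quick sketch with illustrative variable names, and it ignores the extra day in leap-year seasons mentioned in the question.

```r
n_per_day <- 48    # half-hourly observations per day
n_days    <- 181   # days per winter season (non-leap)
n_seasons <- 28
n_sensors <- 70

rows_per_sensor <- n_per_day * n_days * n_seasons   # 243264
rows_total      <- rows_per_sensor * n_sensors      # 17028480

# print with thousands separators for easier reading
format(c(rows_per_sensor, rows_total), big.mark = ",")
```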

As for the question, why don't you try it with smaller data sets?

In the test below I have used the sizes you posted, and the many-columns (wide) format is fastest, followed by the list of data frames, then the 4-column (long) format.

The long format has 4 columns: sensor, day, season and data.

And the wide format df is only 72 columns wide: one for day, one for season and one for each sensor.
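
To make the shapes concrete, here is a small sketch that derives the dimensions of the three layouts from the counts alone, without building any data. The object names (grid_rows, layout_dims, cells) are only illustrative, and it assumes every sensor has a complete record with no leap-year days.

```r
grid_rows <- 48 * 181 * 28   # one row per half-hourly reading, day and season
n_sensors <- 70

layout_dims <- rbind(
  long      = c(rows = grid_rows * n_sensors, cols = 4),  # sensor, day, season, data
  wide      = c(rows = grid_rows, cols = 2 + n_sensors),  # day, season, one column per sensor
  list_item = c(rows = grid_rows, cols = 4)               # one of the 70 data frames in the list
)
layout_dims

# number of stored cells in the long and wide layouts
cells <- layout_dims[, "rows"] * layout_dims[, "cols"]
cells[c("long", "wide")]
```

The long layout repeats day, season and sensor for every value, so it stores roughly four times as many cells as the wide one. That may be part of why it is the slowest in this test, although I have not verified that.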

The test computes mean values aggregated by day and season. When the data is in the long format, the aggregation must also include the sensor, so there is an extra aggregation column.

The test is very simple; real results will probably depend on the functions you want to apply to the data.




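# create the test data

# long format: add a column of simulated readings and a sensor id
makeDataLong <- function(sensor, x) {
  x[["data"]] <- rnorm(nrow(x))  # use the argument x, not the global df1
  cbind.data.frame(sensor, x)
}

# wide format: add one column of simulated readings named after the sensor
makeDataWide <- function(sensor, x) {
  x[[sensor]] <- rnorm(nrow(x))
  x
}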

set.seed(2025)

n_per_day <- 48
n_days <- 181
n_seasons <- 28
n_sensors <- 70

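# each day appears 48 times, once per half-hourly reading
day <- rep(1:n_days, each = n_per_day)
season <- 1:n_seasons
sensor_names <- sprintf("sensor_%02d", 1:n_sensors)
# one row per reading and season: 48 * 181 * 28 = 243,264 rows
df1 <- expand.grid(day = day, season = season, KEEP.OUT.ATTRS = FALSE)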

df_list <- lapply(1:n_sensors, makeDataLong, x = df1)
names(df_list) <- sensor_names

# the `_` placeholder of the native pipe needs R >= 4.2.0
df_long <- lapply(1:n_sensors, makeDataLong, x = df1) |> do.call(rbind, args = _)

# wide format: start from the day/season grid and add one column per sensor
df_wide <- df1
for (s in sensor_names) {
  df_wide <- makeDataWide(s, df_wide)
}


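# test functions
# f: one sensor's data frame (applied to each element of the list)
# g: long format, with sensor as an extra grouping column
# h: wide format, all sensor columns aggregated in one call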
f <- function(x) aggregate(data ~ season + day, data = x, mean)
g <- function(x) aggregate(data ~ sensor + season + day, data = x, mean)
h <- function(x) aggregate(. ~ season + day, x, mean)

# timings (needs the bench package)
# check = FALSE because the three results have different shapes
bench::mark(
  list_base = lapply(df_list, f),
  long_base = g(df_long),
  wide_base = h(df_wide),
  check = FALSE
)
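
Since check = FALSE skips the comparison of results, it may be worth confirming on a scaled-down data set that the long and wide aggregations give the same means. This is a minimal sketch. The small dimensions and the object names (small_long, small_wide, agg_long, agg_wide) are made up for illustration and are not part of the test above.

```r
set.seed(1)

# scaled-down, made-up dimensions
n_per_day <- 4; n_days <- 3; n_seasons <- 2; n_sensors <- 2

grid <- expand.grid(day = rep(seq_len(n_days), each = n_per_day),
                    season = seq_len(n_seasons),
                    KEEP.OUT.ATTRS = FALSE)

# one column of simulated readings per sensor, shared by both layouts
values <- matrix(rnorm(nrow(grid) * n_sensors), ncol = n_sensors)
colnames(values) <- sprintf("sensor_%02d", seq_len(n_sensors))

small_wide <- cbind(grid, as.data.frame(values))
small_long <- data.frame(
  sensor = rep(colnames(values), each = nrow(grid)),
  day    = rep(grid$day, times = n_sensors),
  season = rep(grid$season, times = n_sensors),
  data   = as.vector(values)   # column-major: sensor_01 first, then sensor_02
)

agg_wide <- aggregate(. ~ season + day, data = small_wide, FUN = mean)
agg_long <- aggregate(data ~ sensor + season + day, data = small_long, FUN = mean)

# compare the means of the first sensor after sorting both results the same way
long_s1  <- agg_long[agg_long$sensor == "sensor_01", ]
long_s1  <- long_s1[order(long_s1$day, long_s1$season), ]
agg_wide <- agg_wide[order(agg_wide$day, agg_wide$season), ]

same <- isTRUE(all.equal(long_s1$data, agg_wide$sensor_01))
same
```

If same is TRUE, the two layouts agree for that sensor and the timings compare equivalent work. The same comparison could be repeated for the other sensor columns.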



Hope this helps,

Rui Barradas

