Re: [R] aggregating data with quality control

Rui Barradas Sat, 31 Aug 2024 06:41:48 -0700

At 12:15 on 31/08/2024, Stefano Sofia wrote:

Dear R-list users,


I deal with half-hourly data from automatic meteorological stations.

They have to pass a manual validation; suppose that status = "C" stands for correct and 
status = "D" for discarded.

Here is a simple example with "Snow height" (HS):


mydf <- data.frame(data_POSIX=seq(as.POSIXct("2024-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz="Etc/GMT-1"), 
as.POSIXct("2024-01-02 23:30:00", format = "%Y-%m-%d %H:%M:%S", tz="Etc/GMT-1"), by="30 min"))

mydf$hs <- round(runif(96, 0, 100))

mydf$status <- c(rep("C", 50), "S", rep("C", 45))


Evaluating the daily mean independently of the status is very easy:

aggregate(mydf$hs, by=list(format(mydf$data_POSIX, "%Y"), format(mydf$data_POSIX, "%m"), 
format(mydf$data_POSIX, "%d")), my.mean)


Things become more complicated when I also need to export the status: it should be "C" when all 48 values 
have status "C", and "D" when at least one value has status "D".


I have no idea how to do that efficiently.

Could some of you give me some clues on how to do that?


Thank you for your usual support

Stefano Sofia


          (oo)
--oOO--( )--OOo--------------------------------------
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo---------oO----------------------------------------



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hello,

The aggregate.formula method has a subset argument that you can use to
extract only the rows matching a condition. The condition below checks whether
there is any "D" and, if so, aggregates only those rows; otherwise all rows
are kept. I create a variable subset_condition in order to make the code more
readable.
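
As a minimal illustration of how the subset argument behaves, here is a sketch on a toy data frame. The names toy, g, x and flag are made up for this example and do not come from the original data; the expression given to subset is evaluated inside the data frame, so only the rows where it is TRUE enter the aggregation.

```r
# Toy data (hypothetical names): two groups, one row flagged "D"
toy <- data.frame(
  g    = c("a", "a", "b", "b"),
  x    = c(1, 3, 5, 7),
  flag = c("C", "D", "C", "C")
)

# Keep only the rows with flag == "C" before computing the group means
res <- aggregate(x ~ g, data = toy, FUN = mean, subset = flag == "C")
res
# group "a" uses only x = 1; group "b" uses x = 5 and x = 7
```

The flagged row is dropped before the mean is taken, so the mean of group "a" is 1 rather than 2.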


First, data with no "D".


set.seed(2024)

mydf <- data.frame(data_POSIX = seq(
  as.POSIXct("2024-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  as.POSIXct("2024-01-02 23:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-1"),
  by = "30 min"))

mydf$hs <- round(runif(96, 0, 100))
mydf$status <- c(rep("C", 50), "S", rep("C", 45))

my.mean <- function(x, na.rm = TRUE) mean(x, na.rm = na.rm)

aggregate(hs ~ format(mydf$data_POSIX, "%Y-%m-%d"), mydf, my.mean)
#>   format(mydf$data_POSIX, "%Y-%m-%d")       hs
#> 1                          2024-01-01 52.37500
#> 2                          2024-01-02 45.64583

subset_condition <- if(any(mydf$status == "D")) mydf$status == "D" else TRUE

aggregate(hs ~ format(mydf$data_POSIX, "%Y-%m-%d") + status, mydf, my.mean,
          subset = subset_condition)

#>   format(mydf$data_POSIX, "%Y-%m-%d") status       hs
#> 1                          2024-01-01      C 52.37500
#> 2                          2024-01-02      C 46.48936
#> 3                          2024-01-02      S  6.00000



Now data with "D"'s.


my.mean <- function(x, na.rm = TRUE) mean(x, na.rm = na.rm)

status_with_D <- sample(c('C', 'D'), 45, TRUE, c(.9, .1))
mydf$status <- c(rep("C", 50), "S", status_with_D)

subset_condition <- if(any(mydf$status == "D")) mydf$status == "D" else TRUE

aggregate(hs ~ format(data_POSIX, "%Y-%m-%d") + status, mydf, my.mean,
          subset = subset_condition)

#>   format(data_POSIX, "%Y-%m-%d") status   hs
#> 1                     2024-01-02      D 51.2

# the formats in the OP, but extracted from the date/time and used in the
# formula that follows.

year <- format(mydf$data_POSIX, "%Y")
month <- format(mydf$data_POSIX, "%m")
day <- format(mydf$data_POSIX, "%d")

aggregate(hs ~ year + month + day, mydf, my.mean)
#>   year month day       hs
#> 1 2024    01  01 52.37500
#> 2 2024    01  02 45.64583

aggregate(hs ~ year + month + day + status, mydf, my.mean,
          subset = subset_condition)

#>   year month day status   hs
#> 1 2024    01  02      D 51.2
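
If what is wanted is one row per day with the daily mean plus a single day-level status ("C" only when every value of that day is "C", "D" otherwise), one possible way is to aggregate hs and status separately and merge the two results. The following is only a sketch under some assumptions: the helper day_status is a made-up name, the example status vector contains one "D" on the second day (the data above uses "S" there), and any status other than "C" is treated as making the day "D".

```r
set.seed(2024)

# Same layout as the example data: two days of half-hourly values
mydf <- data.frame(data_POSIX = seq(
  as.POSIXct("2024-01-01 00:00:00", tz = "Etc/GMT-1"),
  as.POSIXct("2024-01-02 23:30:00", tz = "Etc/GMT-1"),
  by = "30 min"))
mydf$hs <- round(runif(96, 0, 100))

# Assumption for this sketch: a single discarded value on the second day
mydf$status <- c(rep("C", 50), "D", rep("C", 45))

# Day label stored as a column so both formulas can refer to it
mydf$day <- format(mydf$data_POSIX, "%Y-%m-%d")

# Hypothetical helper: "C" only if every status in the group is "C"
day_status <- function(s) if (all(s == "C")) "C" else "D"

daily_mean   <- aggregate(hs ~ day, mydf, mean)
daily_status <- aggregate(status ~ day, mydf, day_status)

# One row per day: mean of all values plus the day-level status
daily <- merge(daily_mean, daily_status, by = "day")
daily
# status is "C" for 2024-01-01 and "D" for 2024-01-02
```

Note that the mean here is computed over all 48 values of the day, including the discarded one. To exclude discarded values from the mean, the first call could use subset = status == "C".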



Hope this helps,

Rui Barradas


