Maybe ask on the HPC list? [1]

A general tip: you may be running out of memory. If at all possible, extract the 
data subsets in the parent process and limit the amount of environment data 
passed into the child processes. That is, instead of using an integer counter to 
index into the modtabs list, iterate over the list itself so that each child 
process doesn't need a copy of the whole data set.
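
For example, something along these lines (a rough sketch against the names in 
your code below -- modtabs, crewjanitormakeclean, numcores):

    # hand each worker one element of the list, rather than an index it has
    # to look up in the big list inherited from the parent environment
    output <- mclapply(
      modtabs,
      function(tab) crewjanitormakeclean(tab, c("string_columns_1", "string_columns_2")),
      mc.cores = numcores
    )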

I know that fork lets the parent and child processes share memory blocks 
temporarily (copy-on-write), but it is fragile: as soon as you modify the data, 
the sharing stops for that memory block and memory use can explode.
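
In practice that means the split into per-ID data frames should happen once in 
the parent (as you already do with split()), and each child should only read its 
element and return a new, small result, never modify the shared objects. A 
sketch of that pattern, assuming your keeptabs list and helper functions (how 
the cleaning and pivot steps are meant to chain is my guess, so adjust):

    output <- mclapply(
      keeptabs,
      function(tab) {
        # read-only on the inherited data; build and return a fresh, small object
        tab |>
          crewjanitormakeclean(c("string_columns_1", "string_columns_2")) |>
          mass_pivot_wider("string_columns_1", "col_1_name_")
      },
      mc.cores = numcores
    )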

[1] <https://stat.ethz.ch/mailman/listinfo/r-sig-hpc>

On December 10, 2024 4:20:28 PM PST, "Deramus, Thomas Patrick" 
<tdera...@mgb.org> wrote:
>Hi R users.
>
>Apologies for the lack of concrete examples; the dataset is large, and I 
>believe the size itself is the issue.
>
>I have multiple, very large datasets for which I need to generate 0/1 
>absence/presence columns. Some have over 200M rows, with two string columns 
>that each need presence/absence columns based on the strings they contain; as 
>an example, one column has ~29k unique values and the other ~15k unique values 
>(no overlap between the two).
>
>Using a combination of custom functions:
>
>crewjanitormakeclean <- function(df, columns) {
>  # all_of() selects by the character vector of column names
>  df <- df |> mutate(across(all_of(columns), ~ make_clean_names(.x, allow_dupes = TRUE)))
>  return(df)
>}
>
>mass_pivot_wider <- function(df, column, prefix) {
>  df <- df |> distinct() |> mutate(n = 1) |>
>    pivot_wider(names_from = all_of(column), values_from = n,
>                names_prefix = prefix, values_fill = list(n = 0))
>  return(df)
>}
>
>sum_group_function <- function(df) {
>  df <- df |> group_by(ID_Key) |>
>    summarise(across(c(starts_with("column1_name_"), starts_with("column2_name_")),
>                     ~ sum(.x, na.rm = TRUE))) |> ungroup()
>  return(df)
>}
>
>and splitting the data into a list of ~110k individual data frames based on 
>ID_Key:
>
>temp <-
>  open_dataset(
>    sources = input_files,
>    format = 'csv',
>    unify_schemas = TRUE,
>    col_types = schema(
>      "ID_Key" = string(),
>      "column1" = string(),
>      "column2" = string()
>    )
>  ) |> as_tibble()
>
>
>  keeptabs <- split(temp, temp$ID_Key)
>
>
>I used a multicore framework to distribute these functions across each ID_Key 
>when a multicore argument is enabled.
>
>      
>    if (isTRUE(multicore)) {
>      output <- mclapply(1:length(modtabs), function(i)
>        crewjanitormakeclean(modtabs[[i]], c("string_columns_2", "string_columns_1")),
>        mc.cores = numcores)
>      output <- mclapply(1:length(modtabs), function(i)
>        mass_pivot_wider(modtabs[[i]], "string_columns_1", "col_1_name_"),
>        mc.cores = numcores)
>      output <- mclapply(1:length(modtabs), function(i)
>        mass_pivot_wider(modtabs[[i]], "string_columns_2", "col_2_name_"),
>        mc.cores = numcores)
>    } else {
>      output <- lapply(1:length(modtabs), function(i)
>        crewjanitormakeclean(modtabs[[i]], c("string_columns_2", "string_columns_1")))
>      output <- lapply(1:length(modtabs), function(i)
>        mass_pivot_wider(modtabs[[i]], "string_columns_1", "col_1_name_"))
>      output <- lapply(1:length(modtabs), function(i)
>        mass_pivot_wider(modtabs[[i]], "string_columns_2", "col_2_name_"))
>    }
>
>
>I then collapse every ID_Key to a single row and row-bind the data, creating 
>new columns for the differences across ID_Keys that come out of the pivot, 
>using the following solution (78 upvotes at the time of this email):
>https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
>
>  allNms <- unique(unlist(lapply(keeptabs, names)))
>
>  output <- do.call(rbind,
>                    c(lapply(keeptabs,
>                             function(x) data.frame(c(x, sapply(setdiff(allNms, names(x)),
>                                                                function(y) NA))) |> as_tibble()),
>                      make.row.names = FALSE)) |>
>    mutate(across(c(starts_with("column1_name_"), starts_with("column2_name_")),
>                  ~ coalesce(.x, 0)))
>
>However, I have noticed that the jobs seem to "hang" after a while: the initial 
>30 or so processes (numcores == 30 in the workflow, equal to the number of 
>cores I reserved) sit at 100% of the requested CPU, then several "zombie" 
>processes appear, the cores drop to 0% and never proceed, and the run usually 
>dies with a timeout or a "not all jobs ran to completion" / failed-to-join 
>error of some kind.
>
>This happens in both base R and RStudio, and I haven't been able to figure out 
>if it's something wrong with the code, the size of the data, or our 
>architecture, but I would appreciate any suggestions as to what I might be 
>able to do about this.
>
>Before they are suggested: I have also tried this same approach with the 
>foreach, snow, future, and furrr packages, and base parallel with mclapply 
>seems to be the only thing that works for at least one dataset.
>
>In the event it has something to do with our architecture, here is what we are 
>running on and our loaded packages:
>
>OS Information:
>NAME="Red Hat Enterprise Linux"
>VERSION="9.3 (Plow)"
>ID="rhel"
>ID_LIKE="fedora"
>VERSION_ID="9.3"
>PLATFORM_ID="platform:el9"
>PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
>ANSI_COLOR="0;31"
>LOGO="fedora-logo-icon"
>CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
>HOME_URL="https://www.redhat.com/"
>DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
>BUG_REPORT_URL="https://bugzilla.redhat.com/"
>REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
>REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
>REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
>REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
>Operating System: Red Hat Enterprise Linux 9.3 (Plow)
>     CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
>          Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
>    Architecture: x86-64
> Hardware Vendor: Dell Inc.
>  Hardware Model: PowerEdge R840
>Firmware Version: 2.15.1
>
>R Version:
>R.Version()
>$platform
>[1] "x86_64-pc-linux-gnu"
>$arch
>[1] "x86_64"
>$os
>[1] "linux-gnu"
>$system
>[1] "x86_64, linux-gnu"
>$status
>[1] ""
>$major
>[1] "4"
>$minor
>[1] "3.2"
>$year
>[1] "2023"
>$month
>[1] "10"
>$day
>[1] "31"
>$`svn rev`
>[1] "85441"
>$language
>[1] "R"
>$version.string
>[1] "R version 4.3.2 (2023-10-31)"
>$nickname
>[1] "Eye Holes"
>
>RStudio Server Version:
>RStudio 2023.09.1+494 "Desert Sunflower" Release 
>(cd7011dce393115d3a7c3db799dda4b1c7e88711, 2023-10-16) for RHEL 9
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like 
>Gecko) Chrome/130.0.0.0 Safari/537.36
>
>PPM Repo:
>https://packagemanager.posit.co/cran/__linux__/rhel9/latest
>
>attached base packages:
>[1] parallel  stats     graphics  grDevices datasets  utils     methods   base
>
>other attached packages:
> [1] listenv_0.9.1        microbenchmark_1.5.0 dbplyr_2.4.0         duckplyr_0.4.1       readxl_1.4.3         fastDummies_1.7.3
> [7] glue_1.8.0           arrow_14.0.2.1       data.table_1.15.2    toolbox_0.1.1        janitor_2.2.0        lubridate_1.9.3
>[13] forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.5          tidyr_1.3.1
>[19] tibble_3.2.1         ggplot2_3.5.0        tidyverse_2.0.0      duckdb_1.1.2         DBI_1.2.3            fs_1.6.3
>
>
>Happy to provide additional information if it would be helpful.
>
>Thank you in advance!
