It's Red Hat Enterprise Linux 9.

Specifically:
OS Information:
NAME="Red Hat Enterprise Linux"
VERSION="9.3 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/";
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9";
BUG_REPORT_URL="https://bugzilla.redhat.com/";
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
Operating System: Red Hat Enterprise Linux 9.3 (Plow)
     CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
          Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
    Architecture: x86-64
 Hardware Vendor: Dell Inc.
  Hardware Model: PowerEdge R840
Firmware Version: 2.15.1


Regarding RAM restrictions, here are the specs:
               total        used        free      shared  buff/cache   available
Mem:           753Gi        70Gi       600Gi       2.9Gi        89Gi       683Gi
Swap:          4.0Gi       2.5Gi       1.5Gi

It's a multi-user server so naturally things fluctuate.

Regarding possible CPU restrictions, here are the specs of our server:

    Thread(s) per core:  2
    Core(s) per socket:  20
    Socket(s):           4
    Stepping:            4
    CPU(s) scaling MHz:  50%
    CPU max MHz:         3700.0000
    CPU min MHz:         1000.0000



________________________________
From: Gregg Powell <g.a.pow...@protonmail.com>
Sent: Wednesday, December 11, 2024 11:41 AM
To: Deramus, Thomas Patrick <tdera...@mgb.org>
Cc: r-help@r-project.org <r-help@r-project.org>
Subject: Re: [R] Cores hang when calling mcapply

Thomas,
I'm curious - what OS are you running this on, and how much memory does the 
computer have?

Let me know if that code worked out as I hoped.

regards,
gregg

On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick 
<tdera...@mgb.org> wrote:
About to try this implementation.

As a follow-up, this is the exact error:

Lost warning messages
Error: no more error handlers available (recursive errors?); invoking 'abort' 
restart
Execution halted
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)

________________________________
From: Gregg Powell <g.a.pow...@protonmail.com>
Sent: Tuesday, December 10, 2024 7:52 PM
To: Deramus, Thomas Patrick <tdera...@mgb.org>
Cc: r-help@r-project.org <r-help@r-project.org>
Subject: Re: [R] Cores hang when calling mcapply

Hello Thomas,

Consider that the primary bottleneck may be tied to memory usage and the 
complexity of pivoting extremely large datasets into wide formats with tens of 
thousands of unique values per column. Extremely large expansions of columns 
inherently stress both memory and CPU, and splitting into 110k separate data 
frames before pivoting and combining them again is likely causing resource 
overhead and system instability.

Perhaps evaluate whether the presence/absence transformation can be done in a 
more memory-efficient manner without pivoting all at once. Since you are dealing 
with extremely large data, a more incremental or streaming approach may be 
necessary. Instead of splitting into thousands of individual data frames and 
trying to pivot each in parallel, consider a method that processes segments of 
data to incrementally build a large sparse matrix or a compressed 
representation, then combines the results at the end.
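
For instance, a minimal sketch of that incremental idea (the file paths and
column names below are placeholders, not your actual workflow):

library(data.table)

# Hypothetical input files; substitute your real paths and column names.
input_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# Accumulate only the unique (ID_Key, value) pairs from each file, so that no
# single wide table ever has to be held in memory.
pairs <- unique(rbindlist(lapply(input_files, function(f) {
  chunk <- fread(f, select = c("ID_Key", "column1"))
  unique(chunk[!is.na(column1)])
})))

# `pairs` is a compact long-format record of presence; it only needs to be
# cast to wide form (or converted to a sparse matrix) at the very end.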

It's probably better to move away from `pivot_wider()` on a massive scale and 
attempt a data.table-based approach, which is often more memory-efficient and 
faster for large-scale operations in R.


As an alternative, data.table's `dcast()` can handle large data more 
efficiently, and data.table's in-memory operations often reduce overhead 
compared to tidyverse pivoting functions.

Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly 
with `as.data.table()` to keep everything in a data.table format. For example, 
you can do a large `dcast()` operation to create presence/absence columns by 
group. If your categories are extremely numerous, consider an approach that 
processes categories in segments, as I mentioned earlier, writes intermediate 
results to disk, then combines/merges the results at the end.
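
A rough sketch of that segmented casting, assuming a single long-format file
and an arbitrary segment size of 5000 categories per pass (both placeholders):

library(data.table)

# Hypothetical long-format file with columns ID_Key and column1.
dt <- fread("combined_long.csv")

# Split the unique categories into manageable segments and cast each segment
# separately, writing each intermediate wide table to disk.
cats <- unique(dt$column1)
segments <- split(cats, ceiling(seq_along(cats) / 5000))   # ~5000 new columns per pass

for (s in seq_along(segments)) {
  wide <- dcast(dt[column1 %chin% segments[[s]]],
                ID_Key ~ column1, fun.aggregate = length, value.var = "column1")
  fwrite(wide, sprintf("wide_part_%03d.csv", s))
}

# Afterwards, read the parts back and merge them on ID_Key.
parts <- lapply(list.files(pattern = "^wide_part_.*\\.csv$"), fread)
final <- Reduce(function(a, b) merge(a, b, by = "ID_Key", all = TRUE), parts)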

Limit parallelization when dealing with massive reshapes. Instead of trying to 
parallelize the entire pivot across thousands of subsets, run a single 
parallelized chunking approach that processes manageable subsets and writes out 
intermediate results (for example, using `fwrite()` for each subset). After 
processing, load and combine these intermediate results. This manual segmenting 
approach can circumvent the "zombie" processes you mentioned, which I think 
arise from overly complex parallel nesting and excessive memory utilization.
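
Something along these lines, for example, where the file names, the 8 ID
blocks, and the 4 workers are all illustrative rather than tuned values:

library(data.table)
library(parallel)

# Hypothetical long-format input with columns ID_Key and column1.
dt  <- fread("combined_long.csv")
ids <- unique(dt$ID_Key)
id_groups <- split(ids, cut(seq_along(ids), 8, labels = FALSE))   # 8 blocks of IDs

# A small, fixed number of workers, each casting one block of IDs and writing
# its own intermediate file, keeps memory per process bounded.
mclapply(seq_along(id_groups), function(g) {
  wide <- dcast(dt[ID_Key %chin% id_groups[[g]]],
                ID_Key ~ column1, fun.aggregate = length, value.var = "column1")
  fwrite(wide, sprintf("chunk_%02d.csv", g))
  NULL
}, mc.cores = 4)

# Combine afterwards; fill = TRUE pads columns absent from some chunks, and
# the remaining NAs are turned into 0 (absent).
final <- rbindlist(lapply(list.files(pattern = "^chunk_.*\\.csv$"), fread), fill = TRUE)
setnafill(final, fill = 0, cols = setdiff(names(final), "ID_Key"))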

If the presence/absence indicators are ultimately sparse (many zeros and few 
ones), consider storing the result in a sparse matrix format (for example, the 
`Matrix` package in R). Instead of creating thousands of columns as dense 
integers, using a sparse matrix representation should dramatically reduce 
memory. After processing the data into a sparse format, you can then save it in 
a suitable file format and only convert to a dense format if absolutely 
necessary.
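
As a sketch, building on the `pairs` table from the earlier example (the
object and file names are hypothetical):

library(Matrix)

# Assuming `pairs` is a long table of unique (ID_Key, column1) presence pairs
# with character columns, as in the earlier sketch.
ids  <- unique(pairs$ID_Key)
vals <- unique(pairs$column1)

presence <- sparseMatrix(
  i = match(pairs$ID_Key, ids),
  j = match(pairs$column1, vals),
  x = 1,
  dims = c(length(ids), length(vals)),
  dimnames = list(ids, vals)
)

# The sparse matrix stores only the non-zero entries, so it typically needs a
# small fraction of the memory of the equivalent dense 0/1 matrix; save it
# as-is rather than densifying.
saveRDS(presence, "presence_sparse.rds")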

Below is a reworked code segment using data.table for a more scalable approach. 
Note that this is a conceptual template. In practice, adapt the chunk sizes and 
filtering operations to your workflow. The idea is to avoid creating 110k 
separate data frames and to handle the pivot in a data.table manner that's more 
robust and less memory intensive. Here, presence/absence encoding is done by 
grouping and casting directly rather than repeatedly splitting and row-binding.

library(data.table)
library(arrow)

# Step A: Load data efficiently as a data.table
dt <- as.data.table(
  open_dataset(
    sources = input_files,
    format = "csv",
    unify_schemas = TRUE,
    col_types = schema(
      "ID_Key"  = string(),
      "column1" = string(),
      "column2" = string()
    )
  ) |>
    collect()
)

# Step B: Clean names once
# Assume `crewjanitormakeclean` essentially standardizes column names
dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]

# Step C: Create presence/absence indicators using data.table
# Use dcast to pivot wide: length() counts occurrences, so > 0 means presence
# and 0 means absence. For very many unique values, consider chunking as above.
out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
              fun.aggregate = length, value.var = "column1")
out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
              fun.aggregate = length, value.var = "column2")

# Step D: Merge the two wide tables by ID_Key
# Fill missing columns with 0 using data.table on-the-fly operations
all_cols <- unique(c(names(out1), names(out2)))
out1_missing <- setdiff(all_cols, names(out1))
out2_missing <- setdiff(all_cols, names(out2))

# Add missing columns with 0
for (col in out1_missing) out1[, (col) := 0]
for (col in out2_missing) out2[, (col) := 0]

# Ensure column order alignment if needed
setcolorder(out1, all_cols)
setcolorder(out2, all_cols)

# Combine by ID_Key (since they share the same columns now)
final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)

# Step E: If needed, summarize across ID_Key to sum the presence indicators
final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE),
                         by = ID_Key,
                         .SDcols = setdiff(names(final_dt), "ID_Key")]

# final_result should now contain the combined presence/absence indicators.




Hope this helps!
gregg
somewhereinArizona

The information in this e-mail is intended only for the person to whom it is 
addressed.  If you believe this e-mail was sent to you in error and the e-mail 
contains patient information, please contact the Mass General Brigham 
Compliance HelpLine at 
https://www.massgeneralbrigham.org/complianceline.


Please note that this e-mail is not secure (encrypted).  If you do not wish to 
continue communication over unencrypted e-mail, please notify the sender of 
this message immediately.  Continuing to send or respond to e-mail after 
receiving this message means you understand and accept this risk and wish to 
continue to communicate over unencrypted e-mail.


