How is the server configured to handle memory distribution for individual users? I see it has over 700GB of total system memory, but how much can be assigned to each individual user?
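For example - and this is only a rough, untested sketch that assumes a stock RHEL 9 style setup with cgroup v2, not something taken from your server - running the following inside an R session should show whether a per-process or per-user memory cap applies:

# Rough sketch (assumes Linux with cgroup v2, as on a default RHEL 9 install).
# Prints the shell's per-process virtual-memory limit and, if present, the
# memory cap of the cgroup this R session runs in (e.g. a per-user slice).
cat("ulimit -v:", system("ulimit -v", intern = TRUE), "\n")

cg  <- sub("^0::", "", readLines("/proc/self/cgroup")[1])   # cgroup path of this process
cap <- file.path("/sys/fs/cgroup", cg, "memory.max")
if (file.exists(cap)) {
  cat("cgroup memory.max:", readLines(cap), "\n")           # "max" means no cap is set
} else {
  cat("no cgroup v2 memory.max found for", cg, "\n")
}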
Again - just curious, and wondering how much memory was assigned to your instance when you were running R.

regards,
Gregg

On Wednesday, December 11th, 2024 at 9:49 AM, Deramus, Thomas Patrick <tdera...@mgb.org> wrote:

> It's Redhat Enterprise Linux 9
>
> Specifically:
> OS Information:
> NAME="Red Hat Enterprise Linux"
> VERSION="9.3 (Plow)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="9.3"
> PLATFORM_ID="platform:el9"
> PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
> ANSI_COLOR="0;31"
> LOGO="fedora-logo-icon"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
> HOME_URL="https://www.redhat.com/"
> DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
> REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
> Operating System: Red Hat Enterprise Linux 9.3 (Plow)
> CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
> Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
> Architecture: x86-64
> Hardware Vendor: Dell Inc.
> Hardware Model: PowerEdge R840
> Firmware Version: 2.15.1
>
> Regarding RAM restrictions, here are the specs:
>                total        used        free      shared  buff/cache   available
> Mem:           753Gi        70Gi       600Gi       2.9Gi        89Gi       683Gi
> Swap:          4.0Gi        2.5Gi       1.5Gi
>
> It's a multi-user server, so naturally things fluctuate.
>
> Regarding possible CPU restrictions, here are the specs of our server:
> Thread(s) per core:   2
> Core(s) per socket:   20
> Socket(s):            4
> Stepping:             4
> CPU(s) scaling MHz:   50%
> CPU max MHz:          3700.0000
> CPU min MHz:          1000.0000
>
> From: Gregg Powell <g.a.pow...@protonmail.com>
> Sent: Wednesday, December 11, 2024 11:41 AM
> To: Deramus, Thomas Patrick <tdera...@mgb.org>
> Cc: r-help@r-project.org <r-help@r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
>
> Thomas,
> I'm curious - what OS are you running this on, and how much memory does the computer have?
>
> Let me know if that code worked out as I hoped.
>
> regards,
> gregg
>
> On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick <tdera...@mgb.org> wrote:
>
> > About to try this implementation.
> >
> > As a follow-up, this is the exact error:
> >
> > Lost warning messages
> > Error: no more error handlers available (recursive errors?); invoking 'abort' restart
> > Execution halted
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> >
> > From: Gregg Powell <g.a.pow...@protonmail.com>
> > Sent: Tuesday, December 10, 2024 7:52 PM
> > To: Deramus, Thomas Patrick <tdera...@mgb.org>
> > Cc: r-help@r-project.org <r-help@r-project.org>
> > Subject: Re: [R] Cores hang when calling mcapply
> >
> > Hello Thomas,
> >
> > Consider that the primary bottleneck may be tied to memory usage and the complexity of pivoting extremely large datasets into wide formats with tens of thousands of unique values per column. Extremely large expansions of columns inherently stress both memory and CPU, and splitting into 110k separate data frames before pivoting and combining them again is likely causing resource overhead and system instability.
> >
> > Perhaps evaluate whether the presence/absence transformation can be done in a more memory-efficient manner without pivoting all at once.
> > Since you are dealing with extremely large data, a more incremental or streaming approach may be necessary. Instead of splitting into thousands of individual data frames and trying to pivot each in parallel, consider a method that processes segments of data to incrementally build a large sparse matrix or a compressed representation, then combines the results at the end.
> >
> > It's probably better to move away from `pivot_wider()` on a massive scale and attempt a data.table-based approach, which is often more memory-efficient and faster for large-scale operations in R.
> >
> > An alternative is data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
> >
> > Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier and writes intermediate results to disk, then combines/merges the results at the end.
> >
> > Limit parallelization when dealing with massive reshapes. Instead of trying to parallelize the entire pivot across thousands of subsets, run a single parallelized chunking approach that processes manageable subsets and writes out intermediate results (for example, using `fwrite()` for each subset). After processing, load and combine these intermediate results. This manual segmenting approach can circumvent the "zombie" processes you mentioned, which I think arise from overly complex parallel nesting and excessive memory utilization.
> >
> > If the presence/absence indicators are ultimately sparse (many zeros and few ones), consider storing the result in a sparse matrix format (for example, the `Matrix` package in R). Instead of creating thousands of columns as dense integers, a sparse matrix representation should dramatically reduce memory. After processing the data into a sparse format, you can save it in a suitable file format and only convert to a dense format if absolutely necessary.
> >
> > Below is a reworked code segment using data.table for a more scalable approach. Note that this is a conceptual template. In practice, adapt the chunk sizes and filtering operations to your workflow. The idea is to avoid creating 110k separate data frames and to handle the pivot in a data.table manner that's more robust and less memory intensive. Here, presence/absence encoding is done by grouping and casting directly rather than repeatedly splitting and row-binding.
> >
> > > library(data.table)
> > > library(arrow)
> > >
> > > # Step A: Load data efficiently as data.table
> > > dt <- as.data.table(
> > >   open_dataset(
> > >     sources = input_files,
> > >     format = 'csv',
> > >     unify_schema = TRUE,
> > >     col_types = schema(
> > >       "ID_Key" = string(),
> > >       "column1" = string(),
> > >       "column2" = string()
> > >     )
> > >   ) |>
> > >     collect()
> > > )
> > >
> > > # Step B: Clean names once
> > > # Assume `crewjanitormakeclean` essentially standardizes column names
> > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
> > >
> > > # Step C: Create presence/absence indicators using data.table
> > > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence.
> > > # For large unique values, consider chunking if needed.
> > > out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1, fun.aggregate = length, value.var = "column1")
> > > out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2, fun.aggregate = length, value.var = "column2")
> > >
> > > # Step D: Merge the two wide tables by ID_Key
> > > # Fill missing columns with 0 using data.table on-the-fly operations
> > > all_cols <- unique(c(names(out1), names(out2)))
> > > out1_missing <- setdiff(all_cols, names(out1))
> > > out2_missing <- setdiff(all_cols, names(out2))
> > >
> > > # Add missing columns with 0
> > > for (col in out1_missing) out1[, (col) := 0]
> > > for (col in out2_missing) out2[, (col) := 0]
> > >
> > > # Ensure column order alignment if needed
> > > setcolorder(out1, all_cols)
> > > setcolorder(out2, all_cols)
> > >
> > > # Combine by ID_Key (since they share same columns now)
> > > final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
> > >
> > > # Step E: If needed, summarize across ID_Key to sum presence indicators
> > > final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE), by = ID_Key, .SDcols = setdiff(names(final_dt), "ID_Key")]
> > >
> > > # note that final_result should now contain summed presence/absence (0/1) indicators.
> >
> > Hope this helps!
> > gregg
> > somewhereinArizona
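As a postscript on the sparse-matrix idea raised above: a rough, untested sketch of that route, reusing the hypothetical `dt`, `ID_Key`, and `column1` names from the example (adapt them to the real columns), could look like this:

library(data.table)
library(Matrix)

# Sketch only: build one sparse 0/1 matrix per category column instead of a
# dense wide table. Assumes `dt` has columns ID_Key and column1 as above.
u    <- unique(dt[!is.na(column1), .(ID_Key, column1)])
ids  <- factor(u$ID_Key)
cats <- factor(u$column1)
pa   <- sparseMatrix(
  i        = as.integer(ids),
  j        = as.integer(cats),
  x        = 1L,
  dims     = c(nlevels(ids), nlevels(cats)),
  dimnames = list(levels(ids), levels(cats))
)
# Only nonzero cells are stored, so memory stays proportional to the number of
# (ID_Key, category) pairs rather than rows x columns.
format(object.size(pa), units = "MB")

Saving with saveRDS() keeps it sparse on disk; only densify with as.matrix() if a downstream step truly requires it.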
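Similarly, for the chunk-then-combine idea (process manageable subsets, `fwrite()` each partial result, recombine at the end), an equally hypothetical outline - the `input_files` vector and the "intermediate" directory are placeholders - might be:

library(data.table)

# Sketch only: pivot each input file separately, write the partial wide table
# to disk, then load the parts and re-aggregate by ID_Key at the end.
dir.create("intermediate", showWarnings = FALSE)

for (f in input_files) {
  part <- fread(f, select = c("ID_Key", "column1"))
  wide <- dcast(part[!is.na(column1)], ID_Key ~ column1,
                fun.aggregate = length, value.var = "column1")
  fwrite(wide, file.path("intermediate", paste0(basename(f), ".wide.csv")))
}

pieces <- rbindlist(
  lapply(list.files("intermediate", full.names = TRUE), fread),
  use.names = TRUE, fill = TRUE
)
for (j in setdiff(names(pieces), "ID_Key")) {
  set(pieces, which(is.na(pieces[[j]])), j, 0L)   # columns absent from a chunk come back as NA
}
final <- pieces[, lapply(.SD, sum), by = ID_Key]

The loop could also be run in parallel over batches of files, but keeping each worker's result on disk rather than in RAM is what avoids the memory blow-up.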
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.