How is the server configured to handle memory distribution for individual users? I see it has over 700GB of total system memory, but how much can be assigned to each individual user?
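For example - and this is only a rough, untested sketch that assumes a stock RHEL 9 style setup with cgroup v2, not something taken from your server - running the following inside an R session should show whether a per-process or per-user memory cap applies:

# Rough sketch (assumes Linux with cgroup v2, as on a default RHEL 9 install).
# Prints the shell's per-process virtual-memory limit and, if present, the
# memory cap of the cgroup this R session runs in (e.g. a per-user slice).
cat("ulimit -v:", system("ulimit -v", intern = TRUE), "\n")

cg  <- sub("^0::", "", readLines("/proc/self/cgroup")[1])   # cgroup path of this process
cap <- file.path("/sys/fs/cgroup", cg, "memory.max")
if (file.exists(cap)) {
  cat("cgroup memory.max:", readLines(cap), "\n")           # "max" means no cap is set
} else {
  cat("no cgroup v2 memory.max found for", cg, "\n")
}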
Again - just curious, and wondering how much memory was assigned to your instance when you were running R.

regards,
Gregg

On Wednesday, December 11th, 2024 at 9:49 AM, Deramus, Thomas Patrick <tdera...@mgb.org> wrote:

> It's Redhat Enterprise Linux 9
>
> Specifically:
> OS Information:
> NAME="Red Hat Enterprise Linux"
> VERSION="9.3 (Plow)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="9.3"
> PLATFORM_ID="platform:el9"
> PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
> ANSI_COLOR="0;31"
> LOGO="fedora-logo-icon"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
> HOME_URL="https://www.redhat.com/"
> DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
> REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
> Operating System: Red Hat Enterprise Linux 9.3 (Plow)
> CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
> Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
> Architecture: x86-64
> Hardware Vendor: Dell Inc.
> Hardware Model: PowerEdge R840
> Firmware Version: 2.15.1
>
> Regarding RAM restrictions, here are the specs:
>                total        used        free      shared  buff/cache   available
> Mem:           753Gi        70Gi       600Gi       2.9Gi        89Gi       683Gi
> Swap:          4.0Gi        2.5Gi       1.5Gi
>
> It's a multi-user server, so naturally things fluctuate.
>
> Regarding possible CPU restrictions, here are the specs of our server:
> Thread(s) per core:   2
> Core(s) per socket:   20
> Socket(s):            4
> Stepping:             4
> CPU(s) scaling MHz:   50%
> CPU max MHz:          3700.0000
> CPU min MHz:          1000.0000
>
> From: Gregg Powell <g.a.pow...@protonmail.com>
> Sent: Wednesday, December 11, 2024 11:41 AM
> To: Deramus, Thomas Patrick <tdera...@mgb.org>
> Cc: r-help@r-project.org <r-help@r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
>
> Thomas,
> I'm curious - what OS are you running this on, and how much memory does the computer have?
>
> Let me know if that code worked out as I hoped.
>
> regards,
> gregg
>
> On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick <tdera...@mgb.org> wrote:
>
> > About to try this implementation.
> >
> > As a follow-up, this is the exact error:
> >
> > Lost warning messages
> > Error: no more error handlers available (recursive errors?); invoking 'abort' restart
> > Execution halted
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> >
> > From: Gregg Powell <g.a.pow...@protonmail.com>
> > Sent: Tuesday, December 10, 2024 7:52 PM
> > To: Deramus, Thomas Patrick <tdera...@mgb.org>
> > Cc: r-help@r-project.org <r-help@r-project.org>
> > Subject: Re: [R] Cores hang when calling mcapply
> >
> > Hello Thomas,
> >
> > Consider that the primary bottleneck may be tied to memory usage and the complexity of pivoting extremely large datasets into wide formats with tens of thousands of unique values per column. Extremely large expansions of columns inherently stress both memory and CPU, and splitting into 110k separate data frames before pivoting and combining them again is likely causing resource overhead and system instability.
> >
> > Perhaps evaluate whether the presence/absence transformation can be done in a more memory-efficient manner without pivoting all at once.
> > Since you are dealing with extremely large data, a more incremental or streaming approach may be necessary. Instead of splitting into thousands of individual data frames and trying to pivot each in parallel, consider a method that processes segments of data to incrementally build a large sparse matrix or a compressed representation, then combines the results at the end.
> >
> > It's probably better to move away from `pivot_wider()` on a massive scale and attempt a data.table-based approach, which is often more memory-efficient and faster for large-scale operations in R.
> >
> > An alternative is data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
> >
> > Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier and writes intermediate results to disk, then combines/merges the results at the end.
> >
> > Limit parallelization when dealing with massive reshapes. Instead of trying to parallelize the entire pivot across thousands of subsets, run a single parallelized chunking approach that processes manageable subsets and writes out intermediate results (for example, using `fwrite()` for each subset). After processing, load and combine these intermediate results. This manual segmenting approach can circumvent the "zombie" processes you mentioned, which I think arise from overly complex parallel nesting and excessive memory utilization.
> >
> > If the presence/absence indicators are ultimately sparse (many zeros and few ones), consider storing the result in a sparse matrix format (for example, the `Matrix` package in R). Instead of creating thousands of columns as dense integers, a sparse matrix representation should dramatically reduce memory. After processing the data into a sparse format, you can save it in a suitable file format and only convert to a dense format if absolutely necessary.
> >
> > Below is a reworked code segment using data.table for a more scalable approach. Note that this is a conceptual template. In practice, adapt the chunk sizes and filtering operations to your workflow. The idea is to avoid creating 110k separate data frames and to handle the pivot in a data.table manner that's more robust and less memory intensive. Here, presence/absence encoding is done by grouping and casting directly rather than repeatedly splitting and row-binding.
> >
> > > library(data.table)
> > > library(arrow)
> > >
> > > # Step A: Load data efficiently as data.table
> > > dt <- as.data.table(
> > >   open_dataset(
> > >     sources = input_files,
> > >     format = 'csv',
> > >     unify_schema = TRUE,
> > >     col_types = schema(
> > >       "ID_Key" = string(),
> > >       "column1" = string(),
> > >       "column2" = string()
> > >     )
> > >   ) |>
> > >     collect()
> > > )
> > >
> > > # Step B: Clean names once
> > > # Assume `crewjanitormakeclean` essentially standardizes column names
> > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
> > >
> > > # Step C: Create presence/absence indicators using data.table
> > > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence.
> > > # For large unique values, consider chunking if needed.
> > > out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1, fun.aggregate = length, value.var = "column1")
> > > out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2, fun.aggregate = length, value.var = "column2")
> > >
> > > # Step D: Merge the two wide tables by ID_Key
> > > # Fill missing columns with 0 using data.table on-the-fly operations
> > > all_cols <- unique(c(names(out1), names(out2)))
> > > out1_missing <- setdiff(all_cols, names(out1))
> > > out2_missing <- setdiff(all_cols, names(out2))
> > >
> > > # Add missing columns with 0
> > > for (col in out1_missing) out1[, (col) := 0]
> > > for (col in out2_missing) out2[, (col) := 0]
> > >
> > > # Ensure column order alignment if needed
> > > setcolorder(out1, all_cols)
> > > setcolorder(out2, all_cols)
> > >
> > > # Combine by ID_Key (since they share same columns now)
> > > final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
> > >
> > > # Step E: If needed, summarize across ID_Key to sum presence indicators
> > > final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE), by = ID_Key, .SDcols = setdiff(names(final_dt), "ID_Key")]
> > >
> > > # note that final_result should now contain summed presence/absence (0/1) indicators.
> >
> > Hope this helps!
> > gregg
> > somewhereinArizona
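As a postscript on the sparse-matrix idea raised above: a rough, untested sketch of that route, reusing the hypothetical `dt`, `ID_Key`, and `column1` names from the example (adapt them to the real columns), could look like this:

library(data.table)
library(Matrix)

# Sketch only: build one sparse 0/1 matrix per category column instead of a
# dense wide table. Assumes `dt` has columns ID_Key and column1 as above.
u    <- unique(dt[!is.na(column1), .(ID_Key, column1)])
ids  <- factor(u$ID_Key)
cats <- factor(u$column1)
pa   <- sparseMatrix(
  i        = as.integer(ids),
  j        = as.integer(cats),
  x        = 1L,
  dims     = c(nlevels(ids), nlevels(cats)),
  dimnames = list(levels(ids), levels(cats))
)
# Only nonzero cells are stored, so memory stays proportional to the number of
# (ID_Key, category) pairs rather than rows x columns.
format(object.size(pa), units = "MB")

Saving with saveRDS() keeps it sparse on disk; only densify with as.matrix() if a downstream step truly requires it.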
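Similarly, for the chunk-then-combine idea (process manageable subsets, `fwrite()` each partial result, recombine at the end), an equally hypothetical outline - the `input_files` vector and the "intermediate" directory are placeholders - might be:

library(data.table)

# Sketch only: pivot each input file separately, write the partial wide table
# to disk, then load the parts and re-aggregate by ID_Key at the end.
dir.create("intermediate", showWarnings = FALSE)

for (f in input_files) {
  part <- fread(f, select = c("ID_Key", "column1"))
  wide <- dcast(part[!is.na(column1)], ID_Key ~ column1,
                fun.aggregate = length, value.var = "column1")
  fwrite(wide, file.path("intermediate", paste0(basename(f), ".wide.csv")))
}

pieces <- rbindlist(
  lapply(list.files("intermediate", full.names = TRUE), fread),
  use.names = TRUE, fill = TRUE
)
for (j in setdiff(names(pieces), "ID_Key")) {
  set(pieces, which(is.na(pieces[[j]])), j, 0L)   # columns absent from a chunk come back as NA
}
final <- pieces[, lapply(.SD, sum), by = ID_Key]

The loop could also be run in parallel over batches of files, but keeping each worker's result on disk rather than in RAM is what avoids the memory blow-up.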
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.