Richard,
I currently have no problem with running out of memory. I was referring to people who have said they use LARGE structures, and I am pointing out how those structures can temporarily grow far larger than expected. Functions whose memory use temporarily balloons might come with a note saying so. And, yes, some transformations may well be doable outside R or in chunks. What gets me is how often users have no idea what happens when they invoke a package.

I am not against transformations and needed duplications. I am more interested in whether some existing code might be evaluated and updated in fairly harmless ways, such as removing objects as soon as they are definitely not needed. Of course there are tradeoffs. I have seen cases where only one column of a data.frame was needed, yet the entire data.frame was copied and then returned. That is OK, but clearly it might be more economical to ask for just that single column to be changed in place. People often use a sledgehammer when a thumbtack will do. But as noted, R has features that often delay things so a full copy is not made and thus less memory is ever used. Still, people seem to think that since all "local" memory is generally returned when the function ends, there is no reason to bother micromanaging it while the function runs.

Arguably, some R packages could make changes in what they keep and for how long. Standard R lets you specify which rows and which columns of a data.frame to keep in a single call, as in df[rows, columns], while something like dplyr offers multiple smaller steps in a grammar of sorts, so you do a select() followed (often in a pipeline) by a filter(), or the same steps in the opposite order. Programmers sometimes write each additional change as a minimal step, so a more efficient combined implementation is harder to achieve because each step does just one thing well. That may also be a plus, especially if pipelined objects are released as the pipeline progresses and not all at the end. (A small sketch contrasting the two styles follows Richard's quoted message below.)

From: Richard O'Keefe <rao...@gmail.com>
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross <avigr...@verizon.net>
Cc: R-help Mailing List <r-help@r-project.org>
Subject: Re: [R] Large data and space use

If you have enough data that running out of memory is a serious problem, then a language like R or Python or Octave or Matlab that offers you NO control over storage may not be the best choice. You might need to consider Julia or even Rust. However, if you have enough data that running out of memory is a serious problem, your problems may be worse than you think. In 2021, Linux is *still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed. Even if that doesn't happen, your process may simply be thrown off the machine without being warned.

It may be one of the biggest problems around in statistical computing: how to make it straightforward to carve up a problem so that it can be run on many machines. R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing

Another approach is to select and transform data outside R. If you have data in some kind of data base then doing the select and transform in the data base may be a good approach.
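To make the two styles concrete, here is a minimal sketch; the data.frame, column names, and values are invented for illustration, and the second version assumes dplyr is installed:

# invented example data
df <- data.frame(id = 1:10, x = rnorm(10), grp = rep(c("a", "b"), 5))

# Base R: rows and columns chosen in a single subsetting call
keep1 <- df[df$x > 0, c("id", "x")]

# dplyr: the same result as two smaller grammar steps, usually pipelined
library(dplyr)
keep2 <- df %>% filter(x > 0) %>% select(id, x)

# Changing one column rather than copying and returning the whole
# data.frame from a helper function
df$x <- df$x * 2

Note that df$x <- df$x * 2 still allocates a new vector for x, but the other columns are not duplicated, which is the economy I had in mind.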
On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help@r-project.org> wrote:

Several recent questions and answers have made me look at some code, and I realized that some functions may not be great to use when you are dealing with very large amounts of data that may already be getting close to the limits of your memory. Does the function you call to do one thing to your object perhaps overdo it and make multiple copies, and not delete them as soon as they are not needed?

An example was a recent post suggesting a nice set of tools you can use to convert your data.frame so the columns are integers or dates, no matter how they were read in from a CSV file or created. What I noticed is that copies of a sort were often made by trying to change the original, say, to one date format or another, and then deciding which, if any, to keep. Sometimes multiple transformations are tried, and this may be done repeatedly with intermediates left lying around. Yes, the memory will all be implicitly returned when the function completes. But often these functions invoke yet other functions which work on their own copies. You can end up with your original data temporarily using multiple times as much actual memory. R does have features so some things are "shared" unless one copy or another changes. But in the cases I am looking at, changes are the whole idea. What I wonder is whether such functions should explicitly call rm() or the equivalent as soon as something is no longer needed.

The various kinds of pipelines are another case in point, as they involve all kinds of hidden temporary variables that eventually need to be cleaned up. When are they removed? I have seen pipelines with 10 or more steps, as perhaps data is read in, has rows or columns removed or re-ordered, has grouping applied, is merged with other data, and reports are generated. The intermediates are often of a size similar to the data and, if large, they can add up. If writing the code linearly using temp1 and temp2 type variables to hold the output of one stage and the input of the next stage, I would be tempted to add an rm(temp1) as soon as it was finished being used, or just reuse the name temp1 so the previous contents are no longer being pointed to and can be taken by the garbage collector at some point.

So I wonder if some functions should have a note in their manual pages specifying what may happen to the volume of data as they run. An example would be a function that took a matrix and simply squared it using matrix multiplication. There are various ways to do this, and one of them simply makes a copy and invokes R's built-in multiplication of two matrices. It then returns the result. So you end up storing basically three times the size of the matrix right before you return it. Other methods might do the actual multiplication in loops operating on subsections of the matrix and, if done carefully, never keep more than, say, 2.1 times as much data around. Or is this not important often enough? All I know is that data may be getting larger much faster than the memory in our machines does.
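As a small illustration of the points in the message quoted above, here is a sketch with invented file, column, and object names; tracemem() only reports duplications when R was built with memory profiling enabled, as the standard Windows and macOS binaries are:

# Linear, non-pipeline version of a multi-step transformation,
# releasing each intermediate as soon as it is no longer needed.
temp1 <- read.csv("big.csv")            # invented file name
temp2 <- temp1[temp1$keep == 1, ]       # drop unwanted rows
rm(temp1)                               # temp1 is finished; let it go
temp3 <- temp2[, c("id", "value")]      # keep only the needed columns
rm(temp2)
# gc() can be called to encourage collection now, although R will
# reclaim the space on its own when it needs it.
gc()

# Copy-on-modify: two names can share one allocation until a change is made.
m <- matrix(rnorm(1e6), nrow = 1000)
tracemem(m)      # ask R to report when m is actually duplicated
m2 <- m          # no copy yet; m and m2 share the same memory
m2[1, 1] <- 0    # tracemem reports a duplication here
rm(m2)

# Squaring via %*% allocates a new matrix for the result, so just before
# returning you hold roughly twice the size of m, plus whatever extra
# copies a wrapper function makes along the way.
msq <- m %*% m

A wrapper that first copies its matrix argument and then calls %*% is what pushes the peak to roughly three times the matrix's size, as described above.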
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.