Richard,
I currently have no problem with running out of memory. I was referring to people who have said they use LARGE structures, and I am pointing out how those structures can temporarily grow far larger than expected. Functions whose memory use temporarily balloons might come with a note saying so. And, yes, some transformations may well be doable outside R or in chunks. What gets me is how often users have no idea what happens when they invoke a package.

I am not against transformations and needed duplications. I am more interested in whether some existing code might be evaluated and updated in fairly harmless ways, such as removing objects as soon as they are definitely not needed. Of course there are tradeoffs. I have seen cases where only one column of a data.frame was needed, yet the entire data.frame was copied and then returned. That is OK, but clearly it might be more economical to ask for just that single column to be changed in place. People often use a sledgehammer when a thumbtack will do. But as noted, R has features that often delay things so a full copy is not made and thus less memory is ever used. Still, people seem to think that since all "local" memory is generally returned when the function ends, there is no reason to bother micromanaging it while the function runs.

Arguably, some R packages could make changes in what they keep and for how long. Standard R lets you specify which rows and which columns of a data.frame to keep in a single call, as in df[rows, columns], while something like dplyr offers multiple smaller steps in a grammar of sorts, so you do a select() followed (often in a pipeline) by a filter(), or the same steps in the opposite order. Programmers sometimes write each additional change as a minimal step, so a more efficient combined implementation is harder to achieve because each step does just one thing well. That may also be a plus, especially if pipelined objects are released as the pipeline progresses and not all at the end. (A small sketch contrasting the two styles follows Richard's quoted message below.)

From: Richard O'Keefe <rao...@gmail.com>
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross <avigr...@verizon.net>
Cc: R-help Mailing List <r-help@r-project.org>
Subject: Re: [R] Large data and space use

If you have enough data that running out of memory is a serious problem, then a language like R or Python or Octave or Matlab that offers you NO control over storage may not be the best choice. You might need to consider Julia or even Rust. However, if you have enough data that running out of memory is a serious problem, your problems may be worse than you think. In 2021, Linux is *still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed. Even if that doesn't happen, your process may simply be thrown off the machine without being warned.

It may be one of the biggest problems around in statistical computing: how to make it straightforward to carve up a problem so that it can be run on many machines. R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing

Another approach is to select and transform data outside R. If you have data in some kind of data base then doing the select and transform in the data base may be a good approach.
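To make the two styles concrete, here is a minimal sketch; the data.frame, column names, and values are invented for illustration, and the second version assumes dplyr is installed:

# invented example data
df <- data.frame(id = 1:10, x = rnorm(10), grp = rep(c("a", "b"), 5))

# Base R: rows and columns chosen in a single subsetting call
keep1 <- df[df$x > 0, c("id", "x")]

# dplyr: the same result as two smaller grammar steps, usually pipelined
library(dplyr)
keep2 <- df %>% filter(x > 0) %>% select(id, x)

# Changing one column rather than copying and returning the whole
# data.frame from a helper function
df$x <- df$x * 2

Note that df$x <- df$x * 2 still allocates a new vector for x, but the other columns are not duplicated, which is the economy I had in mind.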
On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help@r-project.org> wrote:

Several recent questions and answers have made me look at some code, and I realized that some functions may not be great to use when you are dealing with very large amounts of data that may already be getting close to the limits of your memory. Does the function you call to do one thing to your object perhaps overdo it and make multiple copies, and not delete them as soon as they are not needed?

An example was a recent post suggesting a nice set of tools you can use to convert your data.frame so the columns are integers or dates, no matter how they were read in from a CSV file or created. What I noticed is that copies of a sort were often made by trying to change the original, say, to one date format or another, and then deciding which, if any, to keep. Sometimes multiple transformations are tried, and this may be done repeatedly with intermediates left lying around. Yes, the memory will all be implicitly returned when the function completes. But often these functions invoke yet other functions which work on their own copies. You can end up with your original data temporarily using multiple times as much actual memory. R does have features so some things are "shared" unless one copy or another changes. But in the cases I am looking at, changes are the whole idea. What I wonder is whether such functions should explicitly call rm() or the equivalent as soon as something is no longer needed.

The various kinds of pipelines are another case in point, as they involve all kinds of hidden temporary variables that eventually need to be cleaned up. When are they removed? I have seen pipelines with 10 or more steps, as perhaps data is read in, has rows or columns removed or re-ordered, has grouping applied, is merged with other data, and reports are generated. The intermediates are often of a size similar to the data and, if large, they can add up. If writing the code linearly using temp1 and temp2 type variables to hold the output of one stage and the input of the next stage, I would be tempted to add an rm(temp1) as soon as it was finished being used, or just reuse the name temp1 so the previous contents are no longer being pointed to and can be taken by the garbage collector at some point.

So I wonder if some functions should have a note in their manual pages specifying what may happen to the volume of data as they run. An example would be a function that took a matrix and simply squared it using matrix multiplication. There are various ways to do this, and one of them simply makes a copy and invokes R's built-in multiplication of two matrices. It then returns the result. So you end up storing basically three times the size of the matrix right before you return it. Other methods might do the actual multiplication in loops operating on subsections of the matrix and, if done carefully, never keep more than, say, 2.1 times as much data around. Or is this not important often enough? All I know is that data may be getting larger much faster than the memory in our machines does.
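As a small illustration of the points in the message quoted above, here is a sketch with invented file, column, and object names; tracemem() only reports duplications when R was built with memory profiling enabled, as the standard Windows and macOS binaries are:

# Linear, non-pipeline version of a multi-step transformation,
# releasing each intermediate as soon as it is no longer needed.
temp1 <- read.csv("big.csv")            # invented file name
temp2 <- temp1[temp1$keep == 1, ]       # drop unwanted rows
rm(temp1)                               # temp1 is finished; let it go
temp3 <- temp2[, c("id", "value")]      # keep only the needed columns
rm(temp2)
# gc() can be called to encourage collection now, although R will
# reclaim the space on its own when it needs it.
gc()

# Copy-on-modify: two names can share one allocation until a change is made.
m <- matrix(rnorm(1e6), nrow = 1000)
tracemem(m)      # ask R to report when m is actually duplicated
m2 <- m          # no copy yet; m and m2 share the same memory
m2[1, 1] <- 0    # tracemem reports a duplication here
rm(m2)

# Squaring via %*% allocates a new matrix for the result, so just before
# returning you hold roughly twice the size of m, plus whatever extra
# copies a wrapper function makes along the way.
msq <- m %*% m

A wrapper that first copies its matrix argument and then calls %*% is what pushes the peak to roughly three times the matrix's size, as described above.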
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.