[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598117#comment-17598117 ]

Weston Pace commented on ARROW-17541:
-------------------------------------

So I'm pretty sure that the real problem is that R's garbage collector is 
somehow holding onto pool memory.  This wasn't new in 9.0.0; as you said 
yourself, it was already using a pretty excessive amount of RAM in 8.0.0.  
Notice that, in Python, downloading this file uses less than 1GB of RAM.  I'm 
not going to worry too much about the fact that RAM use doubled between 8.0.0 
and 9.0.0.  I suspect that is just because we are more aggressive with 
readahead on these sorts of files in 9.0.0 (in 8.0.0 we always read ahead 8 
batches; in 9.0.0 the readahead is based on the number of rows, which leads to 
20 batches on this file).

R's garbage collector is not running for two reasons:

1. R is not aware there is any memory pressure, because it doesn't see the RAM 
used by the Arrow memory pool.  As far as it is concerned it is only holding 
onto 80MB, when in reality it is holding onto multiple GB of RAM (the sketch 
after this list illustrates the mismatch).

2. We are executing a single (admittedly long-running) C statement with 
{{write_dataset}}.  R's garbage collector will not (I think) run mid-execution.
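
To make that concrete, here is a rough sketch of what I mean, not something I 
have run against this exact bucket.  It assumes the {{default_memory_pool()}} 
bindings in the R package ({{bytes_allocated}}, {{max_memory}}) report the C++ 
pool the way I expect, and it uses base R's {{gcinfo()}} to check point 2, 
since that logs every collection R actually performs:

{code:r}
library(arrow)

# Point 1: R's collector only counts R-heap objects...
gc()

# ...while the Arrow C++ pool tracks its own allocations separately.
pool <- default_memory_pool()
pool$bytes_allocated   # bytes the pool currently holds
pool$max_memory        # the pool's high-water mark

# Point 2: log every collection; if nothing prints while write_dataset()
# runs, R never collected during that single C call.
gcinfo(TRUE)
s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile())
gcinfo(FALSE)
{code}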

We could investigate the above two issues, but there is a third, more 
concerning problem:

3. R is holding onto memory that, as far as I can tell, it shouldn't even be 
able to see.

The allocations we are making in Arrow come from the memory pool; they are 
owned by record batch objects, and those record batch objects are never (as far 
as I know) converted to R.  Perhaps they are being converted to R somewhere (we 
are scanning and then writing; do we scan into R before we send to the write 
node?  I wouldn't think so, but I could be wrong).  Or perhaps R's memory 
allocator works in some strange way I'm not aware of.
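
One way to test that last point (again only a sketch, and I have not run it): 
force a full collection right after the write and watch the pool counter.  If 
{{bytes_allocated}} drops sharply after an explicit {{gc()}}, then R-side 
references or finalizers really were pinning the Arrow buffers; if it is 
already low before the {{gc()}}, the memory is being held somewhere else 
entirely.

{code:r}
library(arrow)

pool <- default_memory_pool()

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile())

# If R references are pinning Arrow buffers, this number should be large...
cat("pool bytes before gc():", pool$bytes_allocated, "\n")
gc()
# ...and this one should be much smaller once finalizers have run.
cat("pool bytes after gc(): ", pool$bytes_allocated, "\n")
{code}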

I'm going to have to step back from this investigation, as I've hit my limit 
for the week and there is other work I am on the hook for.  So if any R 
aficionados want to investigate, I'd be grateful.

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>         Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:r}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
>                        anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high (the whole file is only 4 GB on disk), but on arrow 9.0.0 
> RAM use for the same operation approximately doubles, which is large enough 
> to trigger the OOM killer on the task in several of our active production 
> workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in the 8.0 release?  Is 
> there something about this particular parquet file that is responsible for 
> the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. a single 4 GB parquet file); as R users we really depend on 
> arrow's out-of-band operations to work with larger-than-RAM data.
>  


