[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585638#comment-17585638 ]

Carl Boettiger commented on ARROW-17541:
----------------------------------------

Thanks Weston for the pointers here!

 

Yes, tempfile() in the example above returns a local path, though I see the 
unexpectedly high RAM use in a variety of settings:
 * In the example described above, tempfile() for me is a local NVMe disk, 
while the data is read from a remote server, so I would not expect the read-in 
to be faster than the write-out in this case.
 * In our original case, we stream out to another S3 bucket rather than writing 
to a local tempfile(), and that is where we see the crash (a sketch of that 
setup follows below).
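
For reference, here is a minimal sketch of that second setup. The destination 
bucket, endpoint, and credential variables below are placeholders, not our 
actual configuration:

{code:r}
library(arrow)

# source: the public bucket holding the ~4 GB parquet file (same as the reprex)
src <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                 anonymous = TRUE)
ds  <- open_dataset(src$path("waq_test"))

# sink: a second S3 bucket we can write to (hypothetical endpoint and credentials)
dst <- s3_bucket("scratch", endpoint_override = "minio.example.org",
                 access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
                 secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"))

# stream bucket-to-bucket; this is the step that gets OOM-killed on 9.0.0
write_dataset(ds, dst$path("waq_copy"))
{code}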

In both settings I am testing on Linux.  On arrow 9.0.0 I see peak RAM use of 
nearly 30 GB in the "RES" column (e.g. in glances/top), or as "used" if I run 
`free -h` from bash.  The RStudio interface shows the same number (though 
benchmarking tools like bench::bench_memory() do not see this RAM use, 
reporting only a few hundred KB, presumably because they track only R-level 
allocations).
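
For what it's worth, here is a sketch of how I am checking these numbers from 
within R; the VmHWM read from /proc is just one way to capture the same 
peak-RSS figure that top/glances report (Linux only):

{code:r}
library(arrow)

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))

# bench reports only a few hundred KB here (it appears to track only R-level allocations)
bench::bench_memory(write_dataset(df, tempfile()))

# peak resident set size of this R process as the kernel sees it (Linux only);
# this is the number that matches the ~30 GB "RES" peak in top/glances
grep("VmHWM", readLines("/proc/self/status"), value = TRUE)
{code}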

Sorry, I don't know my way around this better.  Can you successfully run the 
code above?  (It doesn't require any authentication and should work as a 
reprex; I've run it in a lot of places.)  On every platform on which I run 
those three lines, I see RAM use 2-3x higher on arrow 9.0.0 than on 8.0.0, and 
always far higher than the 4 GB file size.  Is that expected?

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB 
> parquet file) and streams it to disk:
>  
> {code:r}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
>                        anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already 
> surprisingly high given that the whole file is only 4 GB on disk.  On arrow 
> 9.0.0, RAM use for the same operation approximately doubles, which is enough 
> to trigger the OOM killer on the task in several of our active production 
> workflows. 
>  
> Can the large RAM use increase introduced in 9.0.0 be avoided?  Is it 
> possible for this operation to use even less RAM than it does in the 8.0.0 
> release?  Is there something about this particular parquet file that could be 
> responsible for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are unexpected at this scale 
> (i.e. a single 4 GB parquet file); as R users we really depend on arrow's 
> out-of-core operations to work with larger-than-RAM data.
>  


