[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585635#comment-17585635 ]

Weston Pace commented on ARROW-17541:
-------------------------------------

Are you writing the file locally?  How are you measuring RAM usage?

Is that 8-10GB in the process' RSS space?  If so, we probably need to expose 
controls in R to lower the readahead amount (there are some other long-term 
options, but this would be a short-term fix that would let you trade off RAM 
vs. I/O throughput).
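
In the meantime, a rough workaround sketch from the R side (this only throttles 
concurrency; it is *not* the readahead control itself, which isn't exposed in R 
yet) is to shrink the I/O and CPU thread pools before running the copy:

{code:r}
library(arrow)

# Fewer I/O threads -> fewer fragments/batches fetched from S3 in parallel,
# which limits how much data can pile up in memory ahead of the writer.
# (Trade-off: lower download throughput.)
set_io_thread_count(2)
set_cpu_count(2)

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile())
{code}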

On the other hand, if the process' RSS space is lower, but the system available 
memory is still low (e.g. output from the {{free}} command) then it's probably 
because your input throughput (e.g. downloading from S3) is faster than your 
disk's write speed (not too surprising with an HDD or even some SSDs).  What 
happens in this case is the writes are simply memcpy'ing data from RSS into the 
kernel page cache and marking the pages dirty.  The write doesn't actually 
persist to the disk until sometime later (even possibly after the process has 
ended).  If your write is slower than your read then the kernel's page cache 
will fill up and clobber all other memory, pushing it to swap or even 
potentially invoking the oom killer.  On the bright side, we do have a [PR in 
progress|https://github.com/apache/arrow/pull/13640] which should alleviate 
this problem, at least on Linux.

Can you do some investigation to try and figure out which of these 
possibilities we are encountering?
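
Something along these lines would distinguish the two cases (a rough sketch 
assuming Linux and the {{ps}} package, which is not part of arrow):

{code:r}
# Rough sketch: run this periodically during the write (or from a second R
# session) to compare the R process' RSS with system-wide memory.
# Assumes Linux and the `ps` package.
library(ps)

report_mem <- function(pid = NULL) {
  handle <- if (is.null(pid)) ps_handle() else ps_handle(pid)
  rss_gb <- ps_memory_info(handle)[["rss"]] / 1024^3
  cat(sprintf("process RSS: %.1f GB\n", rss_gb))
  # Same numbers the `free` command reports (total/used/free/available)
  cat(system("free -h", intern = TRUE), sep = "\n")
}

report_mem()            # in the session doing the write, or
report_mem(pid = 12345) # hypothetical PID of the writing R process
{code}

If the process' RSS itself climbs to 8-10GB we're in the readahead case; if RSS 
stays modest but the "available" column keeps shrinking, it's the dirty 
page cache case.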

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Consider the following example of opening a remote dataset (a single 4 GB 
> parquet file) and streaming it to disk, as in this reprex:
>  
> {code:r}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
>                        anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
> In 8.0.0, this operation peaks at roughly 10 GB of RAM use, which is already 
> surprisingly high given that the whole file is only 4 GB on disk, but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in the 8.0 release?  Is 
> there something about this particular parquet file that could be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. a single 4 GB parquet file); as R users we really depend on 
> arrow's out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
