[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585635#comment-17585635 ]
Weston Pace commented on ARROW-17541:
-------------------------------------

Are you writing the file locally? How are you measuring RAM usage? Is that 8-10 GB in the process' RSS space? If so, we probably need to expose controls in R to lower the readahead amount (there are some other long-term options, but this would be a short-term fix that would let you trade off RAM vs. I/O throughput).

On the other hand, if the process' RSS space is lower but the system's available memory is still low (e.g. in the output of the {{free}} command), then it's probably because your input throughput (e.g. downloading from S3) is faster than your disk's write speed (not too surprising with an HDD or even some SSDs). What happens in this case is that the writes simply memcpy data from RSS into the kernel page cache and mark the pages dirty. The write doesn't actually persist to disk until some time later (possibly even after the process has ended). If your write is slower than your read, then the kernel's page cache will fill up and clobber all other memory, pushing it to swap or even potentially invoking the OOM killer. On the bright side, we do have a [PR in progress|https://github.com/apache/arrow/pull/13640] which should alleviate this problem, at least on Linux.

Can you do some investigation to try to figure out which of these possibilities we are encountering? (A rough diagnostic sketch is appended at the end of this message.)

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB parquet file) and streams it to disk:
>
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
>
> In 8.0.0 this operation peaks at roughly 10 GB of RAM use, which is already surprisingly high (the whole file is only 4 GB on disk), but in arrow 9.0.0 RAM use for the same operation approximately doubles, which is large enough to trigger the OOM killer on the task in several of our active production workflows.
>
> Can this large RAM use increase introduced in 9.0 be avoided? Is it possible for this operation to use even less RAM than it does in the 8.0 release? Is there something about this particular parquet file that could be responsible for the large RAM use?
>
> Arrow's impressively fast performance on large data on remote hosts is really game-changing for us. Still, the OOM errors are a bit unexpected at this scale (i.e. a single 4 GB parquet file); as R users we really depend on arrow's out-of-core operations to work with larger-than-RAM data.
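A minimal way to tell the two cases apart, sketched below under the assumption of a Linux host with {{/proc}} available (the helper name {{report_memory}} and the exact fields it prints are illustrative only, not part of arrow): while {{write_dataset()}} is running, look at the writing process' {{VmRSS}}, the system-wide {{Dirty}}/{{Writeback}} counters, and the {{free}} output.

{code}
# Sketch only: assumes Linux and a /proc filesystem. Run from a *second*
# R session while the other R process is executing write_dataset(),
# passing that process' PID (find it with `pgrep -f R` or similar).
report_memory <- function(pid) {
  # Resident set size of the writing process (memory charged to arrow/R).
  status <- readLines(sprintf("/proc/%d/status", pid))
  cat(grep("^VmRSS:", status, value = TRUE), sep = "\n")

  # System-wide dirty/writeback pages: data parked in the kernel page
  # cache that has not yet been flushed to disk.
  meminfo <- readLines("/proc/meminfo")
  cat(grep("^(Dirty|Writeback):", meminfo, value = TRUE), sep = "\n")

  # Overall picture, same numbers the `free` command reports.
  system("free -h")
}

report_memory(pid = 12345)  # replace 12345 with the writing process' PID
{code}

If {{VmRSS}} itself sits at 8-10 GB, that points back at the readahead amount inside the process; if {{VmRSS}} stays modest while {{Dirty}}/{{Writeback}} balloon and available memory shrinks, that is the page-cache scenario the linked PR targets.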
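Separately, and purely as a speculative stopgap (this is not the readahead control referred to above, which is not exposed in R yet): the arrow R package does expose {{set_io_thread_count()}}, which caps the size of the I/O thread pool and therefore how many reads can be in flight at once. Shrinking it may or may not reduce peak memory here, and it will reduce S3 download throughput; a sketch using the reprex from the issue:

{code}
library(arrow)

# Fewer I/O threads => fewer concurrent reads in flight. This is an
# experiment, not a documented fix for this issue, and it will slow
# down the S3 download.
io_thread_count()        # inspect the current value (typically 8)
set_io_thread_count(2)

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile())
{code}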