[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585638#comment-17585638 ]
Carl Boettiger commented on ARROW-17541:
----------------------------------------

Thanks Weston for the pointers here! Yes, tempfile() in the example above returns a local path, though I see the unexpectedly high RAM use in a variety of settings:
 * In the example described above, tempfile() for me is a local NVMe disk, while the reading is from a remote server, so I would not expect the read side to be faster than the write side in this case.
 * In our original case, we are streaming out to another S3 bucket, where we see the crash, rather than writing to a local tempfile().

In both settings I'm testing on a Linux platform. On Arrow 9.0 I see peak RAM use of nearly 30 GB in the "RES" column (e.g. in glances/top), or as "used" if I run `free -h` from bash. The RStudio interface shows the same number (though benchmarking tools such as bench::bench_memory() do not see this RAM use, reporting only a few hundred KB). Sorry, I don't know my way around these situations better.

Can you successfully run the code above? (It doesn't require any authentication, so it should be a reprex; I've run it in a lot of places.) On every platform on which I run those three lines, I see RAM use 2-3x higher in arrow 9.0, and always far above the 4 GB file size. Is that expected?

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB parquet file) and streams it to disk:
>
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
>                        anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
>
> In 8.0.0, this operation peaks at about 10 GB of RAM use, which is already surprisingly high (the whole file is only 4 GB on disk), but on arrow 9.0.0 RAM use for the same operation approximately doubles, which is enough to trigger the OOM killer on the task in several of our active production workflows.
>
> Can the large RAM use increase introduced in 9.0 be avoided? Is it possible for this operation to use even less RAM than it does in the 8.0 release? Is there something about this particular parquet file that is responsible for the large RAM use?
>
> Arrow's impressively fast performance on large data on remote hosts is really game-changing for us. Still, the OOM errors are a bit unexpected at this scale (i.e. a single 4 GB parquet file); as R users we really depend on arrow's out-of-core operations to work with larger-than-RAM data.
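For reference, a minimal sketch (assuming Linux, where /proc/self/status exposes the process high-water mark) of how peak resident set size can be checked from inside the R session around the reprex above. bench::bench_memory() only tracks R-level allocations, so it misses Arrow's C++ allocations that top/free report; the peak_rss() helper below is hypothetical, not part of arrow or bench.

{code}
# Sketch, Linux only: VmHWM in /proc/self/status is the process's peak
# resident set size, which includes Arrow's C++ allocations that
# bench::bench_memory() (R-level allocations only) does not see.
peak_rss <- function() {
  grep("^VmHWM:", readLines("/proc/self/status"), value = TRUE)
}

s3 <- arrow::s3_bucket("data",
                       endpoint_override = "minio3.ecoforecast.org",
                       anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))

peak_rss()                             # high-water mark before the write
arrow::write_dataset(df, tempfile())
peak_rss()                             # high-water mark after streaming the dataset
{code}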