[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598124#comment-17598124 ]
Dewey Dunnington commented on ARROW-17541:
------------------------------------------

I'm working on two PRs that touch some of those parts of the code and will investigate at some point in the next two weeks. A few thoughts:

> R's garbage collector will not (I think) run mid-execution.

I believe that's true, although in theory R should not be allocating any memory mid-execution either, and I don't know of any way to allocate R memory without (potentially) running the garbage collector.

> R is holding onto memory when it isn't clear to me it should even be able to
> see the memory.

Would a more precise way to say this be that there is some shared pointer (potentially held by an R6 object that is still in scope and not being destroyed) that is keeping the record batches from being freed? We do have an R reference to the exec plan and to the final node of the exec plan (which, in a dataset write, would be the penultimate node, probably the scan node). It still makes no sense to me why the batches aren't getting released. (A minimal check along these lines is sketched at the end of this message.)


> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>         Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · Plotly Chart Studio.png
>
>
> Consider the following example of opening a remote dataset (a single 4 GB parquet file) and streaming it to disk:
>
> {code:r}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
> {code}
>
> In 8.0.0, this operation peaks at about 10 GB of RAM, which is already surprisingly high given that the whole file is only 4 GB on disk. On arrow 9.0.0, RAM use for the same operation approximately doubles, which is enough to trigger the OOM killer on this task in several of our active production workflows.
>
> Can the large increase in RAM use introduced in 9.0.0 be avoided? Is it possible for this operation to use even less RAM than it does in the 8.0.0 release? Is there something about this particular parquet file that could be responsible for the large RAM use?
>
> Arrow's impressively fast performance on large data on remote hosts is really game-changing for us. Still, the OOM errors are a bit unexpected at this scale (i.e. a single 4 GB parquet file); as R users we really depend on arrow's out-of-core operations to work with larger-than-RAM data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
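
As a rough illustration of the reference-counting question discussed in the comment above, here is a minimal sketch (not part of the original report): it reuses the reporter's bucket, endpoint, and path from the reprex, and assumes the memory-pool accessors default_memory_pool(), bytes_allocated, and max_memory available in recent arrow R releases, to check whether R-visible references are what keeps the record batches alive.

{code:r}
library(arrow)

# Arrow's C++ allocations are tracked by the memory pool: bytes_allocated is
# what is still live, max_memory is the observed peak.
pool <- default_memory_pool()

# The reporter's reprex (bucket, endpoint, and path are theirs; substitute a
# local dataset to reproduce elsewhere).
s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile())

cat("live after write:", pool$bytes_allocated, "bytes;",
    "peak:", pool$max_memory, "bytes\n")

# Drop the R-side handles (the Dataset, and with it any exec plan / node
# references the write created), then force a full collection so that any
# shared_ptr held only through R objects can be released.
rm(df, s3)
gc(full = TRUE)

cat("live after rm() + gc():", pool$bytes_allocated, "bytes\n")
# If this stays near the post-write value, something other than R-visible
# references is keeping the scanned record batches alive.
{code}

If the live allocation count drops after rm() and gc(), the retention is on the R side (an object still in scope or awaiting finalization); if it does not, the shared pointers are presumably being held inside the exec plan machinery itself.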