Carl Boettiger created ARROW-17541:
--------------------------------------

             Summary: [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
                 Key: ARROW-17541
                 URL: https://issues.apache.org/jira/browse/ARROW-17541
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 9.0.0
            Reporter: Carl Boettiger


Consider the following reprex, which opens a remote dataset (a single 4 GB 
parquet file) and streams it to disk:

 
s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                       anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))
arrow::write_dataset(df, tempfile())
 

In 8.0.0, this operation peaks at roughly 10 GB of RAM use, which is already 
surprisingly high given that the whole file is only 4 GB on disk. On arrow 
9.0.0, RAM use for the same operation approximately doubles, which is enough 
to trigger the OOM killer on this task in several of our active production 
workflows. 

 

Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
for this operation to use even less RAM than it does in the 8.0 release?  Is 
there something about this particular parquet file that could be responsible 
for the large RAM use? 
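 

In case it helps triage, below is a hedged sketch (untested on this dataset) 
of knobs that might bound peak memory if the growth comes from scanning and 
writing many row groups or threads concurrently. set_cpu_count(), 
set_io_thread_count(), and max_rows_per_group are existing arrow R options, 
but whether they address this particular regression is only an assumption:

library(arrow)

set_cpu_count(2)          # limit concurrent scan/exec threads
set_io_thread_count(2)    # limit concurrent S3 reads

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))
write_dataset(df, tempfile(),
              max_rows_per_group = 1e5)   # write smaller row groups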

 

Arrow's impressively fast performance on large data on remote hosts is really 
game-changing for us.  Still, the OOM errors are a bit unexpected at this scale 
(i.e. a single 4 GB parquet file); as R users we really depend on arrow's 
out-of-core operations to work with larger-than-RAM data.

 


