[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Boettiger updated ARROW-17541:
-----------------------------------

Description:

Consider the following reprex, which opens a remote dataset (a single 4 GB parquet file) and streams it to disk:

{code:r}
s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))
arrow::write_dataset(df, tempfile())
{code}

In 8.0.0, this operation peaks at roughly 10 GB of RAM, which is already surprisingly high given that the whole file is only 4 GB on disk. On arrow 9.0.0, RAM use for the same operation approximately doubles, which is enough to trigger the OOM killer on the task in several of our active production workflows.

Can the RAM use increase introduced in 9.0.0 be avoided? Is it possible for this operation to use even less RAM than it does in the 8.0.0 release? Is there something about this particular parquet file that could be responsible for the large RAM use?

Arrow's impressively fast performance on large data on remote hosts is really game-changing for us. Still, OOM errors are a bit unexpected at this scale (a single 4 GB parquet file); as R users we depend on arrow's out-of-core operation to work with larger-than-RAM data.
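For reference, here is a minimal sketch of how the peak could be observed and possibly bounded. It assumes the documented R bindings for {{default_memory_pool()}} and {{Scanner$create()}} (including its {{batch_size}} argument) and that {{write_dataset()}} accepts a RecordBatchReader; the specific {{batch_size}} value is an illustrative guess, not a tested recommendation:

{code:r}
library(arrow)

pool <- default_memory_pool()  # tracks Arrow's own allocations, separate from R's heap

s3 <- s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))

# Scan in smaller batches and hand write_dataset() a RecordBatchReader,
# so that only a bounded window of rows is materialized at a time.
scanner <- Scanner$create(df, batch_size = 1e5)
write_dataset(scanner$ToRecordBatchReader(), tempfile())

pool$max_memory  # peak bytes allocated by Arrow during the operation
{code}

The {{max_rows_per_group}} / {{min_rows_per_group}} arguments to {{write_dataset()}} may be another relevant knob on the writing side, though we have not confirmed whether they affect the peak here.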
> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)