[ https://issues.apache.org/jira/browse/ARROW-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423038#comment-17423038 ]
Carl Boettiger edited comment on ARROW-13611 at 9/30/21, 10:59 PM:
-------------------------------------------------------------------

Any news on this? I believe this bug is still the cause of the system crashes I see when trying to access large parquet files in arrow, e.g. in R:

{code:r}
library(arrow)
library(dplyr)
file <- "part-0.parquet"
download.file("https://minio.cirrus.carlboettiger.info/shared-data/birddb/parquet/part-0.parquet", file)
ds <- open_dataset(file, format = "parquet")
ds %>% filter(COUNTRY == "Mexico", `COMMON NAME`=="Wood thrush") %>% compute()
{code}


was (Author: cboettig):
    Any news on this? I believe this bug is still the cause of the system crashes I see when trying to access large parquet files in arrow, e.g. in R:

```r
library(arrow)
library(dplyr)
file <- "part-0.parquet"
download.file("https://minio.cirrus.carlboettiger.info/shared-data/birddb/parquet/part-0.parquet", file)
ds <- open_dataset(file, format = "parquet")
## OOM after consuming ~ 100 GB of RAM, crashes R
ds %>% filter(COUNTRY == "Mexico", `COMMON NAME`=="Wood thrush") %>% compute()
```

> [C++] Scanning datasets does not enforce back pressure
> ------------------------------------------------------
>
>                 Key: ARROW-13611
>                 URL: https://issues.apache.org/jira/browse/ARROW-13611
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.0, 5.0.0, 4.0.1
>            Reporter: Weston Pace
>            Priority: Major
>             Fix For: 6.0.0
>
>
> I have a simple test case where I scan the batches of a 4GB dataset and print
> out the currently used memory:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
> num_rows = 0
> for batch in dataset.to_batches():
>     print(pa.total_allocated_bytes())
>     num_rows += batch.num_rows
> print(num_rows)
> {code}
> In pyarrow 3.0.0 this consumes just over 5MB. In pyarrow 4.0.0 and 5.0.0
> this consumes multiple GB of RAM.

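
For readers unfamiliar with the term in the issue title, "back pressure" means that a producer is forced to pause when its consumer falls behind, so that unconsumed batches cannot pile up in memory. The sketch below illustrates the general idea with a bounded queue from the Python standard library. It is not Arrow's scanner and does not use any Arrow API; the function name `scan_with_backpressure` and the parameter `max_queued` are hypothetical and exist only for this illustration.

```python
import queue
import threading


def scan_with_backpressure(num_batches, max_queued=4):
    """Pass num_batches items through a bounded queue.

    Returns (consumed, peak_queued). Hypothetical helper for illustration
    only; it does not reflect how the Arrow scanner is implemented.
    """
    # A bounded queue: put() blocks once max_queued items are waiting,
    # which is what stalls the producer when the consumer is slow.
    batches = queue.Queue(maxsize=max_queued)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            batches.put(i)  # blocks while the queue is full
        batches.put(done)

    worker = threading.Thread(target=producer)
    worker.start()

    consumed = 0
    peak_queued = 0
    while True:
        # qsize() is approximate but can never exceed maxsize.
        peak_queued = max(peak_queued, batches.qsize())
        item = batches.get()
        if item is done:
            break
        consumed += 1

    worker.join()
    return consumed, peak_queued
```

With `max_queued=4`, at most four items are ever buffered regardless of how many are produced. Replacing the bounded queue with an unbounded one (`maxsize=0`) would let the producer run arbitrarily far ahead of the consumer, which is the kind of unbounded growth in memory use that the issue reports.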
-- This message was sent by Atlassian Jira (v8.3.4#803005)