[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530962#comment-17530962 ]
Carl Boettiger commented on ARROW-15081:
----------------------------------------

Thanks Weston, I'll try that. Just to make sure I'm testing the right thing: will it suffice to test the nightlies, arrow-7.0.0.20220501? With that version I still see high RAM use that leads to a crash (i.e. once it exceeds the 50 GB of RAM I allocate to my container), which should be reproducible with this example:

{code:java}
## library(arrow)
library(dplyr)
packageVersion("arrow")

path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous = TRUE)
obs <- arrow::open_dataset(path)

tmp <- obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop")

tmp <- tmp |> compute()  # crashes
{code}

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
>
> {code:java}
> library(arrow)
> library(dplyr)
>
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
>
> path$ls()           # observe: a single parquet file
> obs %>% count()     # CRASH
> obs %>% to_duckdb() # also crashes
> {code}
>
> I have attempted to split this large (~100 GB) parquet file into several smaller files, which helps:
>
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
>
> path$ls()       # observe: multiple parquet files now
> obs %>% count()
> {code}
>
> (These parquet files were also created by arrow, by the way, from a single large csv file provided by the original data provider, eBird. Unfortunately, generating the partitioned versions is cumbersome: the data is very unevenly distributed, only a few columns can be used for partitioning without creating thousands of parquet partition files, and even then the bulk of the ~1 billion rows fall within the same group. All the same, I think this is a bug, since there is no obvious reason why arrow should not be able to handle a single 100 GB parquet file.)
>
> Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB of RAM.
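
The quoted report notes that splitting the single ~100 GB parquet file into smaller files helps avoid the crash. The following is a minimal sketch (not from the original thread) of one way such a split could be done: stream the remote dataset back out with arrow::write_dataset(), capping rows per output file instead of relying on a partitioning column. The local output directory "observations-split" and the 10-million-row cap are illustrative choices, and the max_rows_per_file argument is assumed to be available in the installed arrow version (it is exposed in newer releases).

{code:java}
## Sketch only: rewrite the single large parquet file as a multi-file dataset.
## "observations-split" and the row cap are hypothetical; max_rows_per_file is
## assumed to be supported by the installed arrow version.
library(arrow)
library(dplyr)

path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous = TRUE)
obs <- arrow::open_dataset(path)

## Stream the remote dataset into ~10-million-row parquet files locally,
## without a partitioning column (the data is very unevenly distributed).
arrow::write_dataset(obs,
                     path = "observations-split",
                     format = "parquet",
                     max_rows_per_file = 1e7)

## The multi-file copy can then be opened and aggregated as before.
obs2 <- arrow::open_dataset("observations-split")
obs2 |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop") |>
  compute()
{code}

Capping rows per file sidesteps the uneven-distribution problem described above, since file boundaries are chosen by size rather than by column values; whether it also avoids the OOM during aggregation would still need to be verified against the nightly build.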