[ https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632000#comment-17632000 ]
Carl Boettiger commented on ARROW-18114:
----------------------------------------

Any update on this? I think we could realize a substantial performance boost in time, and possibly in RAM, if `unify_schemas=FALSE` allowed us to avoid touching all the parquet files before we need to!

> [R] unify_schemas=FALSE does not improve open_dataset() read times
> ------------------------------------------------------------------
>
>                 Key: ARROW-18114
>                 URL: https://issues.apache.org/jira/browse/ARROW-18114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>
> open_dataset() provides the very helpful optional argument to set
> unify_schemas=FALSE, which should allow arrow to inspect a single parquet
> file instead of touching potentially thousands or more parquet files to
> determine a consistent unified schema. This ought to provide a substantial
> performance increase in contexts where the schema is known in advance.
> Unfortunately, in my tests it seems to have no impact on performance.
> Consider the following reprexes:
> default, unify_schemas=TRUE
> {code:java}
> library(arrow)
> ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
>                 endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time(
>   { open_dataset(ex) }
> ){code}
> about 32 seconds for me.
> manual, unify_schemas=FALSE:
> {code:java}
> bench::bench_time({
>   open_dataset(ex, unify_schemas = FALSE)
> }){code}
> takes about 32 seconds as well.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)