[ https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane updated ARROW-15731:
-----------------------------------
    Description: 
Currently, Arrow joins on data that contains a list column raise an error, even when the list column is not a join key:

{code}
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list<item: string> is not supported in join non-key field
{code}

The ability to join would be a useful enhancement for workflows with tabular data where list columns can be common, and for geospatial workflows where geometry columns are stored as `list` or `fixed_size_list` (thanks [~paleolimbot] for mentioning that use case).

Related discussion here: ARROW-14519

  was:
Currently, Arrow joins on data that contains a list column raise an error, even when the list column is not a join key:

``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list<item: string> is not supported in join non-key field
```

The ability to join would be a useful enhancement for workflows with tabular data where list columns can be common, and for geospatial workflows where geometry columns are stored as `list` or `fixed_size_list` (thanks [~paleolimbot] for mentioning that use case).

Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519


> [C++] Enable joins when data contains a list column
> ---------------------------------------------------
>
>                 Key: ARROW-15731
>                 URL: https://issues.apache.org/jira/browse/ARROW-15731
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Stephanie Hazlitt
>            Priority: Major
>
> Currently, Arrow joins on data that contains a list column raise an error,
> even when the list column is not a join key:
> {code}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
>                    jedi = c(FALSE, TRUE))
> arrow_table(starwars) %>%
>   left_join(jedi) %>%
>   collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Data type list<item: string> is not supported in join non-key field
> {code}
> The ability to join would be a useful enhancement for workflows with tabular
> data where list columns can be common, and for geospatial workflows where
> geometry columns are stored as `list` or `fixed_size_list` (thanks
> [~paleolimbot] for mentioning that use case).
> Related discussion here: ARROW-14519



--
This message was sent by Atlassian Jira
(v8.20.1#820001)