[ https://issues.apache.org/jira/browse/ARROW-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Cook updated ARROW-14649:
-----------------------------
    Description:
ARROW-14167 added support for factors in {{coalesce()}}, but the factors that are returned will not necessarily retain the factor levels like {{coalesce()}} does when used on an R data frame.

For example, compare these, noticing the difference in the levels:
{code:r}
# R data frame
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a b c
{code}
{code:r}
# Arrow Table
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  Table$create() %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a c
{code}
Similarly, ARROW-13358 and ARROW-14659 added support for factors in {{if_else()}}, but the returned factors will not always retain the levels like {{if_else()}} does when used on an R data frame.

I'm not sure if it is practical to make Arrow return the factors with the unused levels included like R does. If so, we should do it.

See the tests in {{test-dplyr-funcs-conditional.R}} that refer to this Jira.

  was:
ARROW-14167 added support for factors in {{coalesce()}}, but the factors that are returned will not necessarily retain the factor levels like {{coalesce()}} does when used on an R data frame.

For example, compare these, noticing the difference in the levels:
{code:r}
# R data frame
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a b c
{code}
{code:r}
# Arrow Table
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  Table$create() %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a c
{code}
I'm not sure if it is practical to make Arrow return the factors with the unused levels included like R does. If so, we should do it.

See the test in {{test-dplyr-funcs-conditional.R}} that refers to this Jira.

> [R] Include unused factor levels in coalesce() and if_else() output
> -------------------------------------------------------------------
>
>                 Key: ARROW-14649
>                 URL: https://issues.apache.org/jira/browse/ARROW-14649
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Minor
>
> ARROW-14167 added support for factors in {{coalesce()}}, but the factors
> that are returned will not necessarily retain the factor levels like
> {{coalesce()}} does when used on an R data frame.
> For example, compare these, noticing the difference in the levels:
> {code:r}
> # R data frame
> tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
>   mutate(y = coalesce(x, y)) %>%
>   pull(y)
> #> [1] a c
> #> Levels: a b c
> {code}
> {code:r}
> # Arrow Table
> tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
>   Table$create() %>%
>   mutate(y = coalesce(x, y)) %>%
>   pull(y)
> #> [1] a c
> #> Levels: a c
> {code}
> Similarly, ARROW-13358 and ARROW-14659 added support for factors in
> {{if_else()}}, but the returned factors will not always retain the levels like
> {{if_else()}} does when used on an R data frame.
> I'm not sure if it is practical to make Arrow return the factors with the
> unused levels included like R does. If so, we should do it.
> See the tests in {{test-dplyr-funcs-conditional.R}} that refer to this Jira.
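
Until the Arrow behavior changes, one possible workaround is to re-apply the union of the input levels after the result has been pulled into R. The sketch below uses only base R; {{restore_levels()}} is a hypothetical helper written for illustration and is not part of the arrow package, and {{arrow_result}} is a hand-built stand-in for the factor the Arrow query in the description returns. It assumes the desired level order is the inputs' levels in order of first appearance, which matches the data frame output in this example.

```r
# Inputs from the example in the description.
x <- factor(c("a", NA_character_))
y <- factor(c("b", "c"))

# Stand-in for the Arrow result: values "a" and "c", with only the used levels.
arrow_result <- factor(c("a", "c"))

# Hypothetical helper (not an arrow function): rebuild the factor using the
# union of the levels of every input factor, in order of first appearance.
restore_levels <- function(result, ...) {
  all_levels <- Reduce(union, lapply(list(...), levels))
  factor(as.character(result), levels = all_levels)
}

restored <- restore_levels(arrow_result, x, y)
levels(restored)
```

This only patches the levels on the R side after collection; it does not change what Arrow computes, so it is not a substitute for the improvement requested in this issue.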