[ 
https://issues.apache.org/jira/browse/ARROW-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-14649:
-----------------------------
    Description: 
ARROW-14167 added support for factors in {{coalesce()}}, but the factors 
that are returned will not necessarily retain the factor levels like 
{{coalesce()}} does when used on an R data frame.

For example, compare these, noticing the difference in the levels:
{code:r}
# R data frame
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a b c{code}
{code:r}
# Arrow Table
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  Table$create() %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a c{code}
Similarly, ARROW-13358 and ARROW-14659 added support for factors in 
{{if_else()}} but the returned factors will not always retain the levels like 
{{if_else()}} does when used on an R data frame.

I'm not sure if it is practical to make Arrow return the factors with the 
unused levels included like R does. If so, we should do it.
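
As a rough illustration of the target behavior, here is a minimal base R sketch that rebuilds the factor after the fact. It assumes the desired levels are the union of the input levels in order of first appearance, which matches the example above but is only an assumption about the general rule. {{restore_levels()}} is a hypothetical helper, not part of arrow or dplyr, and {{arrow_result}} is a stand-in for what Arrow currently returns.
{code:r}
# Base R only; x and y stand in for the columns in the example above.
x <- factor(c("a", NA_character_))
y <- factor(c("b", "c"))

# Stand-in for the factor Arrow currently returns (levels "a" and "c" only).
arrow_result <- factor(c("a", "c"))

# Hypothetical helper: rebuild the factor using the union of the input
# levels, in order of first appearance across the inputs.
restore_levels <- function(result, ...) {
  all_levels <- unique(unlist(lapply(list(...), levels)))
  factor(as.character(result), levels = all_levels)
}

restored <- restore_levels(arrow_result, x, y)
levels(restored)
#> [1] "a" "b" "c"{code}
This only patches the result once it is back in R. Doing the equivalent inside Arrow would presumably mean unifying the dictionaries of the inputs before the kernel runs.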

See the tests in {{test-dplyr-funcs-conditional.R}} that refer to this Jira.

  was:
ARROW-14167 added support for factors in {{coalesce()}}, but the factors 
that are returned will not necessarily retain the factor levels like 
{{coalesce()}} does when used on an R data frame.

For example, compare these, noticing the difference in the levels:
{code:r}
# R data frame
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a b c{code}
{code:r}
# Arrow Table
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  Table$create() %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a c{code}
I'm not sure if it is practical to make Arrow return the factors with the 
unused levels included like R does. If so, we should do it.

See the test in {{test-dplyr-funcs-conditional.R}} that refers to this Jira.


> [R] Include unused factor levels in coalesce() and if_else() output
> -------------------------------------------------------------------
>
>                 Key: ARROW-14649
>                 URL: https://issues.apache.org/jira/browse/ARROW-14649
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Minor
>
> ARROW-14167 added support for factors in {{coalesce()}}, but the factors 
> that are returned will not necessarily retain the factor levels like 
> {{coalesce()}} does when used on an R data frame.
> For example, compare these, noticing the difference in the levels:
> {code:r}
> # R data frame
> tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
>   mutate(y = coalesce(x, y)) %>%
>   pull(y)
> #> [1] a c
> #> Levels: a b c{code}
> {code:r}
> # Arrow Table
> tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
>   Table$create() %>%
>   mutate(y = coalesce(x, y)) %>%
>   pull(y)
> #> [1] a c
> #> Levels: a c{code}
> Similarly, ARROW-13358 and ARROW-14659 added support for factors in 
> {{if_else()}} but the returned factors will not always retain the levels like 
> {{if_else()}} does when used on an R data frame.
> I'm not sure if it is practical to make Arrow return the factors with the 
> unused levels included like R does. If so, we should do it.
> See the tests in {{test-dplyr-funcs-conditional.R}} that refer to this Jira.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
