[jira] [Commented] (ARROW-12960) [C++][R] Option for is_nan(null) to evaluate to false or true

Dewey Dunnington (Jira) Fri, 17 Dec 2021 11:27:42 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-12960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461623#comment-17461623
 ]


Dewey Dunnington commented on ARROW-12960:
------------------------------------------

Reprex (inside a dplyr pipeline, {{is.nan()}} already returns {{FALSE}} for {{NA}}):

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

record_batch(
  dbl = c(1, NA, NaN),
  float = Array$create(c(1, NA, NaN))$cast(float32())
) %>% 
  transmute(
    dbl_is_nan = is.nan(dbl),
    float_is_nan = is.nan(float)
  ) %>% 
  collect()
#> # A tibble: 3 × 2
#>   dbl_is_nan float_is_nan
#>   <lgl>      <lgl>       
#> 1 FALSE      FALSE       
#> 2 FALSE      FALSE       
#> 3 TRUE       TRUE
{code}


Where the relevant code lives (R bindings, tests, and the C++ kernel declaration):

https://github.com/apache/arrow/blob/master/r/R/dplyr-functions.R#L96-L105

https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr-funcs-type.R#L182-L203

https://github.com/apache/arrow/blob/master/r/src/compute.cpp#L129-L130

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h#L295-L302



> [C++][R] Option for is_nan(null) to evaluate to false or true
> -------------------------------------------------------------
>
>                 Key: ARROW-12960
>                 URL: https://issues.apache.org/jira/browse/ARROW-12960
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Ian Cook
>            Assignee: Christian Cordova
>            Priority: Major
>              Labels: good-first-issue, kernel
>             Fix For: 7.0.0
>
>
> (This is the flip side of ARROW-12959.)
> Currently the Arrow compute kernel {{is_nan}} always treats {{null}} as a 
> missing value, returning {{null}} at positions of the input datum with 
> {{null}} (missing) values.
> It would be helpful to be able to control this behavior with an option. The 
> option could be named {{value_for_null}} or something similar and it would 
> take a nullable boolean scalar.  It would default to {{null}}, consistent 
> with current behavior. When set to {{false}} or {{true}}, it would return 
> {{false}} or {{true}} at positions of the input datum with {{null}} values.
> Among other things, this would enable the {{arrow}} R package to evaluate 
> {{is.nan()}} consistently with the way base R does. In base R, {{is.nan()}} 
> returns {{FALSE}} on {{NA}}. But in the {{arrow}} R package, it returns 
> {{NA}}:
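> {code:r}
> library(arrow)
> # Base R: NA is not NaN, so the result is FALSE
> is.nan(c(3.14, NA, NaN))
> ##[1] FALSE FALSE  TRUE
> # Arrow Array: the null propagates, so the result is NA
> as.vector(is.nan(Array$create(c(3.14, NA, NaN))))
> ##[1] FALSE    NA  TRUE
> {code}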
> I think an option in the C++ kernel is the best solution, because I suspect 
> there are other cases in which users would want {{is_nan}} to return only 
> non-null values without calling a second kernel to fill in the nulls. 
> However, it would also be possible to solve this in the R package alone, by 
> changing how {{is.nan}} is mapped there. If we choose to go that route, we 
> should change this Jira issue summary to "[R] Make is.nan(NA) consistent with 
> base R".
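
To make the proposed semantics concrete, here is a minimal sketch of how a {{value_for_null}} option might behave. It does not use the Arrow API: {{NullableDouble}}, {{NullableBool}}, {{IsNanOptions}} and this {{IsNan}} function are hypothetical stand-ins built on {{std::optional}} (C++17), and only the option name {{value_for_null}} comes from the issue text.

```cpp
#include <cmath>
#include <optional>
#include <vector>

// Hypothetical stand-ins for nullable Arrow values (not Arrow types).
using NullableDouble = std::optional<double>;
using NullableBool = std::optional<bool>;

// Hypothetical options struct. `value_for_null` is the name suggested in the
// issue; the default (no value) keeps the current behavior: null in, null out.
struct IsNanOptions {
  NullableBool value_for_null = std::nullopt;
};

// Element-wise is_nan over a vector of nullable doubles.
std::vector<NullableBool> IsNan(const std::vector<NullableDouble>& input,
                                const IsNanOptions& options = IsNanOptions{}) {
  std::vector<NullableBool> out;
  out.reserve(input.size());
  for (const auto& value : input) {
    if (value.has_value()) {
      // Non-null input: ordinary NaN check.
      out.push_back(std::isnan(*value));
    } else {
      // Null input: emit whatever the option says (null, false, or true).
      out.push_back(options.value_for_null);
    }
  }
  return out;
}
```

Under these assumptions, calling {{IsNan}} with the default options leaves a null at each null position, while setting {{value_for_null}} to {{false}} would give the base R result for {{is.nan(NA)}}.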



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
