Dominic Dennenmoser created ARROW-8813:
------------------------------------------

             Summary: Implementing tidyr interface
                 Key: ARROW-8813
                 URL: https://issues.apache.org/jira/browse/ARROW-8813
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Dominic Dennenmoser


I think it would be reasonable to implement an interface to the {{tidyr}} 
package. This would allow Arrow Tables to be processed lazily before they are 
pulled into memory. Currently, however, you need to collect the table 
before applying any tidyr methods. The following code chunk shows an example 
routine:
{code:r}
library(magrittr)
arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) 
nested_df <-
   arrow_table %>%
   dplyr::select(ID, 4:7, Value) %>%
   dplyr::filter(Value >= 5) %>%
   dplyr::group_by(ID) %>%
   dplyr::collect() %>%
   tidyr::nest(){code}
The main focus might be the following three groups of methods:
 * {{tidyr::[un]nest()}},
 * {{tidyr::pivot_[longer|wider]()}}, and
 * {{tidyr::separate()}}.

I suppose the last two can be implemented fairly quickly, but {{tidyr::nest()}} 
and {{tidyr::unnest()}} cannot be implemented until conversion to List<Struct> 
is available.
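
For reference, here is a minimal sketch of what the verbs listed above do on an 
in-memory tibble today, which is presumably the behaviour an Arrow-backed 
implementation would need to match. The tibble and its column names ({{ID}}, 
{{Key}}, {{Value}}) are made up for illustration and do not come from the 
Feather file in the example above.
{code:r}
library(magrittr)

# Hypothetical stand-in for a table that has already been collected.
df <- tibble::tibble(
  ID    = c("a", "a", "b"),
  Key   = c("x_1", "y_2", "x_3"),
  Value = c(5, 7, 9)
)

# nest(): one row per ID; the other columns are packed into a
# list-column of tibbles (conceptually a List<Struct> column).
nested <- df %>% tidyr::nest(data = c(Key, Value))

# unnest(): the inverse, expanding the list-column back into rows.
unnested <- nested %>% tidyr::unnest(data)

# separate(): split one character column into several on a separator.
separated <- df %>%
  tidyr::separate(Key, into = c("Name", "Index"), sep = "_")

# pivot_wider(): spread the values of Name across new columns.
wide <- separated %>%
  tidyr::pivot_wider(id_cols = ID, names_from = Name, values_from = Value)

# pivot_longer(): gather those columns back into name/value pairs.
long <- wide %>%
  tidyr::pivot_longer(cols = -ID, names_to = "Name", values_to = "Value",
                      values_drop_na = TRUE)
{code}
Only {{nest()}} and {{unnest()}} produce or consume a list-column here, which 
is why they depend on the List<Struct> conversion while the other verbs do not.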



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
