[ https://issues.apache.org/jira/browse/ARROW-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335679#comment-17335679 ]
David Li edited comment on ARROW-12603 at 4/29/21, 4:59 PM:
------------------------------------------------------------

Thanks for the bug report & the reproduction case! In this case, it looks like it's already been fixed on the master branch (i.e. for 5.0.0) in ARROW-12500:
{noformat}
> ds %>% select(target) %>% collect()
# A tibble: 53,000 x 1
   target
   <chr>
 1 1 wk ahead inc case
 2 1 wk ahead inc case
 3 1 wk ahead inc case
 4 1 wk ahead inc case
 5 1 wk ahead inc case
 6 1 wk ahead inc case
 7 1 wk ahead inc case
 8 1 wk ahead inc case
 9 2 wk ahead inc case
10 2 wk ahead inc case
# … with 52,990 more rows
{noformat}
Are you able to try the development release?


> [R] open_dataset ignoring provided schema when using select
> -----------------------------------------------------------
>
>                 Key: ARROW-12603
>                 URL: https://issues.apache.org/jira/browse/ARROW-12603
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.0
>         Environment: R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
>            Reporter: Eu Jing Chua
>            Priority: Major
>
> While the following snippet works with arrow 3.0.0, it fails after updating
> to arrow 4.0.0.
> An example CSV that can be used to replicate this can be found
> [here|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/Karlen-pypm/2021-04-25-Karlen-pypm.csv]
> {code:bash}
> .
> ├── data
> │   └── 2021-04-25-Karlen-pypm.csv
> └── test.R
> {code}
> {code:r}
> library(arrow)
> library(tidyverse)
> sch <- schema(forecast_date=string(),
>               target=string(),
>               target_end_date=string(),
>               location=string(),
>               type=string(),
>               quantile=string(),
>               value=string())
> ds = open_dataset("data", format = "csv", schema = sch)
> ds %>% select(target) %>% collect()
> {code}
> The error is:
> {{Error: Invalid: In CSV column #3: CSV conversion error to int64: invalid
> value 'US'}}
> However, it should be noted that these all run well and return a data frame
> with the right schema.
> {code:r}
> ds %>% collect()
> ds %>% select(target, location) %>% collect()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)