Hi Andrew, thanks for the question. Try passing a schema to `open_dataset` in which d1 is declared as date32[day]. When I do that, the field gets the correct type and the values look correct too.
schm <- schema(bb)
# SetField() takes a 0-based index, so 6 is the 7th field (d1) in bb's schema.
new_schm <- schm$SetField(6, arrow::field("d1", arrow::date32()))
bb <- arrow::open_dataset(..., schema = new_schm)

On Tue, Feb 27, 2024 at 7:59 PM Andrew Piskorski <a...@piskorski.com> wrote:
>
> Hi, using the R arrow package version 14.0.2.1, I'm stumped by
> something seemingly simple. For date columns, I like to use R's Date
> class, which is stored internally as a number but prints as a
> YYYY-MM-DD string.
>
> In most cases arrow handles these Date columns nicely. The exception
> is when I partition on a Date column, as in column "d1" in my example
> below. When I read my data back in with open_dataset(), the d1 column
> is now a string instead of Date. In contrast, the types of all the
> other columns are preserved, including my "d2" Date column, because I
> did not partition on that one.
>
> It sort of makes sense that d1 is now a string, because the directory
> names on disk really are strings like "2024-01-01". But I'd really
> like to convert it back to the Date class format! In plain R that's
> easy, but with the Dataset mmap-ed on disk, I don't know how to do it.
>
> What should I do to get arrow to convert the partitioned d1 column to
> Arrow's date32[day] type, and thus back to R's Date class? Can I
> somehow do this directly on the Dataset object itself, WITHOUT first
> converting it to ArrowTabular or data.frame?
>
> Thanks for your help!
>
>
> Example follows:
> --------------------------------------------------
> require("arrow")
> my.dir <- "/tmp/arrow"
> # Example data with some Date-class columns:
> aa <- do.call("rbind" ,lapply(split(iris ,iris$Species) ,function(xx){
>    cbind(head(xx ,5)
>          ,d1=(as.Date('2024-01-01') + 0:4)
>          ,d2=(as.Date('1980-01-01') + 0:4))
> })); rownames(aa) <- NULL
> arrow::write_dataset(aa ,my.dir ,partitioning=c('d1') ,hive_style=FALSE
>    ,format="feather" ,codec=Codec$create("LZ4_FRAME"))
> bb <- arrow::open_dataset(my.dir ,format="feather" ,unify_schemas=TRUE
>    ,partitioning=c('d1'))
> # Unfortunately the "d1" column is now a string.
>
> > dim(aa)
> [1] 15 7
> > class(aa)
> [1] "data.frame"
>
> > sapply(aa ,class)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"       "Date"
> > sapply(aa ,storage.mode)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
>     "double"     "double"     "double"     "double"    "integer"     "double"     "double"
>
> > dim(bb)
> [1] 15 7
> > class(bb)
> [1] "FileSystemDataset" "Dataset" "ArrowObject" "R6"
>
> > bb$schema$d1
> Field
> d1: string
> > bb$schema$d2
> Field
> d2: date32[day]
>
> > bb
> FileSystemDataset with 5 Feather files
> Sepal.Length: double
> Sepal.Width: double
> Petal.Length: double
> Petal.Width: double
> Species: dictionary<values=string, indices=int8>
> d2: date32[day]
> d1: string
>
> See $metadata for additional Schema metadata
>
> > sapply(arrow:::as.data.frame.ArrowTabular(bb$NewScan()$Finish()$ToTable()) ,class)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d2           d1
>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"  "character"
> --------------------------------------------------
>
> --
> Andrew Piskorski <a...@piskorski.com>
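
For completeness, here is a minimal end-to-end sketch of the schema approach from the top of this message, on a smaller dataset than the one quoted above. The scratch directory, the toy data frame, and looking up d1's position by name (rather than hard-coding 6) are my own assumptions for illustration, and I have only tried the approach against the quoted example, so treat this as a starting point rather than a guaranteed recipe:

```r
library(arrow)

# Hypothetical scratch directory for this sketch; any empty directory works.
my.dir <- file.path(tempdir(), "arrow_d1_example")

# Toy data with a Date column to partition on (an assumption, not the iris example).
aa <- data.frame(x = 1:4, d1 = as.Date("2024-01-01") + 0:3)

write_dataset(aa, my.dir, partitioning = "d1", hive_style = FALSE,
              format = "feather")

# First open: the type of d1 is inferred from the directory names.
bb <- open_dataset(my.dir, format = "feather", partitioning = "d1")

# Find d1's position by name. SetField() expects a 0-based index,
# while match() returns a 1-based one, hence the "- 1L".
schm <- bb$schema
idx <- match("d1", names(schm)) - 1L
new_schm <- schm$SetField(idx, field("d1", date32()))

# Reopen with the same arguments plus the patched schema.
bb <- open_dataset(my.dir, format = "feather", partitioning = "d1",
                   schema = new_schm)

# Inspect the resulting field type.
print(bb$schema$d1)
```

Looking the field up by name avoids depending on where the partition column ends up in the schema, which can differ from its position in the original data frame.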