Re: R Date class lost when column used for partitioning

Bryce Mecum Fri, 01 Mar 2024 16:40:22 -0800

Hi Andrew, thanks for the question.

Try passing a schema to `open_dataset()` in which d1 is declared as
date32[day]. When I do that, the field comes back with the correct type
and the values look correct too.


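# Start from the schema of the dataset as first opened (d1 is a string here)
schm <- schema(bb)
# SetField() takes a 0-based index; d1 is the 7th field, hence 6
new_schm <- schm$SetField(6, arrow::field("d1", arrow::date32()))
# Re-open the dataset, this time passing the corrected schema
bb <- arrow::open_dataset(..., schema = new_schm)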
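
For anyone who wants to try this end to end, here is a minimal sketch of the same idea. It is an illustration under assumptions, not a tested recipe for your exact data: the scratch directory, the small data frame, and the names `idx`, `bb2` and `df` are made up for the example, and it looks up the position of d1 by name instead of hard-coding the index.

```r
library(arrow)

# Hypothetical scratch location for this sketch
my.dir <- file.path(tempdir(), "arrow_date_partition")

# Small stand-in for the original example: two Date columns
aa <- data.frame(
  x  = 1:5,
  d1 = as.Date("2024-01-01") + 0:4,
  d2 = as.Date("1980-01-01") + 0:4
)

# Partition on d1, without hive-style "d1=..." directory names
write_dataset(aa, my.dir, partitioning = "d1", hive_style = FALSE,
              format = "feather")

# First open: the d1 type is inferred from the directory names
bb <- open_dataset(my.dir, format = "feather", partitioning = "d1")

# SetField() takes a 0-based index, so subtract 1 from R's 1-based match()
idx <- match("d1", bb$schema$names) - 1L
new_schm <- bb$schema$SetField(idx, field("d1", date32()))

# Second open: same arguments, plus the corrected schema
bb2 <- open_dataset(my.dir, format = "feather", partitioning = "d1",
                    schema = new_schm)

# Inspect the type on the Dataset itself, then materialize to check the R class
d1_type <- bb2$schema$d1$type$ToString()
df <- as.data.frame(Scanner$create(bb2)$ToTable())
print(d1_type)
print(class(df$d1))
```

The documentation for `open_dataset()` also says that `partitioning` may itself be a Schema, so something like `partitioning = schema(d1 = date32())` might let you skip opening the dataset twice. I have not tried that variant against your example, so treat it as a suggestion to test.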

On Tue, Feb 27, 2024 at 7:59 PM Andrew Piskorski <a...@piskorski.com> wrote:
>
> Hi, using the R arrow package version 14.0.2.1, I'm stumped by
> something seemingly simple.  For date columns, I like to use R's Date
> class, which is stored internally as a number but prints as a
> YYYY-MM-DD string.
>
> In most cases arrow handles these Date columns nicely.  The exception
> is when I partition on a Date column, as in column "d1" in my example
> below.  When I read my data back in with open_dataset(), the d1 column
> is now a string instead of Date.  In contrast, the types of all the
> other columns are preserved, including my "d2" Date column, because I
> did not partition on that one.
>
> It sort of makes sense that d1 is now a string, because the directory
> names on disk really are strings like "2024-01-01".  But I'd really
> like to convert it back to the Date class format!  In plain R that's
> easy, but with the Dataset mmap-ed on disk, I don't know how to do it.
>
> What should I do to get arrow to convert the partitioned d1 column to
> Arrow's date32[day] type, and thus back to R's Date class?  Can I
> somehow do this directly on the Dataset object itself, WITHOUT first
> converting it to ArrowTabular or data.frame?
>
> Thanks for your help!
>
>
> Example follows:
> --------------------------------------------------
> require("arrow")
> my.dir <- "/tmp/arrow"
> # Example data with some Date-class columns:
> aa <- do.call("rbind" ,lapply(split(iris ,iris$Species) ,function(xx){
>    cbind(head(xx ,5)
>         ,d1=(as.Date('2024-01-01') + 0:4)
>         ,d2=(as.Date('1980-01-01') + 0:4))
> })); rownames(aa) <- NULL
> arrow::write_dataset(aa ,my.dir ,partitioning=c('d1') ,hive_style=FALSE
>    ,format="feather" ,codec=Codec$create("LZ4_FRAME"))
> bb <- arrow::open_dataset(my.dir ,format="feather" ,unify_schemas=TRUE
>    ,partitioning=c('d1'))
> # Unfortunately the "d1" column is now a string.
>
> > dim(aa)
> [1] 15  7
> > class(aa)
> [1] "data.frame"
>
> > sapply(aa ,class)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1
>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"
>           d2
>       "Date"
> > sapply(aa ,storage.mode)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1
>     "double"     "double"     "double"     "double"    "integer"     "double"
>           d2
>     "double"
>
> > dim(bb)
> [1] 15  7
> > class(bb)
> [1] "FileSystemDataset" "Dataset"           "ArrowObject"       "R6"
>
> > bb$schema$d1
> Field
> d1: string
> > bb$schema$d2
> Field
> d2: date32[day]
>
> > bb
> FileSystemDataset with 5 Feather files
> Sepal.Length: double
> Sepal.Width: double
> Petal.Length: double
> Petal.Width: double
> Species: dictionary<values=string, indices=int8>
> d2: date32[day]
> d1: string
>
> See $metadata for additional Schema metadata
>
> > sapply(arrow:::as.data.frame.ArrowTabular(bb$NewScan()$Finish()$ToTable()) ,class)
> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d2
>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"
>           d1
>  "character"
> --------------------------------------------------
>
> --
> Andrew Piskorski <a...@piskorski.com>
