Hi, using the R arrow package version 14.0.2.1, I'm stumped by
something seemingly simple.  For date columns, I like to use R's Date
class, which is stored internally as a number but prints as a
YYYY-MM-DD string.

In most cases arrow handles these Date columns nicely.  The exception
is when I partition on a Date column, as in column "d1" in my example
below.  When I read my data back in with open_dataset(), the d1 column
is now a string instead of Date.  In contrast, the types of all the
other columns are preserved, including my "d2" Date column, because I
did not partition on that one.

It sort of makes sense that d1 is now a string, because the directory
names on disk really are strings like "2024-01-01".  But I'd really
like to convert it back to the Date class format!  In plain R that's
easy, but with the Dataset mmap-ed on disk, I don't know how to do it.

What should I do to get arrow to convert the partitioned d1 column to
Arrow's date32[day] type, and thus back to R's Date class?  Can I
somehow do this directly on the Dataset object itself, WITHOUT first
converting it to ArrowTabular or data.frame?

Thanks for your help!


Example follows:
--------------------------------------------------
require("arrow")
my.dir <- "/tmp/arrow"
# Example data with some Date-class columns:
aa <- do.call("rbind" ,lapply(split(iris ,iris$Species) ,function(xx){
   cbind(head(xx ,5)
        ,d1=(as.Date('2024-01-01') + 0:4)
        ,d2=(as.Date('1980-01-01') + 0:4))
})); rownames(aa) <- NULL
arrow::write_dataset(aa ,my.dir ,partitioning=c('d1') ,hive_style=FALSE 
,format="feather" ,codec=Codec$create("LZ4_FRAME"))
bb <- arrow::open_dataset(my.dir ,format="feather" ,unify_schemas=TRUE 
,partitioning=c('d1'))
# Unfortunately the "d1" column is now a string.

> dim(aa)
[1] 15  7
> class(aa)
[1] "data.frame"

> sapply(aa ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"       "Date"
> sapply(aa ,storage.mode)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
    "double"     "double"     "double"     "double"    "integer"     "double"     "double"

> dim(bb)
[1] 15  7
> class(bb)
[1] "FileSystemDataset" "Dataset"           "ArrowObject"       "R6"

> bb$schema$d1
Field
d1: string
> bb$schema$d2
Field
d2: date32[day]

> bb
FileSystemDataset with 5 Feather files
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>
d2: date32[day]
d1: string

See $metadata for additional Schema metadata

> sapply(arrow:::as.data.frame.ArrowTabular(bb$NewScan()$Finish()$ToTable()) ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d2           d1
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"  "character"
--------------------------------------------------
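
One direction that might be worth trying, sketched here under assumptions rather than confirmed against 14.0.2.1: the `partitioning` argument of open_dataset() is documented as accepting a Schema as well as a character vector of names, so the partition field's type could be declared up front instead of being inferred as string. The names `demo.dir`, `df`, and `ds` below are hypothetical, and the sketch writes to a scratch directory rather than the /tmp/arrow used above.

```r
library(arrow)

# Hypothetical scratch directory, so the example does not touch /tmp/arrow.
demo.dir <- file.path(tempdir(), "arrow-date-partition")
unlink(demo.dir, recursive = TRUE)

# Small stand-in for the data above: one payload column, one Date column.
df <- data.frame(
  x  = 1:4,
  d1 = as.Date("2024-01-01") + c(0, 0, 1, 1)
)

# Same layout as in the example: non-hive directory names such as "2024-01-01".
write_dataset(df, demo.dir, partitioning = "d1",
              hive_style = FALSE, format = "feather")

# Assumption: giving the partition field an explicit type via a Schema makes
# arrow parse the directory names as dates instead of leaving them as strings.
ds <- open_dataset(demo.dir, format = "feather",
                   partitioning = schema(d1 = date32()))

# Inspect the resulting field type on the Dataset itself (no conversion to a
# Table or data.frame is needed for this check).
print(ds$schema$d1)
```

If this behaves as hoped, the type is fixed when the Dataset is opened, so nothing has to be materialized as an ArrowTabular or data.frame first.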

-- 
Andrew Piskorski <a...@piskorski.com>
