Hi all, I've been messing around with Spark and PyArrow Parquet reading. In my testing I've found that a Parquet file written by Spark that contains a datetime column comes back with different datetimes depending on whether it's read by Spark or by PyArrow.
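For anyone who can't open the attachment, a minimal repro along these lines should show the same behavior. This is a sketch, not the attached script itself: the output path, session setup, and the use of pq.read_table on the Spark output directory are my assumptions (the machine is assumed to be in a UTC-7 local timezone).

import datetime

import pyarrow.parquet as pq
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("parquet-datetime-repro").getOrCreate()

path = "dates.parquet"  # hypothetical output path

# Write two naive (timezone-less) datetimes through Spark.
df = spark.createDataFrame([
    Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
    Row(Date=datetime.datetime(2015, 7, 5, 23, 30)),
])
df.write.mode("overwrite").parquet(path)

# Read back with Spark: the values round-trip unchanged.
print(spark.read.parquet(path).collect())

# Read the same file with PyArrow: the values come back shifted
# (UTC-normalized, 7 hours ahead of the local wall-clock values).
table = pq.read_table(path)
print(table.column("Date"))
print(table.to_pandas())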
The attached script demonstrates this. Output:

Spark, reading the Parquet file into a DataFrame:
[Row(Date=datetime.datetime(2015, 7, 5, 23, 50)), Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]

PyArrow table has dates as UTC (7 hours ahead):
<pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
[
  Timestamp('2015-07-06 06:50:00')
]

Pandas DataFrame from the PyArrow table has dates as UTC (7 hours ahead):
                 Date
0 2015-07-06 06:50:00
1 2015-07-06 06:30:00

I would've expected to end up with the same datetimes from both readers, since no timezone was attached at any point; it's just a date and time value. Am I missing anything here, or is this a bug?

Cheers,
Lucas Pickup