Hi Lucas,

Can you open a JIRA with this information on https://issues.apache.org/jira so we can investigate and resolve it there in case a patch is required?
No need to post a reply-to e-mail address -- development discussions should stay on public channels like the mailing list or JIRA.

Thanks
Wes

On Wed, Aug 30, 2017 at 1:50 PM, Lucas Pickup <[email protected]> wrote:
> Please reply to: [email protected]
> Outlook isn't playing nice.
>
> Apologies, Lucas Pickup
>
> -----Original Message-----
> From: Lucas Pickup [mailto:[email protected]]
> Sent: Wednesday, August 30, 2017 10:47 AM
> To: [email protected]
> Subject: PyArrow not retaining Parquet metadata
>
> Hi All,
>
> I've encountered an issue where PyArrow does not appear to be propagating
> datetime metadata from parquet files into the resulting python objects.
>
> λ python
> Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13)
> [MSC v.1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import pytz
> >>> import pandas
> >>> from datetime import datetime
> >>>
> >>> d1 = datetime.strptime('2015-07-05 23:50:00', '%Y-%m-%d %H:%M:%S')
> >>> d1
> datetime.datetime(2015, 7, 5, 23, 50)
> >>> aware = pytz.utc.localize(d1)
> >>> aware
> datetime.datetime(2015, 7, 5, 23, 50, tzinfo=<UTC>)
> >>>
> >>> df = pandas.DataFrame()
> >>> df['DateNaive'] = [d1]
> >>> df['DateAware'] = [aware]
> >>> df
>             DateNaive                 DateAware
> 0 2015-07-05 23:50:00 2015-07-05 23:50:00+00:00
> >>>
> >>> table = pa.Table.from_pandas(df)
> >>> table
> pyarrow.Table
> DateNaive: timestamp[ns]
> DateAware: timestamp[ns, tz=UTC]
> __index_level_0__: int64
> -- metadata --
> pandas: {"pandas_version": "0.20.3", "columns": [{"name": "DateNaive",
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null},
> {"name": "DateAware", "pandas_type": "datetimetz", "numpy_type":
> "datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}], "index_columns":
> ["__index_level_0__"]}
> >>>
> >>> pq.write_table(table, "E:\\pyarrowDates.parquet")
> >>>
> >>> pyarrowTable = pq.read_table("E:\\pyarrowDates.parquet")
> >>> pyarrowTable
> pyarrow.Table
> DateNaive: timestamp[us]
> DateAware: timestamp[us]
> __index_level_0__: int64
> -- metadata --
> pandas: {"pandas_version": "0.20.3", "columns": [{"name": "DateNaive",
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null},
> {"name": "DateAware", "pandas_type": "datetimetz", "numpy_type":
> "datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}], "index_columns":
> ["__index_level_0__"]}
> >>>
> >>> pyarrowDF = pyarrowTable.to_pandas()
> >>> pyarrowDF
>             DateNaive           DateAware
> 0 2015-07-05 23:50:00 2015-07-05 23:50:00
> >>>
>
> This was on PyArrow 0.6.0.
>
> Cheers, Lucas Pickup
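
For reference while the issue is open: in the session above, the "pandas" schema metadata read back from the file still records DateAware as "datetimetz" with timezone "UTC", even though the column type comes back as timestamp[us]. A minimal sketch of how that metadata could be inspected to find which columns were originally timezone-aware follows. It uses only the standard library; the helper name timezone_columns is hypothetical, and the JSON string is copied from the session rather than read from a file.

```python
import json

# The "pandas" metadata entry exactly as printed in the session above.
PANDAS_METADATA = (
    '{"pandas_version": "0.20.3", "columns": [{"name": "DateNaive", '
    '"pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, '
    '{"name": "DateAware", "pandas_type": "datetimetz", "numpy_type": '
    '"datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}], "index_columns": '
    '["__index_level_0__"]}'
)


def timezone_columns(pandas_metadata_json):
    """Return {column name: timezone} for columns recorded as tz-aware.

    Hypothetical helper for illustration; not part of PyArrow or pandas.
    """
    meta = json.loads(pandas_metadata_json)
    result = {}
    for col in meta.get("columns", []):
        if col.get("pandas_type") != "datetimetz":
            continue
        # "metadata" is null for naive columns, so guard against None.
        tz = (col.get("metadata") or {}).get("timezone")
        if tz is not None:
            result[col["name"]] = tz
    return result


print(timezone_columns(PANDAS_METADATA))
```

With such a mapping in hand, something like df[name].dt.tz_localize("UTC").dt.tz_convert(tz) could re-attach the zone on the pandas side. That assumes the values read back are UTC-normalized, which should be verified before relying on it; this is a stopgap sketch, not a fix for the behaviour reported above.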
