Hi Lucas,

Can you open a JIRA with this information on
https://issues.apache.org/jira so we can investigate and resolve there
in case a patch is required?

No need to post a reply-to e-mail address -- development discussions
should stay on public channels like the mailing list or JIRA.

Thanks
Wes

On Wed, Aug 30, 2017 at 1:50 PM, Lucas Pickup
<[email protected]> wrote:
> Please reply to: [email protected]
> Outlook isn't playing nice.
>
> Apologies, Lucas Pickup
>
> -----Original Message-----
> From: Lucas Pickup [mailto:[email protected]]
> Sent: Wednesday, August 30, 2017 10:47 AM
> To: [email protected]
> Subject: PyArrow not retaining Parquet metadata
>
> Hi All,
>
> I've encountered an issue where PyArrow does not appear to be propagating
> datetime metadata from Parquet files into the resulting Python objects.
>
> λ python
> Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import pyarrow as pa
>>>> import pyarrow.parquet as pq
>>>> import pytz
>>>> import pandas
>>>> from datetime import datetime
>>>>
>>>> d1 = datetime.strptime('2015-07-05 23:50:00', '%Y-%m-%d %H:%M:%S')
>>>> d1
> datetime.datetime(2015, 7, 5, 23, 50)
>>>> aware = pytz.utc.localize(d1)
>>>> aware
> datetime.datetime(2015, 7, 5, 23, 50, tzinfo=<UTC>)
>>>>
>>>> df = pandas.DataFrame()
>>>> df['DateNaive'] = [d1]
>>>> df['DateAware'] = [aware]
>>>> df
>             DateNaive                 DateAware
> 0 2015-07-05 23:50:00 2015-07-05 23:50:00+00:00
>>>>
>>>> table  = pa.Table.from_pandas(df)
>>>> table
> pyarrow.Table
> DateNaive: timestamp[ns]
> DateAware: timestamp[ns, tz=UTC]
> __index_level_0__: int64
> -- metadata --
> pandas: {"pandas_version": "0.20.3", "columns": [{"name": "DateNaive", 
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, 
> {"name": "DateAware", "pandas_type": "datetimetz", "numpy_type": 
> "datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}], "index_columns": 
> ["__index_level_0__"]}
>>>>
>>>> pq.write_table(table, "E:\\pyarrowDates.parquet")
>>>>
>>>> pyarrowTable = pq.read_table("E:\\pyarrowDates.parquet")
>>>> pyarrowTable
> pyarrow.Table
> DateNaive: timestamp[us]
> DateAware: timestamp[us]
> __index_level_0__: int64
> -- metadata --
> pandas: {"pandas_version": "0.20.3", "columns": [{"name": "DateNaive", 
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, 
> {"name": "DateAware", "pandas_type": "datetimetz", "numpy_type": 
> "datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}], "index_columns": 
> ["__index_level_0__"]}
>>>>
>>>> pyarrowDF = pyarrowTable.to_pandas()
>>>> pyarrowDF
>             DateNaive           DateAware
> 0 2015-07-05 23:50:00 2015-07-05 23:50:00
>>>>
>
> This was on PyArrow 0.6.0.
>
> Cheers, Lucas Pickup
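
One observation on the session above: the "pandas" metadata still records
{"timezone": "UTC"} for DateAware after the file is read back, even though
the column type has become timestamp[us]. As a rough sketch (not a fix, and
not part of PyArrow), the timezone-aware columns could be recovered from that
JSON string using only the standard library. The helper name
timezone_columns is hypothetical, and the metadata string is copied from the
output above rather than read from a file:

```python
import json

# "pandas" schema metadata as printed in the session above (copied by hand).
PANDAS_METADATA = '''{"pandas_version": "0.20.3", "columns": [
 {"name": "DateNaive", "pandas_type": "datetime",
  "numpy_type": "datetime64[ns]", "metadata": null},
 {"name": "DateAware", "pandas_type": "datetimetz",
  "numpy_type": "datetime64[ns, UTC]", "metadata": {"timezone": "UTC"}}],
 "index_columns": ["__index_level_0__"]}'''


def timezone_columns(pandas_metadata):
    """Map column name -> timezone for columns recorded as tz-aware.

    Hypothetical helper for illustration; it only inspects the JSON text.
    """
    spec = json.loads(pandas_metadata)
    result = {}
    for column in spec.get("columns", []):
        # "metadata" is null for naive columns, so fall back to an empty dict.
        extra = column.get("metadata") or {}
        if column.get("pandas_type") == "datetimetz" and "timezone" in extra:
            result[column["name"]] = extra["timezone"]
    return result


print(timezone_columns(PANDAS_METADATA))
```

Assuming the values read back are UTC instants, each listed column could
then be re-localized in pandas as a stopgap, for example with
pyarrowDF[name].dt.tz_localize('UTC').dt.tz_convert(tz).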
