Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-14 Thread Wes McKinney
The option pq.write_table(..., flavor='spark') made it into the 0.7.0 release. - Wes On Fri, Sep 8, 2017 at 6:28 PM, Julien Le Dem wrote: > The int96 deprecation is slowly bubbling up the stack. There are still > discussions in Spark on how to make the change. So for now even though it's > deprecated …
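For readers landing on this thread later, a minimal sketch of the option as it shipped, assuming a pandas DataFrame with a datetime column (file name and data are invented):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical frame with a timestamp column, the case this thread is about.
    df = pd.DataFrame({'when': pd.to_datetime(['2017-09-14 10:00:00'])})
    table = pa.Table.from_pandas(df)

    # flavor='spark' (pyarrow >= 0.7.0) applies Spark-friendly defaults,
    # including INT96 timestamps, so Spark 2.x can read the file.
    pq.write_table(table, 'example_spark.parquet', flavor='spark')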

Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Julien Le Dem
The int96 deprecation is slowly bubbling up the stack. There are still discussions in Spark on how to make the change. So for now, even though it's deprecated, it is still used in some places. This should get resolved in the near future. Julien > On Sep 8, 2017, at 14:12, Wes McKinney wrote: > …

Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Wes McKinney
Turning on int96 timestamps is the solution right now. To save yourself some typing, you could declare parquet_options = { 'compression': ..., 'use_deprecated_int96_timestamps': True } and call pq.write_table(..., **parquet_options). On Fri, Sep 8, 2017 at 5:08 PM, Brian Wylie wrote: > So, this is …
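Spelled out as a self-contained sketch; the table, the output path, and the 'snappy' codec below are placeholders standing in for the '...' in Wes's message:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'ts': pd.to_datetime(['2017-09-08 17:00:00'])})
    table = pa.Table.from_pandas(df)

    # Collect the writer options once so each call site stays short.
    parquet_options = {
        'compression': 'snappy',  # placeholder codec, pick whatever you use
        'use_deprecated_int96_timestamps': True,
    }
    pq.write_table(table, 'output.parquet', **parquet_options)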

Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Brian Wylie
So, this is certainly good for future versions of Arrow. Do you have any specific recommendations for a workaround currently? Saving a Parquet file with datetimes will obviously be a common use case, and if I'm understanding it correctly, right now a Parquet file saved with PyArrow will …

Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Wes McKinney
Indeed, INT96 is deprecated in the Parquet format. There are other issues with Spark (it places restrictions on table field names, for example), so it may be worth adding an option like pq.write_table(table, where, flavor='spark') or maybe better pq.write_table(table, where, flavor='spark-2.2') …

Re: spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Brian Wylie
Okay, so after some additional debugging, I can get around this if I set use_deprecated_int96_timestamps=True on the pq.write_table(arrow_table, filename, compression=compression, use_deprecated_int96_timestamps=True) call. But that just feels SO wrong, as I'm sure it's deprecated for a reason …
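For completeness, the workaround as a runnable sketch; the data, the 'snappy' codec, and the file names are invented stand-ins, and the PySpark round-trip at the end assumes a local Spark 2.x installation:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'event_time': pd.to_datetime(['2017-09-08 14:00:00'])})
    arrow_table = pa.Table.from_pandas(df)

    # Writing timestamps as INT96: deprecated in the Parquet format,
    # but what Spark 2.x expects, so the read below succeeds.
    pq.write_table(arrow_table, 'events.parquet',
                   compression='snappy',
                   use_deprecated_int96_timestamps=True)

    # Round-trip check in a local PySpark session.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master('local[1]').getOrCreate()
    spark.read.parquet('events.parquet').show()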

spark error when reading parquet file created via pandas/pyarrow

2017-09-08 Thread Brian Wylie
Apologies if this isn't quite the right place to ask this question, but I figured Wes/others might know right off the bat :)

Context:
- Mac OS X laptop
- PySpark: 2.2.0
- PyArrow: 0.6.0
- Pandas: 0.19.2

Issue Explanation:
- I'm converting my Pandas dataframe to a Parquet file with code very similar … (see the sketch below)
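A minimal sketch of what that conversion presumably looked like, with invented column names and data; with PyArrow 0.6.0 defaults, the datetime column is written in a form that Spark 2.2 rejected in this report:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Invented example data; the key ingredient is the datetime column.
    df = pd.DataFrame({
        'name': ['a', 'b'],
        'when': pd.to_datetime(['2017-09-01', '2017-09-02']),
    })

    # Default write: timestamps are NOT stored as INT96, which is what
    # trips up Spark 2.2 when it reads the file back.
    arrow_table = pa.Table.from_pandas(df)
    pq.write_table(arrow_table, 'example.parquet')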