Indeed, INT96 is deprecated in the Parquet format. There are other
issues with Spark (it places restrictions on table field names, for
example), so it may be worth adding an option like

pq.write_table(table, where, flavor='spark')

or maybe better

pq.write_table(table, where, flavor='spark-2.2')

and this would set the correct options for that version of Spark.

I created https://issues.apache.org/jira/browse/ARROW-1499 as a place
to discuss this further.

- Wes


On Fri, Sep 8, 2017 at 4:28 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Okay,
>
> So after some additional debugging, I can get around this if I set
>
> use_deprecated_int96_timestamps=True
>
> on the pq.write_table(arrow_table, filename, compression=compression,
> use_deprecated_int96_timestamps=True) call.
>
> But that just feels SO wrong... as I'm sure it's deprecated for a reason
> (i.e. this will bite me later, and badly).
>
>
> I also see this issue (or at least a related one) referenced in this Jeff
> Knupp blog post...
>
> https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record
>
> So shrug... any suggestions are greatly appreciated :)
>
> -Brian
>
> On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com>
> wrote:
>
>> Apologies if this isn't quite the right place to ask this question, but I
>> figured Wes/others might know right off the bat :)
>>
>>
>> Context:
>> - Mac OSX Laptop
>> - PySpark: 2.2.0
>> - PyArrow: 0.6.0
>> - Pandas: 0.19.2
>>
>> Issue Explanation:
>> - I'm converting my Pandas dataframe to a Parquet file with code very
>> similar to
>>        - http://wesmckinney.com/blog/python-parquet-update/
>> - My Pandas DataFrame has a datetime index:  http_df.index.dtype =
>> dtype('<M8[ns]')
>> - When loading the saved parquet file I get the error below
>> - If I remove that index everything works fine
>>
>> ERROR:
>> - Py4JJavaError: An error occurred while calling o34.parquet.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
>> in stage 0.0 (TID 0, localhost, executor driver):
>> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
>> (TIMESTAMP_MICROS);
>>
>> Full Code to reproduce:
>>  - https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
>>
>>
>> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
>>
>> -Brian
>>
