Indeed, INT96 is deprecated in the Parquet format. There are other
issues with Spark (it places restrictions on table field names, for
example), so it may be worth adding an option like

pq.write_table(table, where, flavor='spark')

or maybe better

pq.write_table(table, where, flavor='spark-2.2')

and this would set the correct options for that version of Spark.

I created https://issues.apache.org/jira/browse/ARROW-1499 as a place
to discuss this further.

- Wes


On Fri, Sep 8, 2017 at 4:28 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Okay,
>
> So after some additional debugging, I can get around this if I set
>
> use_deprecated_int96_timestamps=True
>
> on the pq.write_table(arrow_table, filename, compression=compression,
> use_deprecated_int96_timestamps=True) call.
>
> But that just feels SO wrong... as I'm sure it's deprecated for a reason
> (i.e. this will bite me later, and badly).
>
>
> I also see this issue (or at least a related one) referenced in this Jeff
> Knupp blog post...
>
> https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record
>
> So shrug... any suggestions are greatly appreciated :)
>
> -Brian
>
> On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com>
> wrote:
>
>> Apologies if this isn't quite the right place to ask this question, but I
>> figured Wes/others might know right off the bat :)
>>
>>
>> Context:
>> - Mac OSX Laptop
>> - PySpark: 2.2.0
>> - PyArrow: 0.6.0
>> - Pandas: 0.19.2
>>
>> Issue Explanation:
>> - I'm converting my Pandas dataframe to a Parquet file with code very
>> similar to
>>        - http://wesmckinney.com/blog/python-parquet-update/
>> - My Pandas DataFrame has a datetime index:  http_df.index.dtype =
>> dtype('<M8[ns]')
>> - When loading the saved parquet file I get the error below
>> - If I remove that index everything works fine
>>
>> ERROR:
>> - Py4JJavaError: An error occurred while calling o34.parquet.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
>> in stage 0.0 (TID 0, localhost, executor driver):
>> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
>> (TIMESTAMP_MICROS);
>>
>> Full Code to reproduce:
>>  - https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
>>
>>
>> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
>>
>> -Brian
>>
