Okay, so after some additional debugging, I can get around this if I set
use_deprecated_int96_timestamps=True on the write call:

    pq.write_table(arrow_table, filename, compression=compression,
                   use_deprecated_int96_timestamps=True)

But that just feels SO wrong... I'm sure it's deprecated for a reason
(i.e., this will bite me later, and badly). I also see this issue (or at
least a related one) referenced in this Jeff Knupp blog post:

https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record

So *shrug*... any suggestions are greatly appreciated :)

-Brian

On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Apologies if this isn't quite the right place to ask this question, but I
> figured Wes/others might know right off the bat :)
>
> Context:
> - Mac OSX laptop
> - PySpark: 2.2.0
> - PyArrow: 0.6.0
> - Pandas: 0.19.2
>
> Issue explanation:
> - I'm converting my Pandas DataFrame to a Parquet file with code very
>   similar to http://wesmckinney.com/blog/python-parquet-update/
> - My Pandas DataFrame has a datetime index: http_df.index.dtype =
>   dtype('<M8[ns]')
> - When loading the saved Parquet file in PySpark, I get the error below
> - If I remove that index, everything works fine
>
> ERROR:
> Py4JJavaError: An error occurred while calling o34.parquet.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
> in stage 0.0 (TID 0, localhost, executor driver):
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
> (TIMESTAMP_MICROS);
>
> Full code to reproduce:
> https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
>
> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
>
> -Brian
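
P.S. For anyone who wants to reproduce this without running the whole
notebook, here's a minimal sketch of what I'm doing. The column name and
data are made up, and `spark` is assumed to be an existing SparkSession:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A frame with a datetime64[ns] index, like http_df in the notebook.
    df = pd.DataFrame({'bytes': [100, 200, 300]},
                      index=pd.date_range('2017-09-08', periods=3, freq='S'))

    # pyarrow writes the index out as an INT64 (TIMESTAMP_MICROS) column...
    pq.write_table(pa.Table.from_pandas(df), 'repro.parquet')

    # ...which Spark 2.2 then refuses to read:
    # spark.read.parquet('repro.parquet')
    #   -> AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP_MICROS)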
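And the workaround in context, again just a sketch (the compression value
here is only an example):

    # Workaround: force the deprecated INT96 timestamp encoding, which
    # Spark's Parquet reader does understand.
    pq.write_table(pa.Table.from_pandas(df), 'repro.parquet',
                   compression='snappy',
                   use_deprecated_int96_timestamps=True)

I've also seen mention of a coerce_timestamps option in newer pyarrow
releases; coercing to millisecond timestamps on write might be a less
scary alternative, but I haven't verified that it exists in 0.6.0 or that
Spark 2.2 accepts TIMESTAMP_MILLIS, so treat that as speculation.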
use_deprecated_int96_timestamps=True on the pq.write_table(arrow_table, filename, compression=compression, use_deprecated_int96_timestamps=True) call. But that just feels SO wrong....as I'm sure it's deprecated for a reason (i.e. this will bite me later and badly) I also see this issue (or at least a related issue) reference in this Jeff Knupp blog... https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record So shrug... any suggestions are greatly appreciated :) -Brian On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com> wrote: > Apologies if this isn't quite the right place to ask this question, but I > figured Wes/others might know right off the bat :) > > > Context: > - Mac OSX Laptop > - PySpark: 2.2.0 > - PyArrow: 0.6.0 > - Pandas: 0.19.2 > > Issue Explanation: > - I'm converting my Pandas dataframe to a Parquet file with code very > similar to > - http://wesmckinney.com/blog/python-parquet-update/ > - My Pandas DataFrame has a datetime index: http_df.index.dtype = > dtype('<M8[ns]') > - When loading the saved parquet file I get the error below > - If I remove that index everything works fine > > ERROR: > - Py4JJavaError: An error occurred while calling o34.parquet. > : org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 > in stage 0.0 (TID 0, localhost, executor driver): > org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 > (TIMESTAMP_MICROS); > > Full Code to reproduce: > - https://github.com/Kitware/bat/blob/master/notebooks/Bro_ > to_Parquet.ipynb > > > Thanks in advance, also big fan of all this stuff... "be the chicken" :) > > -Brian > > > >