Okay, so after some additional debugging, I can get around this if I set
use_deprecated_int96_timestamps=True on the write call:

    pq.write_table(arrow_table, filename, compression=compression,
                   use_deprecated_int96_timestamps=True)

But that just feels SO wrong... I'm sure it's deprecated for a reason
(i.e., this will bite me later, and badly). I also see this issue (or at
least a related one) referenced in this Jeff Knupp blog post:

https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record

So *shrug*... any suggestions are greatly appreciated :)

-Brian

On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Apologies if this isn't quite the right place to ask this question, but I
> figured Wes/others might know right off the bat :)
>
> Context:
> - Mac OSX laptop
> - PySpark: 2.2.0
> - PyArrow: 0.6.0
> - Pandas: 0.19.2
>
> Issue explanation:
> - I'm converting my Pandas DataFrame to a Parquet file with code very
>   similar to http://wesmckinney.com/blog/python-parquet-update/
> - My Pandas DataFrame has a datetime index: http_df.index.dtype =
>   dtype('<M8[ns]')
> - When loading the saved Parquet file in PySpark, I get the error below
> - If I remove that index, everything works fine
>
> ERROR:
> Py4JJavaError: An error occurred while calling o34.parquet.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
> in stage 0.0 (TID 0, localhost, executor driver):
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64
> (TIMESTAMP_MICROS);
>
> Full code to reproduce:
> https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet.ipynb
>
> Thanks in advance, also big fan of all this stuff... "be the chicken" :)
>
> -Brian
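
P.S. For anyone who wants to reproduce this without running the whole
notebook, here's a minimal sketch of what I'm doing. The column name and
data are made up, and `spark` is assumed to be an existing SparkSession:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A frame with a datetime64[ns] index, like http_df in the notebook.
    df = pd.DataFrame({'bytes': [100, 200, 300]},
                      index=pd.date_range('2017-09-08', periods=3, freq='S'))

    # pyarrow writes the index out as an INT64 (TIMESTAMP_MICROS) column...
    pq.write_table(pa.Table.from_pandas(df), 'repro.parquet')

    # ...which Spark 2.2 then refuses to read:
    # spark.read.parquet('repro.parquet')
    #   -> AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP_MICROS)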
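And the workaround in context, again just a sketch (the compression value
here is only an example):

    # Workaround: force the deprecated INT96 timestamp encoding, which
    # Spark's Parquet reader does understand.
    pq.write_table(pa.Table.from_pandas(df), 'repro.parquet',
                   compression='snappy',
                   use_deprecated_int96_timestamps=True)

I've also seen mention of a coerce_timestamps option in newer pyarrow
releases; coercing to millisecond timestamps on write might be a less
scary alternative, but I haven't verified that it exists in 0.6.0 or that
Spark 2.2 accepts TIMESTAMP_MILLIS, so treat that as speculation.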
use_deprecated_int96_timestamps=True on the pq.write_table(arrow_table, filename, compression=compression, use_deprecated_int96_timestamps=True) call. But that just feels SO wrong....as I'm sure it's deprecated for a reason (i.e. this will bite me later and badly) I also see this issue (or at least a related issue) reference in this Jeff Knupp blog... https://www.enigma.com/blog/moving-to-parquet-files-as-a-system-of-record So shrug... any suggestions are greatly appreciated :) -Brian On Fri, Sep 8, 2017 at 12:36 PM, Brian Wylie <briford.wy...@gmail.com> wrote: > Apologies if this isn't quite the right place to ask this question, but I > figured Wes/others might know right off the bat :) > > > Context: > - Mac OSX Laptop > - PySpark: 2.2.0 > - PyArrow: 0.6.0 > - Pandas: 0.19.2 > > Issue Explanation: > - I'm converting my Pandas dataframe to a Parquet file with code very > similar to > - http://wesmckinney.com/blog/python-parquet-update/ > - My Pandas DataFrame has a datetime index: http_df.index.dtype = > dtype('<M8[ns]') > - When loading the saved parquet file I get the error below > - If I remove that index everything works fine > > ERROR: > - Py4JJavaError: An error occurred while calling o34.parquet. > : org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 > in stage 0.0 (TID 0, localhost, executor driver): > org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 > (TIMESTAMP_MICROS); > > Full Code to reproduce: > - https://github.com/Kitware/bat/blob/master/notebooks/Bro_ > to_Parquet.ipynb > > > Thanks in advance, also big fan of all this stuff... "be the chicken" :) > > -Brian > > > >