Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-09-05 Thread Bryan Cutler
Hi Lucas, The assessments from Wes and Li are right on. Just to add to that, and unfortunately make things even more complicated: Spark does not always use the config "spark.sql.session.timeZone", so it doesn't really help with your example. It would be used if instead you generated timestamps t

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-29 Thread Li Jin
Lucas, Wes' explanation is correct. If you are using Spark 2.2, you can set the Spark config "spark.sql.session.timeZone" to "UTC". I have written a document explaining this. I can clean it up for ARROW-1425. On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney wrote: > see https://issues.apache.or

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Wes McKinney
see https://issues.apache.org/jira/browse/ARROW-1425 On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney wrote: > hi Lucas, > > Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge > of the Spark timestamp issue (which is a known, and not a bug per se) > should be able to give some ext

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Wes McKinney
hi Lucas, Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge of the Spark timestamp issue (which is a known, and not a bug per se) should be able to give some extra context about this. My understanding is that when you read timezone-naive data in Spark, it is treated as session-

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Lucas Pickup
Here is the pyspark script I used to see this difference. On Mon, 28 Aug 2017 at 09:20 Lucas Pickup wrote: > Hi all, > > Very sorry if people already responded to this at: > lucas.pic...@microsoft.com There was an INVALID identifier attached to > the end of the reply address for some reason whic

RE: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-25 Thread Lucas Pickup
Quick follow-up. I'm trying to work around this myself in the meantime. The goal is to qualify the TimestampValue with a timezone (by creating a new column in the arrow table based off the previous one). If this can be done before the Values are converted to python it may fix the issue I was ha
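
To make the idea of "qualifying" a timestamp with a timezone concrete, here is a minimal sketch using only the Python standard library. It is not the workaround from the thread and does not use Arrow or Spark; the sample value and the fixed UTC-07:00 "session" offset are assumptions chosen purely for illustration. It shows that a timezone-naive value carries no offset, and that once it is declared to be UTC, the same instant prints with different wall-clock fields in another timezone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical naive value, standing in for a timestamp read from a Parquet file.
naive = datetime(2017, 8, 25, 12, 0, 0)

# Qualify it as UTC: the wall-clock fields stay the same, only tzinfo is attached.
as_utc = naive.replace(tzinfo=timezone.utc)

# Hypothetical session timezone (a fixed UTC-07:00 offset, for illustration only).
session_tz = timezone(timedelta(hours=-7))

# Render the same instant in the session timezone.
in_session = as_utc.astimezone(session_tz)

print(as_utc.isoformat())
print(in_session.isoformat())
```

The two printed values denote the same instant and differ only in how they are displayed, which is one way two readers of the same file could appear to disagree about a naive timestamp.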