Hi,

As far as I am aware, there is no Spark or JVM setting that can make Spark
assume a different timezone during the initial load from Parquet, because
Parquet files store timestamps in UTC. The timezone conversion can be done
after the load, as I described before.
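
If it is useful, here is a minimal sketch of that post-load approach, looping
over every timestamp column rather than naming each one. The path is a
placeholder, and depending on which direction the values need to shift,
to_utc_timestamp may be the right call instead of from_utc_timestamp:

from pyspark.sql.functions import from_utc_timestamp

df = spark.read.parquet('parquet_file_path')  # placeholder path

# Apply the conversion to every timestamp column in the DataFrame;
# swap in to_utc_timestamp if the shift needs to go the other way
for col_name, col_type in df.dtypes:
    if col_type == 'timestamp':
        df = df.withColumn(col_name,
                           from_utc_timestamp(df[col_name], 'Pacific/Auckland'))

Wrapped in a small helper, the same loop can be applied to each table you load,
so the conversion does not have to be written out per column or per table.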

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 7 Sept 2023 at 01:42, Jack Goodson <jackagood...@gmail.com> wrote:

> Thanks Mich, sorry, I might have been a bit unclear in my original email.
> The timestamps are being loaded as, for example, 2003-11-24T09:02:32+0000,
> but I want them loaded as 2003-11-24T09:02:32+1300. I know how to do this
> with various transformations; however, I'm wondering if there are any Spark
> or JVM settings I can change so that it assumes +1300 on load instead of
> +0000 (as the time in the column is relative to NZ local time, not UTC). I
> inspected the Parquet column holding my created date with pyarrow, with the
> results shown below.
>
> I had a look here:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> It looks like I need isAdjustedToUTC=false (maybe?), but I am at a loss as
> to how to set it.
>
> <pyarrow._parquet.ColumnChunkMetaData object at 0xdeadbeef>
>   file_offset: 6019
>   file_path:
>   physical_type: INT96
>   num_values: 4
>   path_in_schema: created
>   is_stats_set: False
>   statistics:
>     None
>   compression: SNAPPY
>   encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
>   has_dictionary_page: False
>   dictionary_page_offset: None
>   data_page_offset: 6019
>   total_compressed_size: 90
>   total_uncompressed_size: 103
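>
> For reference, the metadata above came from something along these lines in
> pyarrow (the file path and column index are placeholders); note that the
> logical type, which is where isAdjustedToUTC would appear, only exists for
> annotated INT64 timestamps, not for INT96:
>
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile('parquet_file_path')
>
> # column chunk metadata for the first row group / first column ('created')
> print(pf.metadata.row_group(0).column(0))
>
> # schema-level column info, where a logical type annotation would show up
> print(pf.schema.column(0))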
>
> On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Jack,
>>
>> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>>
>> from pyspark.sql.functions import from_utc_timestamp
>>
>> You can read your Parquet file into a DataFrame:
>>
>> df = spark.read.parquet('parquet_file_path')
>>
>> # Convert timestamps (assuming 'timestamp' is your column name) from UTC
>> # to the Pacific/Auckland timezone
>> df_with_local_timezone = df.withColumn(
>>     'timestamp',
>>     from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
>> )
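>>
>> A quick way to check the result, assuming the column name above, is
>> something like:
>>
>> df_with_local_timezone.select('timestamp').show(truncate=False)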
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've got a number of tables that I'm loading in from a SQL Server. The
>>> timestamps in SQL Server are stored like 2003-11-24T09:02:32. I get these
>>> as Parquet files in our raw storage location and pick them up in
>>> Databricks. When I load the data in Databricks, Spark assumes UTC, or
>>> +0000, on the timestamp, like 2003-11-24T09:02:32+0000. The date and time
>>> are the same as what's in SQL Server; however, the offset is incorrect.
>>>
>>> I've tried various methods, like the code below, to set the JVM timezone
>>> to my local timezone, but when viewing the data it seems to just subtract
>>> the offset from the timestamp and add it to the offset part, like
>>> 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ daylight time has
>>> a +13 offset).
>>>
>>> import pyspark
>>>
>>> spark = pyspark.sql.SparkSession \
>>>     .builder \
>>>     .appName('test') \
>>>     .master('local') \
>>>     .config('spark.driver.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>     .config('spark.executor.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>>     .getOrCreate()
>>>
>>>
>>>
>>> I understand that in Parquet these are stored as Unix time and aren't
>>> timezone aware; however, are there any settings I can set in Spark that
>>> would implicitly convert/assume the timestamp from 2003-11-24T09:02:32+0000
>>> to 2003-11-24T09:02:32+1300? I know this can be done with transformations,
>>> but I'm trying to avoid writing transformations for every timestamp column
>>> across hundreds of tables.
>>>
>>> Any help much appreciated, thanks,
>>>
>>> Jack
>>>
>>>
>>>
>>>
