[ https://issues.apache.org/jira/browse/HUDI-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461443#comment-17461443 ]
sivabalan narayanan commented on HUDI-2971:
-------------------------------------------

Here is what we can do here. The non-row-writer path and the row-writer path emit different values for timestamp columns with a logical type, and that inconsistency is what needs to be fixed. We can't simply fix either one of them, because users on older releases who upgrade to the latest could see the key generator produce different values for the same record, silently turning upserts into inserts. So we have to be careful about how we fix the inconsistency.

Broadly, there are three kinds of Hudi users:
a. Those using the non-row-writer path exclusively.
b. Those using the row-writer path exclusively (immutable data).
c. Those using a mix of the non-row-writer and row-writer paths.

At a minimum, we want users in (a) and (b) to have a way to keep operating as they do today. For (c), their data may already be inconsistent, so we can give them a way to migrate.

With that in mind, here is the proposal. Introduce a new config named "hoodie.generate.consistent.timestamp.logical.for.key.generator" (we can debate the naming). When this flag is enabled, the row-writer and non-row-writer paths both generate consistent values; when it is disabled, we fall back to the existing (inconsistent) behavior. A usage sketch follows at the end of this comment. Users in (a) and (b) can choose to stay as-is by leaving the config off. For newer datasets, they can enable the flag and then use both operations (row writer and non-row writer) at will. Users in (c) can either recreate their dataset with the new config enabled or stay as-is; since their existing dataset might contain duplicates anyway, a migration is the better choice.

The only reason I don't want to enable the new config by default is that, if we flip the default, users in (a) or (b) would run into inconsistencies without knowing it.

[~ryanpife] [~wenningd] [~codope]: Open to hear your thoughts.
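For illustration, a minimal usage sketch of the proposal above. The flag name is the one proposed in this comment (not a released Hudi config); the other options are the standard Hudi datasource write options, the field names follow the schema in the report below, and the table name and path are placeholders:

{code:java}
// Hypothetical opt-in to consistent logical-timestamp handling in the key
// generator. The consistency flag below is only the name proposed in this
// comment; the remaining options are standard Hudi datasource options.
df.write.format("hudi").
  option("hoodie.table.name", "ts_tbl").                             // placeholder table name
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "partition").
  option("hoodie.datasource.write.precombine.field", "version").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.generate.consistent.timestamp.logical.for.key.generator", "true").
  mode("append").
  save("s3://bucket/path/ts_tbl")                                    // placeholder base path
{code}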
> Timestamp values being corrupted when using BULK INSERT with row writing enabled
> ---------------------------------------------------------------------------------
>
> Key: HUDI-2971
> URL: https://issues.apache.org/jira/browse/HUDI-2971
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Ryan Pifer
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 0.11.0
>
> We found that after performing bulk inserts with data that included timestamps, and then performing other write operations on the table, the timestamps of the records from the initial load were all corrupted. We narrowed this down to row writing being enabled, which uses the Spark DataSource V2 write path. In Hudi 0.9.0, row writing is enabled by default.
>
> Performing two inserts on a new table: `ts_ts` matches in both records (expected result)
> {code:java}
> scala> spark.read.format("hudi").load("s3://ryanpife-emr-dev/hudi/data/hudi090/timestamp/2/").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|              ts_ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> |     20211022233434|  20211022233434_0_1|               101|                  2019|0db6c29d-5291-4f7...|101|      1|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
> |     20211022233556|  20211022233556_0_1|               102|                  2019|0db6c29d-5291-4f7...|102|      2|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> {code}
>
> Performing a bulk insert, then an insert: `ts_ts` does not match across records (corrupted result)
> {code:java}
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|               ts_ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> |     20211022232152|  20211022232152_0_1|               104|                  2019|dbdc2dd9-e870-4cf...|104|      4|     2019|2021-05-07 00:00:00|1970-01-19 18:05:...|
> |     20211022232441|  20211022232441_0_1|               105|                  2019|dbdc2dd9-e870-4cf...|105|      5|     2019|2021-05-07 00:00:00| 2021-05-07 00:00:00|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> {code}
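> For reference, a minimal sketch of the write sequence described above (spark-shell; the option names are standard Hudi 0.9 datasource options, while the table name, base path, and sample rows are placeholders matching the schema shown). The corrupted value (1970-01-19 ...) is consistent with an epoch value being read back at the wrong precision, e.g. seconds interpreted as milliseconds, though that is only an inference from the output above:
> {code:java}
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions.col
>
> // Placeholder table name and base path.
> val basePath = "s3://bucket/path/ts_tbl"
> val hudiOpts = Map(
>   "hoodie.table.name" -> "ts_tbl",
>   "hoodie.datasource.write.recordkey.field" -> "id",
>   "hoodie.datasource.write.partitionpath.field" -> "partition",
>   "hoodie.datasource.write.precombine.field" -> "version")
>
> // Sample rows matching the reported schema, with a real timestamp column.
> val df1 = Seq((104, 4, "2019", "2021-05-07 00:00:00"))
>   .toDF("id", "version", "partition", "ts_string")
>   .withColumn("ts_ts", col("ts_string").cast("timestamp"))
> val df2 = Seq((105, 5, "2019", "2021-05-07 00:00:00"))
>   .toDF("id", "version", "partition", "ts_string")
>   .withColumn("ts_ts", col("ts_string").cast("timestamp"))
>
> // Step 1: bulk insert; with row writing enabled (the 0.9.0 default for
> // bulk_insert) this goes through the Spark row-writer path.
> df1.write.format("hudi").options(hudiOpts).
>   option("hoodie.datasource.write.operation", "bulk_insert").
>   option("hoodie.datasource.write.row.writer.enable", "true").
>   mode(SaveMode.Overwrite).save(basePath)
>
> // Step 2: a regular insert through the non-row-writer path; after this,
> // the ts_ts values written in step 1 read back corrupted.
> df2.write.format("hudi").options(hudiOpts).
>   option("hoodie.datasource.write.operation", "insert").
>   mode(SaveMode.Append).save(basePath)
> {code}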