[ https://issues.apache.org/jira/browse/HUDI-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461443#comment-17461443 ]
sivabalan narayanan commented on HUDI-2971:
-------------------------------------------

Here is what we can do here. The non-row-writer path and the row-writer path emit different values for timestamp columns with a logical type, and that inconsistency is what needs to be fixed. We can't simply fix either one of them, because users on older releases who upgrade to the latest could see the key generator produce different values for the same record, silently turning upserts into inserts. So we have to be careful about how we fix the inconsistency.

Broadly, there are three kinds of Hudi users:
a. Those using the non-row-writer path exclusively.
b. Those using the row-writer path exclusively (immutable data).
c. Those using a mix of the non-row-writer and row-writer paths.

At a minimum, we want users in (a) and (b) to have a way to keep operating as they do today. For (c), their data may already be inconsistent, so we can give them a way to migrate.

With that in mind, here is the proposal. Introduce a new config named "hoodie.generate.consistent.timestamp.logical.for.key.generator" (we can debate the naming). When this flag is enabled, the row-writer and non-row-writer paths both generate consistent values; when it is disabled, we fall back to the existing (inconsistent) behavior. A usage sketch follows at the end of this comment. Users in (a) and (b) can choose to stay as-is by leaving the config off. For newer datasets, they can enable the flag and then use both operations (row writer and non-row writer) at will. Users in (c) can either recreate their dataset with the new config enabled or stay as-is; since their existing dataset might contain duplicates anyway, a migration is the better choice.

The only reason I don't want to enable the new config by default is that, if we flip the default, users in (a) or (b) would run into inconsistencies without knowing it.

[~ryanpife] [~wenningd] [~codope]: Open to hear your thoughts.
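For illustration, a minimal usage sketch of the proposal above. The flag name is the one proposed in this comment (not a released Hudi config); the other options are the standard Hudi datasource write options, the field names follow the schema in the report below, and the table name and path are placeholders:

{code:java}
// Hypothetical opt-in to consistent logical-timestamp handling in the key
// generator. The consistency flag below is only the name proposed in this
// comment; the remaining options are standard Hudi datasource options.
df.write.format("hudi").
  option("hoodie.table.name", "ts_tbl").                             // placeholder table name
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "partition").
  option("hoodie.datasource.write.precombine.field", "version").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.generate.consistent.timestamp.logical.for.key.generator", "true").
  mode("append").
  save("s3://bucket/path/ts_tbl")                                    // placeholder base path
{code}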
> Timestamp values being corrupted when using BULK INSERT with row writing enabled
> ---------------------------------------------------------------------------------
>
> Key: HUDI-2971
> URL: https://issues.apache.org/jira/browse/HUDI-2971
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Ryan Pifer
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 0.11.0
>
> We found that after performing bulk inserts with data that included timestamps, and then performing other write operations on the table, the timestamps of the records from the initial load were all corrupted. We narrowed this down to row writing being enabled, which uses the Spark DataSource V2 write path. In Hudi 0.9.0, row writing is enabled by default.
>
> Performing two inserts on a new table: `ts_ts` matches in both records (expected result)
> {code:java}
> scala> spark.read.format("hudi").load("s3://ryanpife-emr-dev/hudi/data/hudi090/timestamp/2/").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|              ts_ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> |     20211022233434|  20211022233434_0_1|               101|                  2019|0db6c29d-5291-4f7...|101|      1|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
> |     20211022233556|  20211022233556_0_1|               102|                  2019|0db6c29d-5291-4f7...|102|      2|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
> {code}
>
> Performing a bulk insert, then an insert: `ts_ts` does not match across records (corrupted result)
> {code:java}
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|               ts_ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> |     20211022232152|  20211022232152_0_1|               104|                  2019|dbdc2dd9-e870-4cf...|104|      4|     2019|2021-05-07 00:00:00|1970-01-19 18:05:...|
> |     20211022232441|  20211022232441_0_1|               105|                  2019|dbdc2dd9-e870-4cf...|105|      5|     2019|2021-05-07 00:00:00| 2021-05-07 00:00:00|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
> {code}
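> For reference, a minimal sketch of the write sequence described above (spark-shell; the option names are standard Hudi 0.9 datasource options, while the table name, base path, and sample rows are placeholders matching the schema shown). The corrupted value (1970-01-19 ...) is consistent with an epoch value being read back at the wrong precision, e.g. seconds interpreted as milliseconds, though that is only an inference from the output above:
> {code:java}
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions.col
>
> // Placeholder table name and base path.
> val basePath = "s3://bucket/path/ts_tbl"
> val hudiOpts = Map(
>   "hoodie.table.name" -> "ts_tbl",
>   "hoodie.datasource.write.recordkey.field" -> "id",
>   "hoodie.datasource.write.partitionpath.field" -> "partition",
>   "hoodie.datasource.write.precombine.field" -> "version")
>
> // Sample rows matching the reported schema, with a real timestamp column.
> val df1 = Seq((104, 4, "2019", "2021-05-07 00:00:00"))
>   .toDF("id", "version", "partition", "ts_string")
>   .withColumn("ts_ts", col("ts_string").cast("timestamp"))
> val df2 = Seq((105, 5, "2019", "2021-05-07 00:00:00"))
>   .toDF("id", "version", "partition", "ts_string")
>   .withColumn("ts_ts", col("ts_string").cast("timestamp"))
>
> // Step 1: bulk insert; with row writing enabled (the 0.9.0 default for
> // bulk_insert) this goes through the Spark row-writer path.
> df1.write.format("hudi").options(hudiOpts).
>   option("hoodie.datasource.write.operation", "bulk_insert").
>   option("hoodie.datasource.write.row.writer.enable", "true").
>   mode(SaveMode.Overwrite).save(basePath)
>
> // Step 2: a regular insert through the non-row-writer path; after this,
> // the ts_ts values written in step 1 read back corrupted.
> df2.write.format("hudi").options(hudiOpts).
>   option("hoodie.datasource.write.operation", "insert").
>   mode(SaveMode.Append).save(basePath)
> {code}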