rguillome opened a new issue #5094: URL: https://github.com/apache/hudi/issues/5094
**Steps to reproduce the behavior:**

1. From the Spark datasource, launch Debezium-like records: they have an `Op` field indicating whether each record is an insert, update, or delete. Assume that for some records the `Op` value is `D` (delete).
2. Upsert the whole dataframe into a Hudi-managed table with the `COPY_ON_WRITE` storage type. All the records with `Op` = `D` are soft deleted (they still come up in queries, but all their columns are empty except the Hudi metadata).
3. A user queries all the data in this table and counts the records.
4. The count can be wrong, since it includes the delete records. To avoid this, the user has to filter on `Op` <> `D`.

**Expected behavior**

I would expect the user not to be bothered by any metadata columns. So I think we should be able to hard delete at the same time as the other upsert operations, to offer a coherent view to the end user.

I have already tried to dig in a little, gathering information from [here](https://medium.com/@rguillome/hi-9aa98d3196e1) for example. My thought is that, philosophically, Hudi lets the user configure operations at the dataframe level (when using it with Spark). So it could be either a new configuration specifying that deletes must be hard when upserting, or maybe a new kind of operation, "UPSERT_WITH_DELETE".

Regards,

**Environment Description**

* Hudi version : 0.9.0
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
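For reference, a possible workaround today is to split the incoming dataframe and issue two Hudi writes: an `upsert` for the live rows and a `delete` for the `Op` = `D` rows. A minimal PySpark sketch, assuming hypothetical table name, record key, precombine field, and paths (only the `hoodie.*` option keys are real Hudi configs):

```python
# Sketch of a two-write workaround: hard-delete the Op = 'D' rows
# instead of upserting their tombstones. Table name, key/precombine
# fields, and S3 paths below are placeholders, not from the issue.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-upsert-with-delete").getOrCreate()

df = spark.read.json("s3://bucket/debezium-events/")  # incoming CDC batch

base_opts = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}

# 1) Upsert everything that is not a delete.
(df.filter(F.col("Op") != "D")
   .write.format("hudi")
   .options(**base_opts)
   .option("hoodie.datasource.write.operation", "upsert")
   .mode("append")
   .save("s3://bucket/hudi/my_table"))

# 2) Hard-delete the rows flagged as deletes.
(df.filter(F.col("Op") == "D")
   .write.format("hudi")
   .options(**base_opts)
   .option("hoodie.datasource.write.operation", "delete")
   .mode("append")
   .save("s3://bucket/hudi/my_table"))
```

Note this produces two separate commits, so a query running between them can still observe an inconsistent view, which is exactly the coherence problem a single `UPSERT_WITH_DELETE` operation would avoid.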
