rguillome opened a new issue #5094:
URL: https://github.com/apache/hudi/issues/5094


   Steps to reproduce the behavior:
   
   1. From the Spark datasource, write Debezium-like records: they have an `Op` 
field indicating whether each record is an Insert, Update, or Delete.
   Assume that for some records, the `Op` value is `D` (delete)
   2. Then upsert the whole dataframe into a Hudi-managed table, with the 
`COPY_ON_WRITE` storage type
   All the records with `Op` = `D` are soft deleted (they still show up in 
queries, but all their columns are empty - except the Hudi metadata) 
   3. A user queries all the data in this table and counts the records
   4. The count can be wrong, since it includes the deleted records. To avoid 
this, the user has to filter on `Op` <> `D`
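
   The counting problem in the steps above can be sketched with a small plain-Python simulation (this models the observed behavior, not Hudi internals; the record layout and field names are made up for illustration):

```python
# Illustrative simulation of the soft-delete behavior described above.
# Not Hudi internals: record layout and field names are assumptions.

def upsert(table, records):
    """Upsert Debezium-like records into a dict keyed by record key.
    Records with Op == 'D' are 'soft deleted': the row remains, but the
    payload columns are nulled out (only key/metadata survive)."""
    for rec in records:
        key = rec["id"]
        if rec["Op"] == "D":
            # Soft delete: keep the row, null the payload column.
            table[key] = {"id": key, "Op": "D", "value": None}
        else:
            table[key] = dict(rec)
    return table

table = {}
upsert(table, [
    {"id": 1, "Op": "I", "value": "a"},
    {"id": 2, "Op": "I", "value": "b"},
    {"id": 2, "Op": "D", "value": "b"},  # delete record 2
])

naive_count = len(table)                  # counts the tombstone row too
filtered_count = sum(1 for r in table.values() if r["Op"] != "D")
print(naive_count, filtered_count)  # 2 1
```

   A naive `count(*)`-style query corresponds to `naive_count` here, which is off by one because the soft-deleted row still exists.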
   
   **Expected behavior**
   
   I would expect the user not to be bothered by any metadata columns. So I think 
we should be able to hard delete at the same time as the other upsert 
operations, to offer a coherent view to the end user.
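
   For reference, one existing mechanism that may already give hard deletes on upsert is Hudi's payload class: as far as I can tell, `org.apache.hudi.payload.AWSDmsAvroPayload` drops records whose `Op` field is `D` during the merge instead of writing a null-column tombstone. A minimal sketch of the write options (the table name, key field, and precombine field below are made-up examples):

```python
# Sketch only: the option keys are Hudi 0.9.0 write configs; the values
# "hudi_table", "id" and "ts" are made-up examples for this illustration.
hudi_options = {
    "hoodie.table.name": "hudi_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Payload class that (as far as I can tell) hard-deletes records
    # whose Op column is 'D' during the upsert merge:
    "hoodie.datasource.write.payload.class": "org.apache.hudi.payload.AWSDmsAvroPayload",
}

# With a SparkSession available, the write itself would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

   If that payload class covers this case, the remaining gap may just be documentation; otherwise a dedicated config or operation still seems needed.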
   
   I have already tried to dig in a little, asking for information 
[here](https://medium.com/@rguillome/hi-9aa98d3196e1) for example.
   My thought is that, philosophically, Hudi lets the user configure 
operations at the dataframe level (when using it with Spark). So it could be 
either a new configuration specifying that deletes must be hard when upserting, 
or maybe a new kind of operation, "UPSERT_WITH_DELETE".
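
   In the meantime, a workaround is to split the dataframe by `Op` and write it in two passes, once with `operation=upsert` and once with `operation=delete`. The splitting step, sketched in plain Python (only the `Op` field name comes from this issue; the rest is illustrative):

```python
# Split Debezium-like records into upserts and hard deletes, so they can be
# written in two passes (operation=upsert, then operation=delete).
# The field name "Op" comes from the issue; everything else is illustrative.

def split_by_op(records):
    upserts = [r for r in records if r["Op"] != "D"]
    deletes = [r for r in records if r["Op"] == "D"]
    return upserts, deletes

upserts, deletes = split_by_op([
    {"id": 1, "Op": "I"},
    {"id": 2, "Op": "U"},
    {"id": 3, "Op": "D"},
])
print(len(upserts), len(deletes))  # 2 1
```

   The downside is two commits instead of one, which is exactly why a single atomic "UPSERT_WITH_DELETE" would be nicer.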
   
   Regards,
   
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
