n3nash commented on issue #2995:
URL: https://github.com/apache/hudi/issues/2995#issuecomment-851715268


   @jtmzheng Thanks for the very detailed information; it helps in understanding 
the problem. Let me answer some of your questions inline before attempting to 
debug the underlying root cause of the duplicates. 
   
   > **Expected behavior**
   > 
   > Same behavior as Hudi 0.6 but now using the metadata table to track 
files/partitions. Happy to provide whatever info I can.
   > 
   > Questions:
   > 
   > 1. What is causing these duplicates to occur? Since no errors happened as 
far as I can tell, what info can I look at to debug/RCA? I’ve verified there 
are no duplicates (ie. checked some partitions) on 0.6 dataset.
   Can you check whether the files in the dataset are the same as those in the 
metadata table? This would require you to:
   
   1. Perform a listing on the entire dataset and get the unique files
   2. Read the metadata table and get the list of files from there
   3. Diff the two results to see if there are files in your dataset that are 
not present in the metadata table. 
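   The diff in step 3 is just a set comparison. Here is a minimal sketch, 
assuming you have already collected the two listings as plain path lists (e.g. 
from an S3 listing and a metadata table read; the example paths are made up):

```python
def diff_file_listings(dataset_files, metadata_files):
    """Compare a direct storage listing against the metadata table's listing.

    Returns (missing_from_metadata, extra_in_metadata) as sorted lists.
    """
    dataset_set = set(dataset_files)
    metadata_set = set(metadata_files)
    return (sorted(dataset_set - metadata_set),
            sorted(metadata_set - dataset_set))

# Hypothetical example: one file exists on storage but not in the metadata table.
missing, extra = diff_file_listings(
    ["2021/05/01/a.parquet", "2021/05/01/b.parquet"],
    ["2021/05/01/a.parquet"],
)
print(missing)  # → ['2021/05/01/b.parquet']
print(extra)    # → []
```

   Any paths in `missing` are candidates for the root cause: writers going 
through the metadata table would not "see" those files and could produce 
duplicates.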
   
   Additionally, if you have the logs from the first application run where you 
set the `hoodie.metadata.enable` flag to true, can you grep for the following 
log lines:
   
   `Creating a new metadata table in`
   `Initializing metadata table by using file listings in`
   `files to metadata`
   
   Please share the grep output for the lines above. 
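   If grepping from Python is more convenient than shell `grep`, a minimal 
equivalent (the sample log line is made up for illustration):

```python
# The three initialization messages of interest, quoted from above.
PATTERNS = (
    "Creating a new metadata table in",
    "Initializing metadata table by using file listings in",
    "files to metadata",
)

def find_metadata_log_lines(lines):
    """Return every log line containing one of the patterns of interest."""
    return [line for line in lines if any(p in line for p in PATTERNS)]

sample_log = [
    "21/05/31 12:00:01 INFO ...: Creating a new metadata table in s3://bucket/table/.hoodie/metadata",
    "21/05/31 12:00:02 INFO ...: some unrelated line",
]
print(find_metadata_log_lines(sample_log))  # matches only the first line
```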
   
   > 2. How can the metadata table be inspected? I can’t tell from 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements
   
   You can inspect it with the `MetadataCommand` in the Hudi CLI. You can read 
about how to initialize the CLI here -> 
https://hudi.apache.org/docs/deployment.html#cli
   
   > 3. Should `hoodie.metadata.validate` be enabled? My understanding is this 
is a “dry run” config where S3 file listing will still happen as before while 
also updating the metadata table
   
   That is correct. In this scenario, enabling it before the duplicates 
happened would have helped, but enabling it after the corruption has occurred 
does not help. At a high level, you don't need to enable it unless this issue 
is fully reproducible. 
   
   > 4. How do we recover when duplicates occur? I see “records deduplicate” is 
suggested in https://hudi.apache.org/docs/deployment.html#duplicates (NB: seems 
like this should be “repair deduplicate”?), do we need to turn off ingestion 
first and then run over every affected partition?
   
   Yes. You will need to turn off ingestion first before deduping. 
Unfortunately, this command is pretty old and has not been maintained, since we 
don't typically see duplicate issues. You might need to re-bootstrap your 
dataset to recover, or start the shadow pipeline fresh. 
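   For context on what deduping means here, a minimal in-memory sketch: for 
each record key, keep only the row with the latest commit time. (Field names 
are hypothetical; the actual `repair deduplicate` CLI command operates on Hudi 
files in a partition, not on rows like this.)

```python
def dedupe_records(records):
    """Keep only the latest version of each record key, by commit time."""
    latest = {}
    for rec in records:
        key = rec["record_key"]
        if key not in latest or rec["commit_time"] > latest[key]["commit_time"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["record_key"])

rows = [
    {"record_key": "id-1", "commit_time": "20210531120000", "value": "stale"},
    {"record_key": "id-1", "commit_time": "20210531130000", "value": "latest"},
    {"record_key": "id-2", "commit_time": "20210531120000", "value": "only"},
]
# Two rows survive: the newer id-1 and the single id-2.
print(dedupe_records(rows))
```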
   
   > 5. How do we recover if the metadata table is corrupted? Should we delete 
the existing metadata table from the CLI and recreate? Is this safe to do?
   
   Yes. You can delete the existing metadata table using the CLI or manually, 
but you need to stop ingestion while you do so; it is safe to do this. If 
there was a corruption, you should then disable `hoodie.metadata.enable`, 
since the code may have had a bug. After this, you can resume your pipeline. 
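   Deleting it manually amounts to removing the metadata table directory under 
the table's `.hoodie` path. A hedged sketch (the `.hoodie/metadata` layout is 
an assumption about the 0.7.x convention; verify it against your table, and 
only run this with ingestion stopped):

```python
import shutil
import tempfile
from pathlib import Path

def delete_metadata_table(base_path):
    """Remove <base_path>/.hoodie/metadata if it exists; return whether it did."""
    metadata_dir = Path(base_path) / ".hoodie" / "metadata"
    if metadata_dir.exists():
        shutil.rmtree(metadata_dir)
        return True
    return False

# Demo on a scratch directory standing in for the table base path.
base = tempfile.mkdtemp()
(Path(base) / ".hoodie" / "metadata").mkdir(parents=True)
deleted_first = delete_metadata_table(base)   # True: directory removed
deleted_again = delete_metadata_table(base)   # False: already gone
```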
   
   > 6. What upgrade path is suggested from 0.6 to 0.7 with metadata table 
enabled? Should the metadata table be created from the CLI pre-ingestion and 
then starting up the consumer after?
   
   As long as you don't have concurrently running jobs, simply turning on the 
metadata table before the next ingestion run suffices; there is no need to 
stop the job. 
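   With the Spark DataFrame writer, for example, the upgrade is just adding 
the flag to your existing write options (a sketch; only the metadata-related 
options are shown, and the table name value is a placeholder):

```python
# Writer options for the first ingestion run after upgrading to 0.7.
hudi_options = {
    "hoodie.table.name": "my_table",       # placeholder; keep your existing options
    "hoodie.metadata.enable": "true",      # turn on the metadata table
    # "hoodie.metadata.validate": "true",  # optional dry-run validation (see Q3)
}
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```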
   
   Please share your logs to help debug this problem further. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

