[ https://issues.apache.org/jira/browse/HUDI-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Istvan Darvas updated HUDI-3362:
--------------------------------

Description:

Hi Guys,

Environment: AWS EMR 6.4 / Hudi v0.8.0

Problem: I have a CoW table which is ingested by DeltaStreamer (batch style: every 5 minutes from Kafka), and after a certain time DeltaStreamer stops working with a message like this:

{{diagnostics: User class threw exception: org.apache.hudi.exception.HoodieRollbackException: Found commits after time :20220131215051, please rollback greater commits first}}

It is usually a replace commit; I am pretty sure about that.

I have these commits in the timeline:

20220131214354 <- before
20220131215051 <- error message
20220131215514 <- after

So, as was suggested to me, I tried to roll back with the following steps in hudi-cli:

1.) connect --path s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep / SUCCESS
2.) savepoint create --commit 20220131214354 --sparkMaster local[2] / SUCCESS
3.) savepoint rollback --savepoint 20220131214354 --sparkMaster local[2] / FAILED
4.) savepoint create --commit 20220131215514 --sparkMaster local[2] / SUCCESS
5.) savepoint rollback --savepoint 20220131215514 --sparkMaster local[2] / FAILED

Long story short, when I run into a situation like this, I am not able to solve it with the known methods ;) - my use case is a work in progress, but I cannot go to prod with an issue like this.

My question: what would be the right steps / commands to resolve an issue like this, so that DeltaStreamer can be restarted again?

This table does not contain dimension data, so I am happy to share the whole table if someone is curious (if that is needed or would be helpful, let's discuss the sharing in a private mail / Slack). ~15GB ;) Ingestion stopped after a few runs, right after the 1st clustering.

I use this clustering config in DeltaStreamer:

hoodie.clustering.inline=true
hoodie.clustering.inline.enabled=true
hoodie.clustering.inline.max.commits=36
hoodie.clustering.plan.strategy.sort.columns=correlation_id
hoodie.clustering.plan.strategy.daybased.lookback.partitions=7
hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.max.bytes.per.group=671088640

I hope there is someone who can help me tackle this, because if I am able to solve this manually, I will be confident enough to go to prod.

So thanks in advance,
Darvi

Slack Hudi: istvan darvas / U02NTACPHPU

was: the same description, except that the table was described as MoR instead of CoW.
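The error message itself suggests the required order: instants with timestamps greater than 20220131215051 must be rolled back before that instant can be touched, newest first. A minimal hudi-cli sketch of that order is below. It assumes hudi-cli's {{commit rollback}} command; whether it can undo a pending replace commit in 0.8.0 is exactly what this issue puts in question, so treat it as something to try rather than a confirmed fix:

{code}
connect --path s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep
commits show
commit rollback --commit 20220131215514
commit rollback --commit 20220131215051
{code}

A savepoint, by contrast, only marks an instant to restore to; rolling back to it still requires every later instant to be removable, which is consistent with steps 3.) and 5.) above failing.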
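For reproduction purposes, clustering settings like the ones above are typically passed to DeltaStreamer either through a properties file ({{--props}}) or as individual {{--hoodie-conf}} overrides. The spark-submit sketch below is hypothetical: the bundle path, Kafka source class, ordering field, and target table name (inferred from the base path) are placeholders rather than details from this report; only the --hoodie-conf values repeat the reporter's config:

{code}
# hypothetical invocation; adjust bundle path, source class, and schema handling to the real job
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep \
  --target-table ingress_pkg_decoded_rep \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=36 \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=correlation_id
{code}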
> Hudi 0.8.0 cannot rollback CoW table
> ------------------------------------
>
>                 Key: HUDI-3362
>                 URL: https://issues.apache.org/jira/browse/HUDI-3362
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Istvan Darvas
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>         Attachments: hoodie.zip, rollback-on-a-not-damaged-table-SUCCESS.pdf,
> rollback-on-a-not-damaged-table-SUCCESS.txt, rollback_20220131215514.txt,
> rollback_log.txt, rollback_log_v2.txt
--
This message was sent by Atlassian Jira
(v8.20.1#820001)