[ https://issues.apache.org/jira/browse/HUDI-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-2774:
--------------------------------------
Description:
Setup:
Started deltastreamer with a parquet DFS source. The source folder did not have
any data to begin with. Enabled async clustering with the props below:
```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```
Added 1 file to the source folder, and deltastreamer failed while processing it.
The commit went through fine, and it looks like the 1st replace commit also went
through fine, but deltastreamer still failed. I need to understand why
deltastreamer tries to schedule a 2nd replace commit as well.
The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
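The same check can be scripted so it is easy to repeat on other tables. This is a minimal Python sketch, not part of Hudi. It assumes the timeline sits in a local `.hoodie` directory and that the file ID is stored as plain UTF-8 bytes inside the requested files, as the grep matches suggest. `find_requested_plans` is a hypothetical helper name, and unlike the grep it only looks at `*.replacecommit.requested` files.

```python
from pathlib import Path


def find_requested_plans(hoodie_dir, needle):
    """Return names of *.replacecommit.requested files that contain `needle`.

    Roughly mirrors `grep <needle> .hoodie/* | grep replacecommit`,
    restricted to requested clustering plans.
    """
    needle_bytes = needle.encode("utf-8")
    matches = []
    for path in sorted(Path(hoodie_dir).glob("*.replacecommit.requested")):
        # The requested files are binary, so compare raw bytes.
        if needle_bytes in path.read_bytes():
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    # Directory and file ID are the ones quoted in this report.
    hits = find_requested_plans(
        "/tmp/hudi-deltastreamer-gh-mw/.hoodie",
        "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000",
    )
    for name in hits:
        print(name)
    if len(hits) > 1:
        print("same file is referenced by more than one requested plan")
```

More than one hit for a single file ID means the same file group was picked up by more than one requested clustering plan.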
timeline
!Screen Shot 2021-11-16 at 12.42.20 PM.png!
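Since the screenshot is not reproduced as text, a similar view of the replace commits can be rebuilt from the file names under `.hoodie`. This is a rough sketch under assumptions: it relies on the `<instant time>.replacecommit.<state>` naming seen in the paths above, and it treats a file with no state suffix as a completed instant. `replacecommit_states` and `pending_instants` are hypothetical helper names, not Hudi APIs.

```python
from collections import defaultdict
from pathlib import Path


def replacecommit_states(hoodie_dir):
    """Map each replacecommit instant time to the set of states found on disk."""
    states = defaultdict(set)
    for path in Path(hoodie_dir).iterdir():
        parts = path.name.split(".")
        # Expect "<instant>.replacecommit" or "<instant>.replacecommit.<state>".
        if len(parts) < 2 or parts[1] != "replacecommit":
            continue
        # Assumption: no suffix after the action means the instant completed.
        state = parts[2] if len(parts) > 2 else "completed"
        states[parts[0]].add(state)
    return dict(states)


def pending_instants(states):
    """Return instant times with no completed replacecommit file, oldest first."""
    return sorted(t for t, found in states.items() if "completed" not in found)
```

Calling `pending_instants(replacecommit_states("/tmp/hudi-deltastreamer-gh-mw/.hoodie"))` would list the replace commits that were requested or inflight but never completed, which is the part of the timeline this issue is about.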
was:
Setup:
Started deltastreamer with a parquet DFS source. The source folder did not have
any data to begin with. Enabled async clustering with the props below:
```
hoodie.clustering.async.max.commits=2
hoodie.clustering.plan.strategy.sort.columns=type,id
```
Added 1 file to the source folder, and deltastreamer failed while processing it.
The commit went through fine. The 1st replace commit got scheduled, but the
timeline shows a 2nd one as well.
The clustering plan seems to be the same in both requested meta files:
{code:java}
grep "2b9b3f9d-f68c-4404-8352-1708089d2cca-0_13-49-202_20211116123721000" /tmp/hudi-deltastreamer-gh-mw/.hoodie/* | grep replacecommit
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123724586.replacecommit.requested matches
Binary file /tmp/hudi-deltastreamer-gh-mw/.hoodie/20211116123725199.replacecommit.requested matches
{code}
timeline
!Screen Shot 2021-11-16 at 12.42.20 PM.png!
> Async Clustering via deltastreamer fails with IllegalStateException: Duplicate
> key [==>20211116123724586__replacecommit__INFLIGHT]
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-2774
> URL: https://issues.apache.org/jira/browse/HUDI-2774
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sivabalan narayanan
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 0.10.0
>
> Attachments: Screen Shot 2021-11-16 at 12.42.20 PM.png
>