[ https://issues.apache.org/jira/browse/FLINK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404838#comment-17404838 ]
Carl commented on FLINK-23730:
------------------------------

[~luoyuxia] Thanks for your reply. I have updated the description of the issue.

> Source from hive sink hbase lost data
> -------------------------------------
>
>                 Key: FLINK-23730
>                 URL: https://issues.apache.org/jira/browse/FLINK-23730
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / HBase, Connectors / Hive
>    Affects Versions: 1.12.1
>            Reporter: Carl
>            Priority: Major
>         Attachments: image-2021-08-26-09-43-39-055.png,
> image-2021-08-26-09-44-20-390.png, image-2021-08-26-09-50-35-061.png,
> image-2021-08-26-09-51-38-899.png, image-2021-08-26-09-52-35-808.png,
> image-2021-08-26-10-05-23-289.png
>
>
> Our use case is as follows:
> # Hive source: create a Hive table whose metadata is stored in HMS
> # Create the HBase table using the HBase shell
> # Flink SQL DDL: create the HBase Flink table
> # Use the Hive catalog: use Flink SQL to insert into the HBase Flink table
> If I set the table config table.exec.hive.infer-source-parallelism = false,
> the program runs with a parallelism of one, and the number of result records
> is correct.
> But if I set the table config table.exec.hive.infer-source-parallelism = true,
> the program runs with a parallelism of twenty, which indicates that the source
> parallelism is inferred from the number of splits, and the number of result
> records is not correct.
>
> The test was repeated many times and no exception occurred.
>
> So I guess it has something to do with high concurrency. Does it lose data
> because of high concurrency?
>
> update1----------------------
>
> The program is as follows:
>
> *HBase table:*
> !image-2021-08-26-09-50-35-061.png!
>
> *Flink HBase table:*
> !image-2021-08-26-09-51-38-899.png!
>
> *Hive table:*
> !image-2021-08-26-09-52-35-808.png!
>
> I modify these two options to control the parallelism of Flink:
> *table.exec.hive.infer-source-parallelism*
> *table.exec.hive.infer-source-parallelism.max*
> 1. If I set table.exec.hive.infer-source-parallelism=*false*, Flink runs with
> a parallelism of one, and the result is correct.
> 2. If I set them as follows, Flink runs with a parallelism of 10, and the
> result is correct.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*10*
> 3. But if I set them as follows, Flink runs with a parallelism of 20, and the
> result is not correct: the total number of rows in HBase is less than in Hive.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*20*
>
> The data of the Hive table is as follows:
>
> !image-2021-08-26-10-05-23-289.png!
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
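
The parallelism values reported in the three cases (1, 10 and 20) are consistent with the inferred source parallelism being the number of splits capped by table.exec.hive.infer-source-parallelism.max. The following is only a minimal sketch of that arithmetic, not Flink's code: the function name is hypothetical, the split count of 20 is an assumption read off the report, and the fallback of 1 when inference is disabled is the value observed by the reporter rather than a fixed rule.

```python
def inferred_source_parallelism(split_count, infer, infer_max, default_parallelism=1):
    """Hypothetical helper mirroring the behaviour described in the report.

    split_count         -- number of Hive input splits (assumed, not measured)
    infer               -- value of table.exec.hive.infer-source-parallelism
    infer_max           -- value of table.exec.hive.infer-source-parallelism.max
    default_parallelism -- parallelism used when inference is off (1 in the report)
    """
    if not infer:
        # Inference disabled: the source falls back to the job's own parallelism.
        return default_parallelism
    # Inference enabled: one subtask per split, capped by the configured maximum.
    return max(1, min(split_count, infer_max))


SPLITS = 20  # assumption: the report saw 20 subtasks when inference followed the split count

# The three configurations from the report: (infer, max)
for infer, infer_max in [(False, 20), (True, 10), (True, 20)]:
    parallelism = inferred_source_parallelism(SPLITS, infer, infer_max)
    print(f"infer={infer}, max={infer_max} -> parallelism {parallelism}")
```

This only reproduces the parallelism numbers; it does not explain why rows go missing at a parallelism of 20.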