[ https://issues.apache.org/jira/browse/FLINK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl updated FLINK-23730:
-------------------------
    Description:
Our use case is as follows:
 # Hive source: create a Hive table whose metadata is stored in HMS.
 # Create the HBase table using the HBase shell.
 # Flink SQL DDL: create the HBase Flink table.
 # Use the Hive catalog: use Flink SQL to insert into the HBase Flink table.

If I set the table config table.exec.hive.infer-source-parallelism = false, the program runs with a parallelism of 1 and the number of result records is correct.

But if I set the table config table.exec.hive.infer-source-parallelism = true, the program runs with a parallelism of 20, which indicates that the source parallelism is inferred from the number of splits, and the number of result records is not correct.

The test was repeated many times and no exception occurred.

So I guess it has something to do with high concurrency. Does it lose data because of high concurrency?

update1----------------------

The program is as follows.

*HBase table:*
!image-2021-08-26-09-50-35-061.png!

*Flink HBase table:*
!image-2021-08-26-09-51-38-899.png!

*Hive table:*
!image-2021-08-26-09-52-35-808.png!

Modify these two options to control the parallelism of Flink:
*table.exec.hive.infer-source-parallelism*
*table.exec.hive.infer-source-parallelism.max*

1. If I set table.exec.hive.infer-source-parallelism=*false*, Flink runs with a parallelism of 1, and the result is correct.

2. If I set them as follows, Flink runs with a parallelism of 10, and the result is correct.
table.exec.hive.infer-source-parallelism=*true*
table.exec.hive.infer-source-parallelism.max=*10*

3. But if I set them as follows, Flink runs with a parallelism of 20, and the result is not correct. The total number of rows in HBase is less than in Hive.
table.exec.hive.infer-source-parallelism=*true*
table.exec.hive.infer-source-parallelism.max=*20*

The data of the Hive table is as follows:

!image-2021-08-26-10-05-23-289.png!
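To make the three configurations above easier to compare, here is a minimal sketch of how the source parallelism seems to follow from the two options. It is not the Hive connector's code. The helper name `sourceParallelism`, the split count of 20, the default parallelism of 1 and the cap of 1000 are assumptions chosen to match the numbers in this report.

```java
// Sketch only: models the parallelism values described in this report.
// It does not call Flink; every name and constant below is illustrative.
final class InferParallelismSketch {

    /**
     * Hypothetical helper.
     * infer              value of table.exec.hive.infer-source-parallelism
     * inferMax           value of table.exec.hive.infer-source-parallelism.max
     * splitCount         number of input splits of the Hive table (assumed)
     * defaultParallelism parallelism used when inference is switched off (assumed)
     */
    static int sourceParallelism(boolean infer, int inferMax, int splitCount, int defaultParallelism) {
        if (!infer) {
            // Inference off: the source runs with the job's default parallelism.
            return defaultParallelism;
        }
        // Inference on: one subtask per split, capped by the max option, never below 1.
        return Math.max(1, Math.min(splitCount, inferMax));
    }

    public static void main(String[] args) {
        int splitCount = 20;         // assumption: the report observed 20 subtasks when uncapped
        int defaultParallelism = 1;  // assumption: the report observed 1 with inference off
        int assumedDefaultMax = 1000; // assumption: cap used when .max is not set explicitly

        System.out.println("infer=false           -> "
                + sourceParallelism(false, assumedDefaultMax, splitCount, defaultParallelism));
        System.out.println("infer=true, max unset -> "
                + sourceParallelism(true, assumedDefaultMax, splitCount, defaultParallelism));
        System.out.println("infer=true, max=10    -> "
                + sourceParallelism(true, 10, splitCount, defaultParallelism));
        System.out.println("infer=true, max=20    -> "
                + sourceParallelism(true, 20, splitCount, defaultParallelism));
    }
}
```

Under these assumptions the four cases give 1, 20, 10 and 20, which matches the parallelism values reported above. The sketch says nothing about why rows go missing at a parallelism of 20.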
was:
Our use case is as follows:
 # Hive source: create a Hive table whose metadata is stored in HMS.
 # Create the HBase table using the HBase shell.
 # Flink SQL DDL: create the HBase Flink table.
 # Use the Hive catalog: use Flink SQL to insert into the HBase Flink table.

If I set the table config table.exec.hive.infer-source-parallelism = false, the program runs with a parallelism of 1 and the number of result records is correct.

But if I set the table config table.exec.hive.infer-source-parallelism = true, the program runs with a parallelism of 20, which indicates that the source parallelism is inferred from the number of splits, and the number of result records is not correct.

The test was repeated many times and no exception occurred.

So I guess it has something to do with high concurrency. Does it lose data because of high concurrency?


> Source from hive sink hbase lost data
> -------------------------------------
>
>                 Key: FLINK-23730
>                 URL: https://issues.apache.org/jira/browse/FLINK-23730
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / HBase, Connectors / Hive
>    Affects Versions: 1.12.1
>            Reporter: Carl
>            Priority: Major
>         Attachments: image-2021-08-26-09-43-39-055.png,
> image-2021-08-26-09-44-20-390.png, image-2021-08-26-09-50-35-061.png,
> image-2021-08-26-09-51-38-899.png, image-2021-08-26-09-52-35-808.png,
> image-2021-08-26-10-05-23-289.png