[jira] [Commented] (FLINK-23730) Source from hive sink hbase lost data

Carl (Jira) Wed, 25 Aug 2021 19:07:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404838#comment-17404838
 ]


Carl commented on FLINK-23730:
------------------------------

[~luoyuxia] Thanks for your reply. I have updated the description of the issue. 

> Source from hive sink hbase lost data
> -------------------------------------
>
>                 Key: FLINK-23730
>                 URL: https://issues.apache.org/jira/browse/FLINK-23730
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / HBase, Connectors / Hive
>    Affects Versions: 1.12.1
>            Reporter: Carl
>            Priority: Major
>         Attachments: image-2021-08-26-09-43-39-055.png, 
> image-2021-08-26-09-44-20-390.png, image-2021-08-26-09-50-35-061.png, 
> image-2021-08-26-09-51-38-899.png, image-2021-08-26-09-52-35-808.png, 
> image-2021-08-26-10-05-23-289.png
>
>
> Our use case is as follows:
>  # Hive source: create a Hive table whose metadata is stored in HMS
>  # create the HBase table using the HBase shell
>  # Flink SQL DDL: create a Flink table backed by the HBase table
>  # use the Hive catalog: use Flink SQL to insert into the HBase Flink table
> If I set the table config table.exec.hive.infer-source-parallelism = false,
> the program runs with a parallelism of 1, and the number of records in the 
> result is correct.
> But if I set the table config table.exec.hive.infer-source-parallelism = true,
> the program runs with a parallelism of 20, which means the source parallelism 
> is inferred from the number of splits, and the number of records in the 
> result is not correct.
>  
> The test was repeated many times and no exception occurred.
>  
> So I guess it has something to do with high concurrency. Does it lose data 
> because of high concurrency?
>  
>  update1----------------------
>  
> The program is as follows,
>  
> *HBase table:*
> !image-2021-08-26-09-50-35-061.png!
>  
> *Flink HBase table:*
> !image-2021-08-26-09-51-38-899.png!
>  
> *Hive table:*
> !image-2021-08-26-09-52-35-808.png!
>  
> Two options control the parallelism of Flink:
> *table.exec.hive.infer-source-parallelism*
> *table.exec.hive.infer-source-parallelism.max*
> 1. If I set table.exec.hive.infer-source-parallelism=*false*, Flink runs 
> with a parallelism of 1, and the result is correct.
> 2. If I set them as follows, Flink runs with a parallelism of 10, and the 
> result is correct.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*10*
> 3. But if I set them as follows, Flink runs with a parallelism of 20, and the 
> result is not correct: the total number of rows in HBase is less than in Hive.
> table.exec.hive.infer-source-parallelism=*true*
> table.exec.hive.infer-source-parallelism.max=*20*
>  
> The data in the Hive table is as follows:
>  
> !image-2021-08-26-10-05-23-289.png!
>  
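
For readers following the three configurations in the quoted description, the observed parallelism values (1, 10, 20) are consistent with a simple rule: when inference is on, the source parallelism is the number of splits capped by the .max option. The sketch below models only that rule. It is not Flink's code; the class name, method name and the split count of 20 are assumptions taken from the numbers reported above, and it says nothing about why rows go missing at parallelism 20.

```java
// Simplified model of the behaviour described in the report; NOT Flink's actual code.
final class SourceParallelismSketch {

    /**
     * Hypothetical helper: returns the source parallelism, assuming that
     * inference uses the split count capped by the configured maximum.
     */
    static int sourceParallelism(boolean inferEnabled, int inferMax,
                                 int splitCount, int defaultParallelism) {
        if (!inferEnabled) {
            // table.exec.hive.infer-source-parallelism = false
            return defaultParallelism;
        }
        // table.exec.hive.infer-source-parallelism = true:
        // split count, capped by ...infer-source-parallelism.max, at least 1
        return Math.max(1, Math.min(splitCount, inferMax));
    }

    public static void main(String[] args) {
        int splits = 20;            // assumed: 20 subtasks were seen with inference uncapped
        int defaultParallelism = 1; // assumed: the job ran with parallelism 1 without inference

        // Case 1: inference off (the max value is ignored)
        System.out.println(sourceParallelism(false, 1000, splits, defaultParallelism)); // 1
        // Case 2: inference on, max = 10
        System.out.println(sourceParallelism(true, 10, splits, defaultParallelism));    // 10
        // Case 3: inference on, max = 20
        System.out.println(sourceParallelism(true, 20, splits, defaultParallelism));    // 20
    }
}
```

Under this model, cases 2 and 3 differ only in how many parallel source subtasks feed the HBase sink, so the comparison in the report isolates the parallelism level rather than the inference switch itself.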



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
