[jira] [Updated] (FLINK-23730) Source from hive sink hbase lost data

Carl (Jira) Wed, 25 Aug 2021 19:06:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Carl updated FLINK-23730:
-------------------------
    Description: 
Our use case is as follows:
 # Hive source: create a Hive table whose metadata is stored in HMS
 # create the HBase table using the HBase shell
 # Flink SQL DDL: create a Flink table backed by the HBase table
 # use the Hive catalog: run a Flink SQL INSERT INTO the HBase-backed Flink table

If I set the table config table.exec.hive.infer-source-parallelism = false, 
the program runs with a parallelism of one, and the number of records in the 
result is correct.

But if I set the table config table.exec.hive.infer-source-parallelism = true, 
the program runs with a parallelism of twenty, which means the source 
parallelism is inferred from the number of splits, and the number of records 
in the result is not correct.

The test was repeated many times and no exception occurred.

So I guess it has something to do with high concurrency. Does it lose data 
because of high concurrency?

 

 Update 1 ----------------------

The program is as follows:

 

*HBase table :*

!image-2021-08-26-09-50-35-061.png!

 

*Flink HBase table:*

!image-2021-08-26-09-51-38-899.png!

 

 *hive table:*

!image-2021-08-26-09-52-35-808.png!

 

I modified the following two options to control the parallelism of Flink:

*table.exec.hive.infer-source-parallelism*

*table.exec.hive.infer-source-parallelism.max*
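
As an illustration of how these two options might interact, the sketch below models the inferred source parallelism as the number of Hive splits capped by the .max value. This is an assumption that matches the behavior reported in this issue, not a copy of Flink's implementation. The function name, the split count of 20, and the fallback parallelism of 1 are hypothetical; the default cap of 1000 is the documented default of the .max option.

```python
def inferred_source_parallelism(num_splits, infer=True,
                                max_parallelism=1000, default_parallelism=1):
    """Illustrative model only (assumption): parallelism = min(splits, cap), at least 1."""
    if not infer:
        # Inference disabled: fall back to the job's configured parallelism.
        return max(1, default_parallelism)
    return max(1, min(num_splits, max_parallelism))


# Hypothetical input: a Hive table that is read as 20 splits.
for cap in (10, 20):
    print(cap, inferred_source_parallelism(num_splits=20, max_parallelism=cap))
print("infer disabled:", inferred_source_parallelism(num_splits=20, infer=False))
```

Under this model, raising the cap only changes the parallelism as long as the cap is still below the number of splits.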

1. If I set table.exec.hive.infer-source-parallelism=*false*, Flink runs with 
a parallelism of one, and the result is correct.

2. If I set them as follows, Flink runs with a parallelism of 10, and the 
result is correct.

table.exec.hive.infer-source-parallelism=*true*

table.exec.hive.infer-source-parallelism.max=*10*

3. But if I set them as follows, Flink runs with a parallelism of 20, and the 
result is not correct: the total number of rows in HBase is less than in Hive.

table.exec.hive.infer-source-parallelism=*true*

table.exec.hive.infer-source-parallelism.max=*20*
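
When comparing the row counts of the two tables, it may help to separate rows that are genuinely missing from rows that merely share a row key, since HBase keeps a single row per row key. The sketch below is a hypothetical helper, not part of the reported program: it assumes the row keys have been exported from Hive and from an HBase scan into plain Python lists, and the sample keys are made up.

```python
from collections import Counter


def summarize_row_keys(row_keys):
    """Count source rows, distinct keys, and keys that occur more than once."""
    counts = Counter(row_keys)
    duplicates = {key: n for key, n in counts.items() if n > 1}
    return {
        "source_rows": len(row_keys),
        "distinct_keys": len(counts),
        "duplicates": duplicates,
    }


def missing_keys(source_keys, sink_keys):
    """Return the keys present in the source but absent from the sink."""
    return sorted(set(source_keys) - set(sink_keys))


# Hypothetical sample data standing in for the exported keys.
hive_keys = ["r1", "r2", "r2", "r3", "r4"]
hbase_keys = ["r1", "r2", "r4"]

print(summarize_row_keys(hive_keys))
print(missing_keys(hive_keys, hbase_keys))
```

Duplicate keys would reduce the HBase row count at any parallelism, so the list of missing keys is the more relevant output when the count only differs at a parallelism of 20.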

 

The data of the Hive table is as follows:

 

!image-2021-08-26-10-05-23-289.png!

 

  was:
Our use case is as follows:
 # Hive source: create a Hive table whose metadata is stored in HMS
 # create the HBase table using the HBase shell
 # Flink SQL DDL: create a Flink table backed by the HBase table
 # use the Hive catalog: run a Flink SQL INSERT INTO the HBase-backed Flink table

If I set the table config table.exec.hive.infer-source-parallelism = false, 
the program runs with a parallelism of one, and the number of records in the 
result is correct.

But if I set the table config table.exec.hive.infer-source-parallelism = true, 
the program runs with a parallelism of twenty, which means the source 
parallelism is inferred from the number of splits, and the number of records 
in the result is not correct.

The test was repeated many times and no exception occurred.

So I guess it has something to do with high concurrency. Does it lose data 
because of high concurrency?
 


> Source from hive sink hbase lost data
> -------------------------------------
>
>                 Key: FLINK-23730
>                 URL: https://issues.apache.org/jira/browse/FLINK-23730
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / HBase, Connectors / Hive
>    Affects Versions: 1.12.1
>            Reporter: Carl
>            Priority: Major
>         Attachments: image-2021-08-26-09-43-39-055.png, 
> image-2021-08-26-09-44-20-390.png, image-2021-08-26-09-50-35-061.png, 
> image-2021-08-26-09-51-38-899.png, image-2021-08-26-09-52-35-808.png, 
> image-2021-08-26-10-05-23-289.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
