[ 
https://issues.apache.org/jira/browse/IMPALA-12187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011979#comment-18011979
 ] 

ASF subversion and git services commented on IMPALA-12187:
----------------------------------------------------------

Commit 447c016ae18bd89902ff8ac2cd3a5298360c0d50 in impala's branch 
refs/heads/master from Sai Hemanth Gantasala
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=447c016ae ]

IMPALA-12187: Fix flaky test_event_based_replication()

TestEventProcessing.test_event_based_replication becomes flaky when
replication lags for a database that has many events to replicate.
Case III of the test is flaky because the event processor has to
process so many ALTER_PARTITIONS events that the valid writeId list
can be inaccurate while replication is still incomplete. This change
introduces a 20 second timeout in case III after replication so that
the event processor processes events only after the replication
process is completely done.
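
The patch itself is not included in this notification. As a minimal
sketch of the general idea only, assuming a hypothetical helper named
wait_until and a caller-supplied condition (neither comes from Impala's
test framework), a test can give a lagging background process a bounded
window to catch up before asserting. This is a polling variant with a
deadline, not the exact change made in the commit:

```python
import time


def wait_until(predicate, timeout_s=20.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper for illustration. `clock` and `sleep` are
    injectable so the helper can be exercised without real waiting.
    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            # Give up; the caller decides whether this is a failure.
            return False
        sleep(interval_s)
```

A test could call wait_until(lambda: row_count() == 0, timeout_s=20)
before its final assertion, where row_count is a hypothetical function
that queries the target table. Polling returns as soon as the condition
holds, whereas a fixed sleep always pays the full delay.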

Testing:
- Looped the test 100 times to check for flakiness

Change-Id: I89fcd951f6a65ab7fe97c4f23554d93d9ba12f4e
Reviewed-on: http://gerrit.cloudera.org:8080/22131
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Riza Suminto <[email protected]>


> TestEventProcessing.test_event_based_replication flaky for truncate table
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-12187
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12187
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Sai Hemanth Gantasala
>            Priority: Critical
>              Labels: broken-build, flaky
>
> A couple of Jenkins jobs have seen a failure in 
> TestEventProcessing.test_event_based_replication() where the test 
> expects the truncated table to have zero rows, but the table instead has 
> 100 rows:
> {noformat}
> metadata/test_event_processing.py:180: in test_event_based_replication
>     self.__run_event_based_replication_tests()
> metadata/test_event_processing.py:329: in __run_event_based_replication_tests
>     assert rows_in_part_tbl_target == 0
> E   assert 100 == 0{noformat}
> More logs:
> {noformat}
> truncate table repl_source_tsmyd.part_tbl;
> -- 2023-06-02 06:44:19,049 INFO     MainThread: Started query 
> 50469ac62856f797:53e74fb400000000
> -- 2023-06-02 06:44:41,638 INFO     MainThread: Waiting until events 
> processor syncs to event id:32187
> -- 2023-06-02 06:44:42,596 DEBUG    MainThread: Metric last-synced-event-id 
> has reached the desired value: 32187
> -- 2023-06-02 06:44:42,632 DEBUG    MainThread: Found 3 impalad/1 
> statestored/1 catalogd process(es)
> -- 2023-06-02 06:44:42,648 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:42,651 INFO     MainThread: Sleeping 1s before next retry.
> -- 2023-06-02 06:44:43,653 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:43,669 INFO     MainThread: Sleeping 1s before next retry.
> -- 2023-06-02 06:44:44,670 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:44,674 INFO     MainThread: Sleeping 1s before next retry.
> -- 2023-06-02 06:44:45,676 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:45,679 INFO     MainThread: Sleeping 1s before next retry.
> -- 2023-06-02 06:44:46,680 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:46,683 INFO     MainThread: Sleeping 1s before next retry.
> -- 2023-06-02 06:44:47,685 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25000
> -- 2023-06-02 06:44:47,688 INFO     MainThread: Metric 'catalog.curr-version' 
> has reached desired value: 9771
> -- 2023-06-02 06:44:47,688 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25001
> -- 2023-06-02 06:44:47,691 INFO     MainThread: Metric 'catalog.curr-version' 
> has reached desired value: 9771
> -- 2023-06-02 06:44:47,691 INFO     MainThread: Getting metric: 
> catalog.curr-version from hostname:25002
> -- 2023-06-02 06:44:47,694 INFO     MainThread: Metric 'catalog.curr-version' 
> has reached desired value: 9771
> -- executing against localhost:21000
> select count(*) from repl_target_hhkuw.unpart_tbl;
> -- 2023-06-02 06:44:47,697 INFO     MainThread: Started query 
> 6c40644e00cdf143:3be5e75a00000000
> -- executing against localhost:21000
> select count(*) from repl_target_hhkuw.part_tbl;{noformat}
> This was seen in a debug core job and a debug erasure coding job, and only 
> for the partitioned table, not the unpartitioned table.
> This symptom doesn't seem to match the existing flakiness for 
> TestEventProcessing.test_event_based_replication().



