[jira] [Created] (KYLIN-5339) Renew Epoch Retry did not interrupt the old thread in time, and the new thread failed to write data, resulting in kylin losing epoch

sibing.zhang (Jira) Mon, 05 Dec 2022 22:56:06 -0800

sibing.zhang created KYLIN-5339:
-----------------------------------

             Summary: Renew Epoch Retry did not interrupt the old thread in 
time, and the new thread failed to write data, resulting in kylin losing epoch
                 Key: KYLIN-5339
                 URL: https://issues.apache.org/jira/browse/KYLIN-5339
             Project: Kylin
          Issue Type: Bug
    Affects Versions: 5.0-alpha
            Reporter: sibing.zhang
             Fix For: 5.0-alpha
         Attachments: 31c439f4-0a2b-4616-949d-415f4b417f2e.png, 
602360ee-fa81-4c8c-a7d2-4fdd73d284ff.png


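Epoch renew has two retries, and each attempt has a 60s timeout. Renew runs in a thread pool whose size is set by the switch kylin.server.renew-epoch-pool-size=3. The problem is that when a renew thread exceeds the 60s timeout, the thread is not terminated, yet another renew thread is started and updates the same data. Because the first thread was never terminated, it eventually renews successfully and MVCC is incremented by 1. When the later thread renews, it checks the MVCC value: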

!31c439f4-0a2b-4616-949d-415f4b417f2e.png|width=583,height=64!

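At this point no row satisfies the condition, so the update returns affectedRows = 0. As a result, the current node loses ownership of all projects and therefore shuts down the job schedulers of all projects. The flow is shown in the figure below: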
 
*!602360ee-fa81-4c8c-a7d2-4fdd73d284ff.png|width=560,height=574!*
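
The interleaving described above can be reproduced in a small, self-contained Java sketch. This is not Kylin code: the class name EpochRenewRace, the AtomicLong standing in for the MVCC column, and the 50 ms timeout (60s in the report) are assumptions made for illustration, and latches are used only to force the ordering deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

class EpochRenewRace {
    // Hypothetical stand-in for the MVCC column of the epoch row.
    static final AtomicLong mvcc = new AtomicLong(0);

    // Mimics "UPDATE ... SET mvcc = mvcc + 1 WHERE mvcc = ?" and returns affected rows.
    static int updateIfMvccMatches(long expected) {
        return mvcc.compareAndSet(expected, expected + 1) ? 1 : 0;
    }

    // Returns {affected rows of the first attempt, affected rows of the retry}.
    static int[] simulate() throws Exception {
        mvcc.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(3); // pool size 3, as in the report
        CountDownLatch releaseFirst = new CountDownLatch(1);
        CountDownLatch retryHasRead = new CountDownLatch(1);
        CountDownLatch releaseRetry = new CountDownLatch(1);
        try {
            Future<Integer> first = pool.submit(() -> {
                long seen = mvcc.get();
                releaseFirst.await(); // simulates a slow database round trip
                return updateIfMvccMatches(seen);
            });
            try {
                first.get(50, TimeUnit.MILLISECONDS); // 60s in the report
            } catch (TimeoutException e) {
                // The bug: the first attempt is neither cancelled nor rolled back here.
            }
            Future<Integer> retry = pool.submit(() -> {
                long seen = mvcc.get(); // still sees the old MVCC value
                retryHasRead.countDown();
                releaseRetry.await();
                return updateIfMvccMatches(seen);
            });
            retryHasRead.await();
            releaseFirst.countDown();    // the stale first attempt now commits
            int firstRows = first.get();
            releaseRetry.countDown();    // the retry runs its update afterwards
            int retryRows = retry.get();
            return new int[] {firstRows, retryRows};
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] rows = simulate();
        System.out.println("first=" + rows[0] + ", retry=" + rows[1]); // first=1, retry=0
    }
}
```

The retry sees affectedRows = 0 even though the epoch was in fact renewed by the stale first attempt, which matches the symptom reported here.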
*fix design*
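Epoch renew has a retry mechanism for timeout failures ({{kylin.server.leader-race.heart-beat-timeout=60s}}). On retry, the original transaction is not stopped, and a new transaction is opened to update the database. Because the epoch update validates the MVCC value, the second transaction conflicts with the first. To address this, add a transaction timeout, Timeout = {{kylin.server.leader-race.heart-beat-timeout=60s}} - 1s. The transaction rolls back automatically on timeout, which avoids the transaction conflict on renew retry.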
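
A minimal sketch of the proposed timeout rule, on a simulated clock so that it runs instantly. It is not the actual patch: EpochTxTimeout, renew, and the elapsed times of 75s and 5s are hypothetical, and a real system would enforce the timeout through its transaction manager or database driver rather than an explicit check.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

class EpochTxTimeout {
    // Hypothetical stand-in for the MVCC column of the epoch row.
    static final AtomicLong mvcc = new AtomicLong(0);

    // Transaction timeout = heartbeat timeout - 1s, as proposed in the fix design.
    static Duration txTimeout(Duration heartBeatTimeout) {
        return heartBeatTimeout.minusSeconds(1);
    }

    // One renew attempt. An attempt that runs past the transaction timeout
    // is rolled back and never touches the row; otherwise the MVCC-guarded
    // update is applied and the affected row count is returned.
    static int renew(long seenMvcc, Duration elapsed, Duration txTimeout) {
        if (elapsed.compareTo(txTimeout) >= 0) {
            return 0; // rolled back on timeout
        }
        return mvcc.compareAndSet(seenMvcc, seenMvcc + 1) ? 1 : 0;
    }

    // Returns {affected rows of the first attempt, affected rows of the retry}.
    static int[] simulate() {
        mvcc.set(0);
        Duration timeout = txTimeout(Duration.ofSeconds(60)); // 59s
        long seenByFirst = mvcc.get();
        long seenByRetry = mvcc.get(); // the retry reads the same MVCC value
        // The first attempt would have needed 75s, so it is rolled back at 59s.
        int firstRows = renew(seenByFirst, Duration.ofSeconds(75), timeout);
        // The retry finishes in 5s and no longer conflicts with the first attempt.
        int retryRows = renew(seenByRetry, Duration.ofSeconds(5), timeout);
        return new int[] {firstRows, retryRows};
    }

    public static void main(String[] args) {
        int[] rows = simulate();
        // Prints: first=0, retry=1, mvcc=1
        System.out.println("first=" + rows[0] + ", retry=" + rows[1] + ", mvcc=" + mvcc.get());
    }
}
```

Setting the transaction timeout one second below the heartbeat timeout means the stale attempt has already rolled back before the retry begins, so the retry's MVCC check passes.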



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
