[jira] [Commented] (KAFKA-7790) Trogdor - Does not time out tasks in time

ASF GitHub Bot (JIRA) Tue, 08 Jan 2019 02:36:08 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736985#comment-16736985
 ]


ASF GitHub Bot commented on KAFKA-7790:
---------------------------------------

stanislavkozlovski commented on pull request #6103: KAFKA-7790: Expire Trogdor 
tasks
URL: https://github.com/apache/kafka/pull/6103
 
 
   This commit changes the behavior of the Trogdor agent and coordinator so 
that they do not run tasks that have expired. We define an expired task as one 
whose sum of `startedMs` and `durationMs` is less than the current time in 
milliseconds.
   
   Validation is done in both the Coordinator and the Agent. An expired task 
is marked as `DONE` with an error of `"worker expired"`.
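
The expiry rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the code from the pull request: the class `ExpirySketch` and the methods `isExpired` and `errorFor` are hypothetical names. Only `startedMs`, `durationMs`, the `DONE` state and the `"worker expired"` error text come from the description.

```java
// Hypothetical sketch of the expiry rule; these are not the actual Trogdor classes.
final class ExpirySketch {
    /** Error text quoted in the pull request description. */
    static final String EXPIRED_ERROR = "worker expired";

    /** A task is expired when startedMs + durationMs is strictly less than nowMs. */
    static boolean isExpired(long startedMs, long durationMs, long nowMs) {
        return startedMs + durationMs < nowMs;
    }

    /**
     * Returns the error to record on a task that is marked DONE because it
     * expired, or an empty string when the task is still allowed to run.
     */
    static String errorFor(long startedMs, long durationMs, long nowMs) {
        return isExpired(startedMs, durationMs, nowMs) ? EXPIRED_ERROR : "";
    }

    public static void main(String[] args) {
        long nowMs = System.currentTimeMillis();
        // Started two hours ago with a one-hour duration, so the task is expired.
        System.out.println(errorFor(nowMs - 7_200_000L, 3_600_000L, nowMs));
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the predicate, keeps the check deterministic and lets the Coordinator and the Agent share the same logic.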
 


> Trogdor - Does not time out tasks in time
> -----------------------------------------
>
>                 Key: KAFKA-7790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major
>
> All Trogdor task specifications have a defined `startMs` and `durationMs`. 
> When tasks fail and are restarted, it is intuitive to assume that a task 
> would not be re-run once that time window has passed.
> The issue is best illustrated with an example:
> {code:java}
> startMs = 12PM; durationMs = 1hour;
> # 12:02 - Coordinator schedules a task to run on agent-0
> # 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
> # 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
> re-schedules tasks that are not running in agent-0
> # 13:20 - agent-0 process dies.
> # 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
> This can result in an endless loop of task rescheduling. If there are more 
> tasks scheduled on agent-0 (e.g. a task scheduled to start every hour), we 
> can end up overwhelming the agent with tasks that we would rather have 
> dropped.
> h2. Changes
> We propose that the Trogdor Coordinator does not re-schedule a task if the 
> current time of re-scheduling is greater than the start time of the task and 
> its duration combined. More specifically:
> {code:java}
> if (currentTimeMs > startTimeMs + durationTimeMs)
>   failTask()
> else
>   scheduleTask(){code}
>  
>  
>  
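
The proposed rule, as stated in the prose (do not re-schedule once the current time is past the start time plus the duration), can be traced against the timeline in the example. The sketch below is illustrative only: `RescheduleSketch`, `Decision`, `decide` and `clockMs` are hypothetical names, and clock times are expressed as milliseconds since midnight for simplicity rather than as epoch timestamps.

```java
// Hypothetical walk-through of the re-scheduling rule; not Trogdor code.
final class RescheduleSketch {
    enum Decision { SCHEDULE, FAIL }

    /** Fail the task once the current time is past startTimeMs + durationTimeMs. */
    static Decision decide(long currentTimeMs, long startTimeMs, long durationTimeMs) {
        return currentTimeMs > startTimeMs + durationTimeMs
            ? Decision.FAIL
            : Decision.SCHEDULE;
    }

    /** Converts a wall-clock time to milliseconds since midnight (a simplification). */
    static long clockMs(int hour, int minute) {
        return hour * 3_600_000L + minute * 60_000L;
    }

    public static void main(String[] args) {
        long startTimeMs = clockMs(12, 0);   // startMs = 12PM
        long durationTimeMs = 3_600_000L;    // durationMs = 1 hour

        // 12:47, agent-0 comes back within the task window.
        System.out.println("12:47 -> " + decide(clockMs(12, 47), startTimeMs, durationTimeMs));
        // 13:22, agent-0 comes back after the window has closed.
        System.out.println("13:22 -> " + decide(clockMs(13, 22), startTimeMs, durationTimeMs));
    }
}
```

Under this rule the 12:47 restart still re-schedules the task, while the 13:22 restart fails it, which breaks the rescheduling loop described in the example.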



