[
https://issues.apache.org/jira/browse/YUNIKORN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021182#comment-18021182
]
Peter Bacsko commented on YUNIKORN-3089:
----------------------------------------
[~mitdesai] did you have the chance to try it out?
> Web UI shows stale "New" state applications that are no longer present in the
> cluster
> -------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3089
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3089
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Peter Bacsko
> Priority: Critical
> Attachments: yunikorn-spark-cd26dba9a9d54b2089eafe73562efc4d.log
>
>
> We are experiencing an issue where the YuniKorn Web UI continues to display
> applications in the *New* state, even though these applications are no longer
> present in the Kubernetes cluster. The list of such stale applications grows
> over time while the scheduler is running, and is cleared only upon a
> scheduler restart. In one instance, we observed this list growing to over
> 1200 stale applications.
> This issue is reproducible even with the *1.6.3 build* running with the
> *YUNIKORN-3084 patch* applied.
> *Steps to Reproduce:*
> # Create pods that fail immediately due to constraints (e.g., Kyverno policy
> violations).
> # Observe in the Web UI that applications remain in the New state even after
> the pods are deleted from the cluster.
> # Over time, the list of applications in the New state keeps growing.
> # Restarting the scheduler resets the list, but the problem reappears as the
> scheduler continues to run.
> *Observations:*
> * Applications remain in the *New* state in the Web UI, even after their
> corresponding pods are deleted from the cluster.
> * The problem appears to be related to the order and timing of create/delete
> events received by the core.
> * When a pod fails immediately (e.g., due to Kyverno policy violations), the
> shim receives both create and delete requests, but the core does not create
> the app in the partition context in time for the delete to be processed.
> * The core eventually processes the create request, but the corresponding
> delete had already been received before that, so the application remains in
> the New state indefinitely.
> * The shim does not take any further action, leaving the application in this
> stale state until a scheduler restart.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)