[ 
https://issues.apache.org/jira/browse/YUNIKORN-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932115#comment-17932115
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-3038:
-------------------------------------------------

This has just been fixed as part of YUNIKORN-3036. A race condition was 
reintroduced that left us with a nil queue, and dereferencing it caused the crash.
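
For readers unfamiliar with this failure mode, here is a minimal Go sketch of
one defensive pattern against a nil queue. It is not the YUNIKORN-3036 change
and does not use yunikorn-core code: the Queue and App types, the
releaseFromQueue helper and the errNilQueue value are all hypothetical names
made up for illustration. It assumes the queue reference can be cleared by
another goroutine, so the reference is read once under a lock and checked
before use, and the caller stops on the error instead of carrying on with a
nil pointer.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Queue is a hypothetical stand-in for a scheduler queue.
// It is NOT the yunikorn-core Queue type.
type Queue struct {
	path string
}

// GetQueuePath is written to be nil-safe: calling it on a nil *Queue
// returns an empty string instead of dereferencing the nil receiver.
func (q *Queue) GetQueuePath() string {
	if q == nil {
		return ""
	}
	return q.path
}

// App is a hypothetical application whose queue reference may be
// cleared concurrently, which is the assumption behind this sketch.
type App struct {
	mu    sync.RWMutex
	queue *Queue
}

// Queue returns the current queue reference under a read lock.
func (a *App) Queue() *Queue {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.queue
}

// errNilQueue is a hypothetical sentinel error for this example.
var errNilQueue = errors.New("queue is nil")

// releaseFromQueue reads the queue reference once and refuses to
// continue when it is nil, so no later code can dereference it.
func releaseFromQueue(app *App) (string, error) {
	q := app.Queue()
	if q == nil {
		return "", errNilQueue
	}
	return q.GetQueuePath(), nil
}

func main() {
	live := &App{queue: &Queue{path: "root.default"}}
	gone := &App{} // queue already removed
	for _, app := range []*App{live, gone} {
		path, err := releaseFromQueue(app)
		if err != nil {
			// Stop here: do not fall through and use the nil queue.
			fmt.Println("skipping release:", err)
			continue
		}
		fmt.Println("released from", path)
	}
}
```

The nil check inside GetQueuePath only hides the symptom. The early return in
the caller is what keeps a "queue is nil" warning from being followed by a
dereference of that same queue.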

> Nil pointer dereference on GetQueuePath
> ---------------------------------------
>
>                 Key: YUNIKORN-3038
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3038
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Thomas Cassaert
>            Priority: Major
>
> We're observing quite a few occurrences of the following panic:
> {code:java}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b037c9]
> goroutine 81 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).GetQueuePath(0x0)
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/objects/queue.go:548 +0x29
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xc00b4d0b60, 0xc0135d9a40)
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/partition.go:1501 +0xf1c
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc0000fa080, {0xc0146546a0, 0x1, 0xc0000a82a0?}, {0xc00645a340, 0x9})
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/context.go:780 +0xa9
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc00b9ebf88?, 0xc015e3ff08?)
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/context.go:716 +0x5d
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc0004002d0)
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/scheduler.go:133 +0x18e
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
>     github.com/apache/yunikorn-core@v1.6.1-2/pkg/scheduler/scheduler.go:60 +0x9c
> {code}
> This is always preceded by the following warning:
> {code:java}
> 2025-03-03T14:35:07.138Z        WARN    core.scheduler.partition        scheduler/partition.go:1483     failed to release resources from queue  {"appID": "spark-045e89205af548a6b2661e82fd3a0704", "allocationKey": "bb7808f0-4a77-4469-a394-86a9de766609", "error": "queue is nil"}
> {code}
> On inspection of the mentioned appID, we do see a queue defined:
> {code:java}
> annotations:
>     yunikorn.apache.org/task-groups: '[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"5120Mi"},"labels":{"deploy_env":"prod","driver-type":"batch","job_id":"j-250303112642423baa5a25458a7b2037","name":"2503031126-driver","openeo-role":"batch-driver","openeo_component":"batchjobs","queue":"root.default","role":"driver","user_id":"9e001a2a-1186-4b46-8f90-0f44cbcb13a9","version":"3.2.0"}},{"name":"spark-executor","minMember":2,"minResource":{"cpu":"500.0m","memory":"6920Mi"},"labels":{"deploy_env":"prod","job_id":"j-250303112642423baa5a25458a7b2037","openeo-role":"executor","openeo_component":"batchjobs","queue":"root.default","user_id":"9e001a2a-1186-4b46-8f90-0f44cbcb13a9","version":"3.2.0"}}]'
> {code}
> The label `queue: root.default` is also set on the pod.
> {*}Yunikorn version{*}: 1.6.1
> {*}Yunikorn config{*}:
> {code:java}
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: provided
>             create: true
>         queues:
>           - name: root
>             parent: true
>             submitacl: "*"
>             queues:
>               - name: default
>                 parent: false
>                 properties:
>                   preemption.policy: disabled
>               - name: cdse-prod
>                 parent: true
>                 queues:
>                   - name: batch
>                     childtemplate:
>                       maxapplications: 2
>                       properties:
>                         preemption.policy: disabled
>   service.clusterId: cdse-prod
> {code}
> {*}Kubernetes version{*}: 1.25.7



