linrrzqqq opened a new pull request, #64039:
URL: https://github.com/apache/doris/pull/64039

   Problem Summary:
   
   Previously, BE could not continue serving a query until every process in the Python UDF process pool had finished initializing.
   
   In abnormal environments, Python process startup may hang or take a very 
long time in paths such as:
   
   - `fork` / child process creation
   - waiting for the Python Flight socket to appear
   - terminating and waiting for a failed child process to exit
   
   When one process slot gets stuck, the whole process pool initialization can be blocked. As a result, FE may hit the send fragments RPC timeout before BE returns a meaningful Python UDF error: `RpcException, msg: timeout when waiting for send fragments rpc, query timeout:900, left timeout for this operation:30`.
   
   be.log:
   ```text
   Initializing Python process pool for version 3.8.19 with 8 processes
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=20508
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=40508
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=60508
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=80508
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=100508
   Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=120508
   ```
   
   ### Solution
   
   Change Python process pool initialization from "wait until all processes are 
created" to "return once at least one usable process is available".
   
   The pool no longer treats full-size initialization as a prerequisite for 
serving queries. Once one Python process is alive, the current query can 
proceed. Missing or failed process slots are repaired asynchronously by the 
existing health check / repair path.
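
The "return once at least one usable process is available" behaviour can be sketched roughly as follows. This is only an illustration under assumptions, not the code in this PR: `PoolState`, `init_pool`, and the `start_slot` callback (standing in for fork plus waiting for the Flight socket) are hypothetical names, and detached `std::thread` workers stand in for whatever worker mechanism BE actually uses.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical shared state. Workers hold a shared_ptr to it, so it stays
// valid even if a worker finishes long after init_pool() has returned.
struct PoolState {
    std::mutex mu;
    std::condition_variable cv;
    int ready = 0;          // slots that started successfully
    int finished = 0;       // slots that finished, successfully or not
    bool shutdown = false;  // set by the owner when the pool is torn down
};

// Starts every slot in the background and returns as soon as one slot is
// usable, every slot has finished, or `budget` has elapsed, whichever comes
// first. Returns true if at least one slot is usable at that point.
// start_slot(i) is an assumed callback that returns true on success.
bool init_pool(std::shared_ptr<PoolState> st, int slots,
               std::function<bool(int)> start_slot,
               std::chrono::milliseconds budget) {
    assert(slots > 0);
    for (int i = 0; i < slots; ++i) {
        std::thread([st, start_slot, i] {
            bool ok = start_slot(i);  // may hang or take a long time
            std::lock_guard<std::mutex> lk(st->mu);
            ++st->finished;
            // A result that arrives after shutdown is not written back.
            // A real pool would also terminate the late process here.
            if (ok && !st->shutdown) ++st->ready;
            st->cv.notify_all();
        }).detach();
    }
    std::unique_lock<std::mutex> lk(st->mu);
    st->cv.wait_for(lk, budget, [&] {
        return st->ready > 0 || st->finished == slots;
    });
    return st->ready > 0;
}
```

In this sketch a caller would map a `false` result to an error such as `SERVICE_UNAVAILABLE`, and slots that are still starting keep running in the background, which loosely mirrors leaving the remaining slots to the health check / repair path.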
   
   - Bound process pool initialization time so that BE can return `SERVICE_UNAVAILABLE` before the FE send fragments RPC times out.
   - Allow partial pool availability: initialization succeeds as long as one 
usable Python process exists.
   - Mark the first initialization round as completed after success or timeout, 
then rely on health check / repair to fill missing slots.
   - Add bounded wait/reap logic for Python child shutdown to avoid blocking 
indefinitely in `wait`.
   - Prevent late init / repair workers from writing back after shutdown, and discard late duplicate processes safely.
   - Share one repair guard between foreground repair and health check to avoid duplicate repair pressure.
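
As an illustration of the bounded wait/reap idea, the sketch below polls `waitpid` with `WNOHANG` against a deadline instead of blocking in `wait`, and escalates from `SIGTERM` to `SIGKILL`. It is a minimal sketch under assumptions, not the PR's implementation: `terminate_bounded`, `poll_reap`, `ReapResult`, the two timeouts, and the 10 ms poll interval are all hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ReapResult { Exited, Killed, StillRunning };

// Polls waitpid(WNOHANG) until the child is reaped or `deadline` passes.
// Returns true once the child no longer needs reaping.
static bool poll_reap(pid_t pid, std::chrono::steady_clock::time_point deadline) {
    for (;;) {
        int status = 0;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return true;                     // reaped by this call
        if (r == -1 && errno == ECHILD) return true;   // already reaped elsewhere
        if (r == -1 && errno != EINTR) return false;   // unexpected error
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

// Sends SIGTERM, waits up to `grace`, then sends SIGKILL and waits up to
// `kill_wait`. Never blocks longer than roughly grace + kill_wait.
ReapResult terminate_bounded(pid_t pid, std::chrono::milliseconds grace,
                             std::chrono::milliseconds kill_wait) {
    assert(pid > 0);
    using clock = std::chrono::steady_clock;
    kill(pid, SIGTERM);
    if (poll_reap(pid, clock::now() + grace)) return ReapResult::Exited;
    kill(pid, SIGKILL);
    if (poll_reap(pid, clock::now() + kill_wait)) return ReapResult::Killed;
    return ReapResult::StillRunning;  // give up for now; a later pass can retry
}
```

Returning `StillRunning` rather than waiting forever is the design choice that keeps the caller bounded. In this sketch the caller would be expected to record the pid and try again later.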
   
   
   



