dekuu5 commented on issue #20027: URL: https://github.com/apache/datafusion/issues/20027#issuecomment-3818140067
Hello @2010YOUY01, I spent some time investigating this issue. Initially, I could not reproduce the bug, even when running the tests 200 times in parallel. However, after writing a custom stress-test script that runs the test case with much higher concurrency (100 parallel instances), I was finally able to reproduce the failure consistently.

After debugging the reproduction, I identified a race condition in the coordination logic of the SpillPool. The poll_next function relies on a buffered stream (spawn_buffered) to read the file concurrently. The issue is that the background buffer task is not aware of the writer's status. Under heavy load, the buffer task can hit a temporary EOF (before the writer finishes) and quit prematurely. As a result, poll_next receives None from the stream and closes the reader before all batches are processed.

I changed the stream to a normal unbuffered stream and it worked. I will open a PR shortly.
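To make the race described above easier to picture, here is a minimal, deterministic sketch in plain Rust. It is not the SpillPool code: SharedSpill, read_naive, read_aware, and run are hypothetical names invented for this illustration, and the "file" is just an in-memory queue. It replays one fixed interleaving in which the reader catches up with the writer after the first batch, and compares a reader that treats an empty read as end of stream with one that also checks whether the writer has finished.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a spill file shared by one writer and one reader.
struct SharedSpill {
    batches: VecDeque<u32>,
    writer_finished: bool,
}

/// What a writer-aware read attempt can observe.
enum Read {
    Batch(u32),
    Pending, // nothing available yet, but the writer is still active
    Done,    // nothing available and the writer has finished
}

impl SharedSpill {
    fn new() -> Self {
        SharedSpill { batches: VecDeque::new(), writer_finished: false }
    }

    fn write(&mut self, batch: u32) {
        self.batches.push_back(batch);
    }

    fn finish(&mut self) {
        self.writer_finished = true;
    }

    /// Naive read: an empty queue is reported as end of stream (None),
    /// even if the writer is still producing. This models the temporary EOF.
    fn read_naive(&mut self) -> Option<u32> {
        self.batches.pop_front()
    }

    /// Writer-aware read: an empty queue only means "done" once the
    /// writer has finished; otherwise the reader should try again later.
    fn read_aware(&mut self) -> Read {
        match self.batches.pop_front() {
            Some(batch) => Read::Batch(batch),
            None if self.writer_finished => Read::Done,
            None => Read::Pending,
        }
    }
}

/// Replays a fixed schedule: the writer emits batch 1, the reader drains,
/// then the writer emits batches 2 and 3 and finishes, and the reader
/// drains again (unless it has already closed).
fn run(writer_aware: bool) -> Vec<u32> {
    let mut spill = SharedSpill::new();
    let mut seen = Vec::new();
    let mut reader_closed = false;

    let schedule: [&[u32]; 2] = [&[1], &[2, 3]];
    for (step, writes) in schedule.iter().enumerate() {
        for &batch in writes.iter() {
            spill.write(batch);
        }
        if step + 1 == schedule.len() {
            spill.finish();
        }

        // The reader runs until it either closes or has to wait.
        while !reader_closed {
            if writer_aware {
                match spill.read_aware() {
                    Read::Batch(batch) => seen.push(batch),
                    Read::Pending => break,
                    Read::Done => reader_closed = true,
                }
            } else {
                match spill.read_naive() {
                    Some(batch) => seen.push(batch),
                    None => reader_closed = true, // premature close
                }
            }
        }
    }
    seen
}

fn main() {
    println!("naive reader saw: {:?}", run(false));
    println!("writer-aware reader saw: {:?}", run(true));
}
```

In this model the naive reader closes after batch 1 and never sees batches 2 and 3, while the writer-aware reader sees all three. The change mentioned above (switching to an unbuffered stream) is a different remedy; the sketch only illustrates why a reader that cannot see the writer's status may stop early.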
