[ 
https://issues.apache.org/jira/browse/SPARK-51667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-51667:
------------------------------------

    Assignee: Jungtaek Lim

> [TWS + Python] Disable Nagle's algorithm between Python worker and State 
> Server
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-51667
>                 URL: https://issues.apache.org/jira/browse/SPARK-51667
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: pull-request-available
>
> While testing TWS + Python, we found cases where the socket 
> communication for state interaction was delayed by more than 40ms for 
> certain state operations, e.g. ListState.put(), ListState.get(), 
> ListState.appendList(), etc.
> The root cause is the combination of Nagle's algorithm and delayed ACK. 
> The sequence is as follows:
>  # Python worker sends the proto message to JVM, and flushes the socket.
>  # Additionally, Python worker sends the follow-up data to JVM, and flushes 
> the socket.
>  # JVM reads the proto message, and realizes there is follow-up data.
>  # JVM reads the follow-up data.
>  # JVM processes the request, and sends the response back to Python worker.
> Due to delayed ACK, the JVM does not send an ACK back to the Python 
> worker even after step 3. It holds the ACK until it has data to send or 
> more segments to acknowledge, but the JVM is not going to send any data 
> during that phase.
> Due to Nagle's algorithm, the message from step 2 is not sent to the JVM 
> because the message from step 1 has not been ACKed yet.
> This deadlock-like stall is resolved only when the delayed ACK timer 
> expires, which takes at least 40ms on Linux. After the timeout, the ACK is 
> sent back from the JVM to the Python worker, and Nagle's algorithm finally 
> allows the message from step 2 to be sent to the JVM.
> See the articles below for a more general explanation:
>  * [https://engineering.avast.io/40-millisecond-bug/]
>  ** Start reading from the section on Nagle's algorithm
>  * [https://brooker.co.za/blog/2024/05/09/nagle.html]
> Nagle's algorithm reduces the number of small packets, which, as the above 
> article states, can help keep routers from being overloaded. We connect to 
> "localhost" here, so no router is involved.
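
The issue title proposes disabling Nagle's algorithm on this connection. The notification does not include the actual patch, so the following is only a minimal sketch of what that looks like on a plain Python TCP socket, using the standard TCP_NODELAY option. The helper names (connect_no_delay, demo) and the stand-in localhost listener are hypothetical and are not Spark's code.

```python
import socket


def connect_no_delay(host: str, port: int) -> socket.socket:
    """Open a TCP connection with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.create_connection((host, port))
    # TCP_NODELAY makes small writes go out immediately instead of being
    # held back until earlier unacknowledged data has been ACKed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def demo() -> bool:
    """Send two small back-to-back writes over localhost, as in steps 1 and 2."""
    # Stand-in for the server side: a listener on an ephemeral localhost port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    client = connect_no_delay("127.0.0.1", port)
    conn, _ = server.accept()
    conn.settimeout(5)
    try:
        expected = b"header" + b"follow-up"
        client.sendall(b"header")      # step 1: first message
        client.sendall(b"follow-up")   # step 2: follow-up data, not delayed by Nagle
        received = b""
        while len(received) < len(expected):
            chunk = conn.recv(1024)
            if not chunk:
                break
            received += chunk
        nodelay = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
        return nodelay != 0 and received == expected
    finally:
        client.close()
        conn.close()
        server.close()


if __name__ == "__main__":
    print(demo())
```

The option only affects the sending side of the socket it is set on, so in a request/response exchange like the one described above it would presumably need to be set on whichever side issues the back-to-back small writes.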



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

