I have a question regarding how tuples are buffered between (possibly chained) subtasks.
Is it correct that there is a buffer for each vertex in the DAG of subtasks, regardless of task slot sharing? If so, the primary optimization in this regard would be operator chaining. Furthermore, how do these buffers translate into overhead? Is there a send thread and a receive thread per buffer, similar to Apache Storm? I could not find details on these buffers in the relevant subsection under Concepts. Thanks in advance.