yuxiqian opened a new pull request, #3680:
URL: https://github.com/apache/flink-cdc/pull/3680
   This fixes schema operator hanging glitch under extreme parallelized 
pressure.
   
   ---
   
   Previously, `SchemaOperator`s will ask for schema evolve permission first, 
before sending `FlushEvent`s to downstream sinks. However, Flink regards 
`FlushEvent`s as normal data records and might block it to align checkpoint 
barriers. It might cause the following deadlock situation:
   
   * `SchemaOperator` A has obtained schema evolution permission
   * `SchemaOperator` B does not get the permission and hangs
   * `SchemaOperator` A sends a `FlushEvent`, but after a checkpoint barrier
   * `SchemaOperator` B received a checkpoint barrier after the schema change 
event (which is blocked)
   
   Now, neither A nor B can post any event records to downstream, and the 
entire job blocks with the following iconic error message (in TM):
   
   ```
   0> Schema Registry is busy now, waiting for next request...
   1> Schema Registry is busy now, waiting for next request...
   2> Schema Registry is busy now, waiting for next request...
   [Repeated lines]
   ```
   
   ---
   
   This PR changes the schema evolution permission requesting workflow by:
   
   * `SchemaOperator`s emit `FlushEvent` immediately when they received a 
`SchemaChangeEvent`.
   * A schema change request could only be permitted when a) SchemaRegistry is 
IDLE and b) the requesting `SchemaOperator` has finished data flushing already.
   
   Since `FlushEvent` might be emitted from multiple `SchemaOperator`s 
simultaneously, a nonce value that is uniquely bound to a schema change event 
is added into `FlushEvent` payload. `WAITING_FOR_FLUSH` stage is no longer 
necessary since this state will not block the `SchemaRegistry` but one single 
`SchemaOperator` now.
   
   ```
   --------
   | IDLE | -------------------A
   --------                    |
      ^                        |
       C                       |
        \                      v
   ------------          ------------
   | FINISHED | <-- B -- | APPLYING |
   ------------          ------------
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to