On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr <[email protected]> wrote:
> Hi!
>
> Thank you for the hint. We will try the mitigation from the issue. We had
> already tried everything from
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact,
> but let's hope upgrading the dependency will help. I will reply to this
> thread once I get confirmation.
>
> BTW, great job on the investigation of the bug that you mentioned.
> Impressive. Seems like a nasty one.

Thanks. I was specifically recommending that you check the recently added
content under "It might be possible to retrieve stacktraces of a thread that
is holding the GIL on a running Dataflow worker as follows:", as that should
help you find out what is causing the stuckness in your case. But hopefully
it won't be necessary after you adjust the grpcio version.
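In case it is useful while you look at that section, below is a rough
in-process alternative that relies only on the standard library's
faulthandler module. It is an illustrative sketch rather than the procedure
from the linked page; the SIGUSR1 choice and the watchdog interval are
arbitrary. Because faulthandler's handlers are implemented in C, they can
print every thread's stack even while one thread is holding the GIL.

```
import faulthandler
import signal
import sys

# Illustrative sketch: run this in code that executes at SDK worker startup,
# e.g. a module imported by the pipeline's main module.

# After this call, `kill -USR1 <sdk harness pid>` (run inside the SDK
# container) writes the current stack of every Python thread to stderr,
# which should surface in the worker logs.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Optional watchdog: dump all thread stacks every 10 minutes, so a stuck
# worker leaves traces in the logs without manual intervention.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
```

Separately, once you pin grpcio, logging `grpc.__version__` from a worker
(or running `pip list` inside the SDK container) is a quick way to confirm
that the version you intended is the one actually running.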
> Best,
>
> Wiśniowski Piotr
>
> On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:
>
> You might be running into https://github.com/apache/beam/issues/30867.
>
> Among the error messages you mentioned, the following is closer to the
> root cause:
>
> ```Error message from worker: generic::internal: Error encountered with
> the status channel: There are 10 consecutive failures obtaining SDK worker
> status info from sdk-0-0. The last success response was received
> 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker
> appears to be permanently unresponsive. Aborting the SDK. For more
> information, see:
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
>
> If the mitigations in https://github.com/apache/beam/issues/30867 don't
> resolve your issue, please see
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
> for instructions on how to find out what causes the workers to be stuck.
>
> Thanks!
>
> On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr
> <[email protected]> wrote:
>
>> Hi,
>>
>> We are investigating an issue with our Python SDK streaming pipelines and
>> have a few questions, but first some context.
>>
>> Our stack:
>> - Python SDK 2.54.0, but we also tried 2.55.1.
>> - Dataflow Streaming Engine with the SDK in a container image (we also
>>   tried Prime).
>> - Currently our pipelines have low enough traffic that a single node
>>   handles it most of the time, but occasionally we do scale up.
>> - Deployment via the Terraform `google_dataflow_flex_template_job`
>>   resource, which normally performs a job update when re-applying
>>   Terraform.
>> - We use `ReadModifyWriteStateSpec` a lot, along with other states and
>>   watermark timers, but we keep the size of the state under control (a
>>   minimal sketch of this pattern appears at the bottom of this thread).
>> - We use custom coders, such as Pydantic Avro.
>>
>> The issue:
>> - Occasionally watermark progression stops. The issue is not
>>   deterministic and happens roughly 1-2 times per day across a few
>>   pipelines.
>> - No user code errors are reported, but we do get errors like these:
>>
>> ```INTERNAL: The work item requesting state read is no longer valid on
>> the backend. The work has already completed or will be retried. This is
>> expected during autoscaling events.
>> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
>>
>> ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the
>> process running the pipeline code has crashed. Inspect the Worker Logs
>> and the Diagnostics tab to determine the cause of the crash.
>> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } }
>> [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
>> [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
>>
>> ```Work item for sharding key 8dd4578b4f280f5d tokens
>> (1316764909133315359, 17766288489530478880) encountered error during
>> processing, will be retried (possibly on another worker):
>> generic::internal: Error encountered with the status channel: SDK harness
>> sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK
>> disconnect.```
>>
>> ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation
>> fault (core dumped) restarting SDK process```
>>
>> - We did manage to correlate this with either a vertical autoscaling
>>   event (when using Prime) or other worker replacements done by Dataflow
>>   under the hood, but this is not deterministic.
>> - For a few hours watermark progress stops, while other workers keep
>>   processing messages.
>> - And after a few hours:
>>
>> ```Error message from worker: generic::internal: Error encountered with
>> the status channel: There are 10 consecutive failures obtaining SDK
>> worker status info from sdk-0-0. The last success response was received
>> 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker
>> appears to be permanently unresponsive. Aborting the SDK. For more
>> information, see:
>> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
>>
>> - Then the pipeline starts to catch up and the watermark progresses
>>   again.
>> - A job update via Terraform apply also fixes the issue.
>> - We do not see any excessive use of worker memory or disk. CPU
>>   utilization is also close to idle most of the time. I do not think we
>>   use C/C++ code with Python, nor parallelism/threads outside Beam's
>>   parallelization.
>>
>> Questions:
>> 1. What could be the potential causes of such behavior? How can we get
>>    more insight into this problem?
>> 2. I have seen `In Python pipelines, when shutting down inactive bundle
>>    processors, shutdown logic can overaggressively hold the lock,
>>    blocking acceptance of new work` listed as a known issue in the Beam
>>    release notes. What is the status of this? Can it potentially be
>>    related?
>>
>> We really appreciate any help, clues or hints on how to debug this issue.
>>
>> Best regards,
>> Wiśniowski Piotr
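For context on the state and timer usage mentioned in the original message
above, a stateful DoFn of that kind (a per-key `ReadModifyWriteStateSpec`
plus a watermark timer that clears it) looks roughly like the minimal sketch
below; the `TrackLastValue` name, the one-hour expiry, and the integer coder
are illustrative assumptions, not details of the actual pipeline.

```
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)


class TrackLastValue(beam.DoFn):
    """Keeps the last value seen per key and clears it with a watermark timer."""

    LAST_VALUE = ReadModifyWriteStateSpec('last_value', VarIntCoder())
    EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)

    def process(self,
                element,  # expects (key, value) pairs
                timestamp=beam.DoFn.TimestampParam,
                last_value=beam.DoFn.StateParam(LAST_VALUE),
                expiry=beam.DoFn.TimerParam(EXPIRY)):
        key, value = element
        last_value.write(value)       # overwrite the single stored value for this key
        expiry.set(timestamp + 3600)  # re-arm the timer one hour past this element
        yield key, value

    @on_timer(EXPIRY)
    def on_expiry(self, last_value=beam.DoFn.StateParam(LAST_VALUE)):
        last_value.clear()            # drop the state so its size stays bounded
```

It would be applied to a keyed PCollection, e.g.
`keyed_values | beam.ParDo(TrackLastValue())`.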
