On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr <
[email protected]> wrote:

> Hi!
>
> Thank you for the hint. We will try the mitigation from the issue. We have
> already tried everything from
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact,
> but let's hope upgrading the dependency will help. I will reply to this
> thread once I get confirmation.
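>
> For reference, a minimal sketch of where such a pin would go in our setup,
> assuming the mitigation ends up being a dependency pin (the package name is
> hypothetical, and the exact grpcio version to use should come from the
> issue itself):
>
> ```python
> # setup.py installed into the SDK container / flex template image.
> # Illustrative only -- take the grpcio version guidance from
> # https://github.com/apache/beam/issues/30867 before deploying.
> import setuptools
>
> setuptools.setup(
>     name='my-streaming-pipeline',  # hypothetical package name
>     version='0.1.0',
>     install_requires=[
>         'apache-beam[gcp]==2.55.1',
>         # Pin grpcio explicitly to a known-good version here, e.g.
>         # 'grpcio==<version recommended in the issue>'.
>         'grpcio',
>     ],
> )
> ```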
>
> BTW, great job on the investigation of the bug you mentioned. Impressive.
> Seems like a nasty one.
>
Thanks. I was specifically recommending that you check the recently added
content under "It might be possible to retrieve stacktraces of a thread
that is holding the GIL on a running Dataflow worker as follows:", as that
should help you find out what is causing the stuckness in your case. But
hopefully it won't be necessary after you adjust the grpcio version.
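
In case it helps in the meantime, here is a minimal, generic sketch (not the
exact steps from that doc) for dumping the stack of every Python thread in a
running SDK harness process; `sys._current_frames()` and `faulthandler` are
standard library, and the SIGUSR1 hook is just one convenient way to trigger
the dump from inside the worker container:

```python
import signal
import sys
import threading
import traceback
import faulthandler


def dump_all_thread_stacks():
    """Print the current stack of every thread in this process to stderr.

    Helps spot a thread that is stuck, e.g. one holding the GIL.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        print('--- Thread %s (%s) ---' % (thread_id, names.get(thread_id, '?')),
              file=sys.stderr)
        traceback.print_stack(frame, file=sys.stderr)


# On Unix workers, stacks can also be dumped on demand by sending SIGUSR1
# to the process (e.g. `kill -USR1 <pid>` from inside the container):
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

You could wire this into your pipeline code temporarily (for example,
register it at import time of your main module) and trigger it when a worker
appears stuck.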



> Best,
>
> Wiśniowski Piotr
> On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:
>
> You might be running into https://github.com/apache/beam/issues/30867.
>
> Among the error messages you mentioned, the following is closer to the
> root cause: ```Error message from worker: generic::internal: Error
> encountered with the status channel: There are 10 consecutive failures
> obtaining SDK worker status info from sdk-0-0. The last success response
> was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00.
> SDK worker appears to be permanently unresponsive. Aborting the SDK. For
> more information, see:
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
>
> If mitigations in https://github.com/apache/beam/issues/30867 don't
> resolve your issue, please see
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
> for instructions on how to find what causes the workers to be stuck.
>
> Thanks!
>
> On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr <
> [email protected]> wrote:
>
>> Hi,
>> We are investigating an issue with our Python SDK streaming pipelines,
>> and have a few questions, but first some context.
>> Our stack:
>> - Python SDK 2.54.0, but we also tried 2.55.1
>> - Dataflow Streaming Engine with the SDK in a container image (we also
>> tried Prime)
>> - Currently our pipelines have low enough traffic that a single node
>> handles them most of the time, but occasionally we do scale up.
>> - Deployment via the Terraform `google_dataflow_flex_template_job`
>> resource, which normally does a job update when re-applying Terraform.
>> - We use `ReadModifyWriteStateSpec` a lot, as well as other states and
>> watermark timers, but we keep the size of the state under control (a
>> simplified sketch of the pattern follows below this list).
>> - We use custom coders (Pydantic Avro).
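>> For illustration, a simplified sketch of the kind of stateful DoFn we use
>> (names are made up and the coder is simplified; our real values use the
>> custom Pydantic Avro coders):
>> ```python
>> import apache_beam as beam
>> from apache_beam.coders import VarIntCoder
>> from apache_beam.transforms.timeutil import TimeDomain
>> from apache_beam.transforms.userstate import (
>>     ReadModifyWriteStateSpec, TimerSpec, on_timer)
>>
>>
>> class TrackLastSeenFn(beam.DoFn):
>>     """Keeps the last value per key and clears it via a watermark timer."""
>>
>>     LAST_SEEN = ReadModifyWriteStateSpec('last_seen', VarIntCoder())
>>     EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)
>>
>>     def process(self,
>>                 element,
>>                 ts=beam.DoFn.TimestampParam,
>>                 last_seen=beam.DoFn.StateParam(LAST_SEEN),
>>                 expiry=beam.DoFn.TimerParam(EXPIRY)):
>>         key, value = element  # input must be keyed, e.g. (user_id, count)
>>         last_seen.write(value)
>>         # Clear the state one hour (event time) after this element.
>>         expiry.set(ts + 3600)
>>         yield key, value
>>
>>     @on_timer(EXPIRY)
>>     def on_expiry(self, last_seen=beam.DoFn.StateParam(LAST_SEEN)):
>>         last_seen.clear()
>> ```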
>> The issue:
>> - Occasionally watermark progression stops. The issue is not
>> deterministic and happens roughly 1-2 times per day across a few pipelines.
>> - No user code errors are reported, but we do get errors like these:
>> ```INTERNAL: The work item requesting state read is no longer valid on
>> the backend. The work has already completed or will be retried. This is
>> expected during autoscaling events. [
>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
>> ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the
>> process running the pipeline code has crashed. Inspect the Worker Logs and
>> the Diagnostics tab to determine the cause of the crash. [
>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } }
>> [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
>> [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
>> ```Work item for sharding key 8dd4578b4f280f5d tokens
>> (1316764909133315359, 17766288489530478880) encountered error during
>> processing, will be retried (possibly on another worker):
>> generic::internal: Error encountered with the status channel: SDK harness
>> sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK
>> disconnect.```
>> ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation
>> fault (core dumped) restarting SDK process```
>> - We did manage to correlate this with either a vertical autoscaling event
>> (when using Prime) or other worker replacements done by Dataflow under the
>> hood, but this is not deterministic.
>> - For a few hours watermark progress stops, but other workers do
>> process messages.
>> - And after a few hours:
>> ```Error message from worker: generic::internal: Error encountered with
>> the status channel: There are 10 consecutive failures obtaining SDK worker
>> status info from sdk-0-0. The last success response was received
>> 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker
>> appears to be permanently unresponsive. Aborting the SDK. For more
>> information, see:
>> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
>> ```
>> - And the pipeline starts to catch up and the watermark progresses again.
>> - A job update via Terraform apply also fixes the issue.
>> - We do not see any extensive use of worker memory or disk. CPU
>> utilization is also close to idle most of the time. I do not think we use
>> any C/C++ code with Python, nor do we use parallelism/threads outside of
>> Beam's parallelization.
>> Questions:
>> 1. What could be the potential causes of such behavior? How can we get
>> more insight into this problem?
>> 2. I have seen `In Python pipelines, when shutting down inactive bundle
>> processors, shutdown logic can overaggressively hold the lock, blocking
>> acceptance of new work` listed as a known issue in the Beam release docs.
>> What is the status of this? Could this potentially be related?
>> I would really appreciate any help, clues, or hints on how to debug this issue.
>> Best regards
>> Wiśniowski Piotr
>>
>
