[ https://issues.apache.org/jira/browse/BEAM-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385676#comment-17385676 ]
Beam JIRA Bot commented on BEAM-11916:
--------------------------------------

This issue was marked "stale-P2" and has not received a public comment in 14 days. It is now automatically moved to P3. If you are still affected by it, you can comment and move it back to P2.

> Combine failed on large PCollection of uint64 arrays
> ----------------------------------------------------
>
>                 Key: BEAM-11916
>                 URL: https://issues.apache.org/jira/browse/BEAM-11916
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-go
>    Affects Versions: 2.28.0
>         Environment: Google Dataflow
>            Reporter: Tao Liao
>            Priority: P3
>              Labels: GCP
>         Attachments: dataflow autoscaling.png
>
>
> We came across an issue with the Combine operation with Apache Beam Go SDK
> (v2.28.0), when running a pipeline on Google Cloud Dataflow. Source code:
> https://github.com/le0000000/dataflow_combine
> We understand that the Go SDK is experimental but it would be great if
> someone can help us understand if there’s anything wrong with our code, or if
> there's a bug in the Go SDK or Dataflow. The issue only happens when running
> the pipeline with Google Dataflow, with some large data set. We are trying to
> combine a _PCollection<pairedVec>_, with
> _type pairedVec struct {_
> _Vec1 [1048576]uint64_
> _Vec2 [1048576]uint64_
> _}_
> There are 10,000,000 items in the PCollection. After reading the input file,
> Dataflow scheduled 1000 workers to generate the PCollection, and started to
> do the combination. Then the worker number reduced to almost 1 and lasted for
> a very long time. Eventually the job failed with the following error log:
> 2021-03-02T06:13:40.438112597ZWorkflow failed. Causes:
> S09:CombinePerKey/CoGBK'1/Read+CombinePerKey/main.combineVecFn+CombinePerKey/main.combineVecFn/Extract+beam.dropKeyFn+main.flattenVecFn+textio.Write/beam.addFixedKeyFn+textio.Write/CoGBK/Write
> failed., The job failed because a work item has failed 4 times. Look in
> previous log entries for the cause of each one of the 4 failures. For more
> information, see https://cloud.google.com/dataflow/docs/guides/common-errors.
> The work item was attempted on these workers:
> go-job-1-1614659244459204-03012027-u5s6-harness-q8tx Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-44hk Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-05nm Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-l22w Root cause: The worker
> lost contact with the service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)