[ 
https://issues.apache.org/jira/browse/BEAM-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385676#comment-17385676
 ] 

Beam JIRA Bot commented on BEAM-11916:
--------------------------------------

This issue was marked "stale-P2" and has not received a public comment in 14 
days. It is now automatically moved to P3. If you are still affected by it, you 
can comment and move it back to P2.

> Combine failed on large PCollection of uint64 arrays
> ----------------------------------------------------
>
>                 Key: BEAM-11916
>                 URL: https://issues.apache.org/jira/browse/BEAM-11916
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-go
>    Affects Versions: 2.28.0
>         Environment: Google Dataflow
>            Reporter: Tao Liao
>            Priority: P3
>              Labels: GCP
>         Attachments: dataflow autoscaling.png
>
>
> We came across an issue with the Combine operation in the Apache Beam Go SDK 
> (v2.28.0) when running a pipeline on Google Cloud Dataflow. Source code: 
> https://github.com/le0000000/dataflow_combine
> We understand that the Go SDK is experimental, but it would be great if 
> someone could help us understand whether there is anything wrong with our 
> code, or whether there is a bug in the Go SDK or Dataflow. The issue only 
> happens when running the pipeline on Google Dataflow with a large data set. 
> We are trying to combine a _PCollection<pairedVec>_, with
> type pairedVec struct {
>     Vec1 [1048576]uint64
>     Vec2 [1048576]uint64
> }
> There are 10,000,000 items in the PCollection. After reading the input file, 
> Dataflow scheduled 1000 workers to generate the PCollection and started the 
> combine step. The number of workers then dropped to almost 1 and stayed there 
> for a very long time. Eventually the job failed with the following error log:
> 2021-03-02T06:13:40.438112597Z Workflow failed. Causes: 
> S09:CombinePerKey/CoGBK'1/Read+CombinePerKey/main.combineVecFn+CombinePerKey/main.combineVecFn/Extract+beam.dropKeyFn+main.flattenVecFn+textio.Write/beam.addFixedKeyFn+textio.Write/CoGBK/Write
>  failed., The job failed because a work item has failed 4 times. Look in 
> previous log entries for the cause of each one of the 4 failures. For more 
> information, see https://cloud.google.com/dataflow/docs/guides/common-errors. 
> The work item was attempted on these workers: 
> go-job-1-1614659244459204-03012027-u5s6-harness-q8tx Root cause: The worker 
> lost contact with the service., 
> go-job-1-1614659244459204-03012027-u5s6-harness-44hk Root cause: The worker 
> lost contact with the service., 
> go-job-1-1614659244459204-03012027-u5s6-harness-05nm Root cause: The worker 
> lost contact with the service., 
> go-job-1-1614659244459204-03012027-u5s6-harness-l22w Root cause: The worker 
> lost contact with the service.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
