Hi Gorjan,
+user@beam <u...@beam.apache.org>
The trace you posted is just waiting for a bundle to finish in the SDK
harness, so I would suspect the problem is in the harness itself. Did
you look for possible errors in its logs?
Jan
On 5/31/22 13:54, Gorjan Todorovski wrote:
Hi,
I am running a TensorFlow Extended (TFX) pipeline that uses Apache
Beam for data processing, which in turn runs on the Flink runner
(basically a batch job on a Flink 1.13.6 session cluster on
Kubernetes), but the job (for gathering stats) gets stuck.
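For context, the Beam pipeline flags for such a setup usually look roughly like the sketch below. This is a minimal illustration, not the thread's actual configuration: the flink_master address, environment type, and harness endpoint are all hypothetical placeholders.

```python
# Minimal sketch of Beam pipeline flags for a Flink session cluster.
# The flink_master address, environment_type, and environment_config
# values are assumptions for illustration, not taken from this thread.
beam_pipeline_args = [
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",  # hypothetical service address
    "--parallelism=32",                      # matches the parallelism used here
    "--environment_type=EXTERNAL",           # assumed SDK harness setup
    "--environment_config=localhost:50000",  # hypothetical harness endpoint
]
print(beam_pipeline_args)
```

In TFX these args are typically passed through to the underlying Beam pipeline by the orchestrator.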
There is nothing significant in the Job Manager or Task Manager logs.
The only thing that might hint at why the task is stuck is this
thread dump:
"MapPartition (MapPartition at [14]{TFXIORead[train], GenerateStatistics[train]}) (1/32)#0" Id=188 WAITING on java.util.concurrent.CompletableFuture$Signaller@6f078632
    at sun.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.CompletableFuture$Signaller@6f078632
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:60)
    at org.apache.beam.runners.fnexecution.control.SdkHarnessClient$BundleProcessor$ActiveBundle.close(SdkHarnessClient.java:504)
    ...
I use a parallelism of 32. Task managers are configured so that each
TM runs in one container with 1 CPU and total process memory set to
20 GB. Each TM runs 1 task slot.
The job fails with ~100 files with a total size of about 100 GB. If I
run the pipeline on a smaller number of files, it runs OK.
I need Flink to be able to process varying amounts of data, since it
can scale by automatically adding pods depending on the parallelism
setting of the specific job (I set the parallelism to
max(number of files, 32)).
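The parallelism rule described above can be sketched as follows (the helper name is hypothetical, and the rule is taken as stated in the thread):

```python
def job_parallelism(num_files: int) -> int:
    """Parallelism rule as stated above: max(number of files, 32).

    The function name is a hypothetical helper for illustration only.
    """
    return max(num_files, 32)

print(job_parallelism(100))  # 100 for the ~100-file run described here
print(job_parallelism(10))   # 32, the floor for small inputs
```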
Thanks,
Gorjan