Hi,

I am running a TensorFlow Extended (TFX) pipeline that uses Apache Beam
for data processing with the Flink runner (essentially a batch job on a
Flink 1.13.6 session cluster on Kubernetes), but the job (for gathering
statistics) gets stuck.
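
For context, the pipeline passes Beam roughly these options to reach the
Flink session cluster (a simplified sketch, not the exact code; the master
address and environment settings are placeholders):

    # Simplified sketch: Beam pipeline args pointing at the Flink session cluster.
    # The master address and environment settings below are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    beam_pipeline_args = [
        "--runner=FlinkRunner",
        "--flink_master=flink-jobmanager:8081",  # REST endpoint of the session cluster
        "--flink_version=1.13",
        "--environment_type=EXTERNAL",           # how the SDK harness is deployed (assumption)
        "--environment_config=localhost:50000",
    ]
    options = PipelineOptions(beam_pipeline_args)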

There is nothing significant in the JobManager or TaskManager logs. The
only thing that might explain why the task is stuck is this thread dump:

"MapPartition (MapPartition at [14]{TFXIORead[train],
GenerateStatistics[train]}) (1/32)#0" Id=188 WAITING on
java.util.concurrent.CompletableFuture$Signaller@6f078632
    at sun.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.CompletableFuture$Signaller@6f078632
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(
CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323
)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture
.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:
1908)
    at org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:60)
    at org.apache.beam.runners.fnexecution.control.
SdkHarnessClient$BundleProcessor$ActiveBundle.close(SdkHarnessClient.java:
504)
    ...
I use a parallelism of 32. Each TaskManager runs in its own container with
1 CPU, a total process memory of 20 GB, and 1 task slot.
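
In flink-conf.yaml terms that is roughly the following (the CPU request
itself comes from the Kubernetes pod resources; the keys are listed only to
make the setup concrete):

    taskmanager.memory.process.size: 20g   # total process memory per TM
    taskmanager.numberOfTaskSlots: 1       # one slot per TM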

The job fails with ~100 files totaling about 100 GB. If I run the pipeline
with a smaller number of files, it completes fine.

I need Flink to be able to process varying amounts of data, since it can
scale by automatically adding pods based on the parallelism configured for
the specific job (I set the parallelism to max(number of files, 32)).
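
Concretely, building on the options sketch above, the parallelism is
derived roughly like this (the file count is a placeholder for however the
input files are listed; --parallelism is the standard Beam option for the
Flink runner):

    # Hedged sketch: scale the job's parallelism with the number of input files.
    num_input_files = 100                           # e.g. len(input_files); ~100 files in this run
    parallelism = max(num_input_files, 32)          # grow parallelism with the input size
    beam_pipeline_args.append(f"--parallelism={parallelism}")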

Thanks,
Gorjan
