Hi Michał,
You might want to get the logs of the container
(container_e01_1580289509522_0001_01_000002) to see what happened there.
Cheers,
Max
On 29.01.20 12:17, Michał Walenia wrote:
Hi there,
I want to add some load tests for core Beam operations (GBK, CoGBK,
Combine, ParDo, SideInput) on portable Flink in Java. For some reason
some of my test scenarios for Combine seem to crash the whole Flink cluster.
The PR introducing the tests is here:
https://github.com/apache/beam/pull/10386
As you can see in the logs of failing Jenkins jobs, the scenario
combining 2GB of records on 5 workers works well, whereas 2GB fanned out
on 16 workers fails with a cryptic message of either
"org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException:
CANCELLED: cancelled before receiving half close"
or
"java.util.concurrent.TimeoutException: The heartbeat of TaskManager
with id container_e01_1580289509522_0001_01_000002 timed out."
link to logs with first error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/9/console
link to logs with second error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/11/console
I wonder what the issue might be here. When I ran the same load test on
a Flink cluster created in an identical way on a separate GCP project,
everything went well, which makes me think there may be something wrong
with the Dataproc setup.
Another important note - Combine load tests work and pass with the
Python portable runner.
I will be grateful for any help you can provide - I'm not a Flink
cluster expert and I have no idea how can I change configurations so
that it works.
Thanks and have a good day!
Michal
--
Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer
M: +48 791 432 002 <tel:+48791432002>
E: [email protected] <mailto:[email protected]>
Unique Tech
Check out our projects! <https://www.polidea.com/our-work>