Created an issue for this: https://issues.apache.org/jira/browse/BEAM-11251
On 11.11.20 19:09, Aljoscha Krettek wrote:
Hi,
nice work on debugging this!
We need the synchronized block in the source because the call to
reader.advance() (via the invoker) and reader.getCurrent() (via
emitElement()) must be atomic with respect to state: we cannot advance
the reader state, not emit that record, but still take a checkpoint,
because that record would then be lost on restore.
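For anyone following along, this is roughly the pattern in question. A
minimal sketch (simplified names, not the exact Beam code) of how the
wrapper holds Flink's checkpoint lock across advance() and emit:

    import org.apache.beam.sdk.io.UnboundedSource;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    // Sketch of the advance/emit loop in Beam's UnboundedSourceWrapper
    // (simplified; reader.start() is assumed to have been called already).
    class EmitLoopSketch<T> {
        private volatile boolean running = true;

        void emitLoop(
                SourceFunction.SourceContext<T> ctx,
                UnboundedSource.UnboundedReader<T> reader) throws Exception {
            final Object checkpointLock = ctx.getCheckpointLock();
            while (running) {
                // advance() and getCurrent()/emit must be atomic with
                // respect to state, so both run under the checkpoint lock.
                // A checkpoint takes the same lock, so it can never observe
                // a reader that advanced past a not-yet-emitted record.
                synchronized (checkpointLock) {
                    if (reader.advance()) {
                        ctx.collect(reader.getCurrent()); // emitElement()
                    }
                }
            }
        }
    }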
Hi Josson,
Thanks for the great investigation and for coming back to us. Aljoscha,
could you help us here? It looks like you were involved in the original
BEAM-3087 issue.
Best,
Piotrek
Fri, 23 Oct 2020 at 07:36 Josson Paul wrote:
@Piotr Nowojski @Nico Kruber
An update.
I was able to figure out the problematic code. A change in the Apache
Beam code is causing this problem.
Beam introduced a lock on the "emit" in the unbounded source. The lock is
Flink's checkpoint lock. Now the same lock is used by Flink's timer
service.
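To make the contention concrete, here is a toy model (assumed names, not
actual Flink or Beam code): one thread holding the shared lock for every
emitted element competes with a timer thread that needs the same lock to
fire its callbacks.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SharedLockContention {
        public static void main(String[] args) {
            final Object checkpointLock = new Object();

            // "Source" thread: takes the lock for every single emit,
            // as the Beam wrapper now does.
            Thread source = new Thread(() -> {
                while (true) {
                    synchronized (checkpointLock) {
                        // advance() + getCurrent() + collect() would run here
                    }
                }
            });
            source.setDaemon(true);
            source.start();

            // "Timer" thread: must acquire the same lock before firing a
            // callback, so a busy source delays (or starves) every timer.
            ScheduledExecutorService timers =
                    Executors.newSingleThreadScheduledExecutor();
            timers.scheduleAtFixedRate(() -> {
                synchronized (checkpointLock) {
                    System.out.println("timer fired at " + System.nanoTime());
                }
            }, 0, 100, TimeUnit.MILLISECONDS);
        }
    }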
Hi Josson,
The TM logs that you attached cover only a five-minute period. Are you
sure they encompass the period both before and after the potential
failure? It would also be nice if you could provide the logs matching the
charts (like the one you were providing in …
Hi Josson,
Have you checked the logs as Nico suggested? At 18:55 there is a dip in
non-heap memory, just about when the problems started happening. Maybe you
could post the TM logs?
Have you tried updating the JVM to a newer version?
Also, it looks like the heap size is the same between 1.4 and 1.8, but …
What looks a bit strange to me is that with a running job, the
SystemProcessingTimeService should actually not be collected (since it is
still in use)!
My guess is that something is indeed happening during that time frame
(maybe job restarts?), and I would propose checking your logs for anything …
Hi Josson,
Thanks again for the detailed answer, and sorry that I cannot help you
with an immediate answer. I presume the JVM args for 1.8 are the same?
Can you maybe post what exactly crashed in your cases a) and b)?
Re c), in the previously attached Word document, it looks like Flink was …
Hi Josson,
Thanks for getting back.
What are the JVM settings, and in particular the GC settings, that you are
using (G1GC?)?
It could also be that in 1.4 you were just slightly below the threshold of
GC issues, while in 1.8 something is using a bit more memory, causing the
GC issues to appear.
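If it helps narrow this down, one way is to make the GC choice and GC
logging explicit in flink-conf.yaml and compare the logs between 1.4 and
1.8. These flags are only a suggestion for a JDK 8 setup (the log path is
a placeholder):

    # flink-conf.yaml (JDK 8 flags; adjust the log path for your setup)
    env.java.opts: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log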
Hi Piotr,
2) SystemProcessingTimeService holds the HeapKeyedStateBackend, and
HeapKeyedStateBackend has a lot of objects, and that is filling the heap.
3) I am not using the Flink Kafka connector, but we are using the Apache
Beam Kafka connector. There is a change in the Apache Beam version, but
the Kafka …
Hi Josson,
2. Are you sure that all/the vast majority of those objects are pointing
to SystemProcessingTimeService? And is this really the problem of
those objects? Are they taking up that much memory?
3. It could still be Kafka's problem, as it's likely that between 1.4 and
1.8.x we bumped the Kafka …
1) We are in the process of migrating to Flink 1.11. But it is going to
take a while before we can make everything work with the latest version.
Meanwhile, since this is happening in production, I am trying to solve this.
2) The Finalizer class is pointing
to org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService
Hi,
Have you tried using a more recent Flink version? 1.8.x is no longer
supported, and the latest versions might not have this issue anymore.
Secondly, have you tried backtracking those references to the Finalizers?
Assuming that Finalizer is indeed the class causing problems.
Also, it may well be a …
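For the backtracking itself, a possible recipe using standard JDK tooling
(nothing Flink-specific; <taskmanager-pid> is a placeholder): check
whether java.lang.ref.Finalizer instances are actually accumulating, then
take a live heap dump and follow the referents' paths to GC roots in a
tool like Eclipse MAT.

    # Histogram first: are Finalizer instances really piling up?
    jmap -histo:live <taskmanager-pid> | grep -i finalizer

    # Then a full dump; open it in Eclipse MAT and use
    # "Path to GC Roots" on a Finalizer's referent.
    jmap -dump:live,format=b,file=tm-heap.hprof <taskmanager-pid>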