I forgot to mention that my jobs are all batch (at the moment).
Do you think that this problem could be related to
- http://www.evanjones.ca/java-bytebuffer-leak.html#comment-3240054880
- and http://www.evanjones.ca/java-native-leak-bug.html
Kurt also told me to add "env.java.opts: -Dio.ne
Hi Stephan,
I also think that the error is more related to netty.
The only suspicious libraries I use are Parquet and Thrift.
I'm not using off-heap memory.
What do you mean by "crazy high number of concurrent network shuffles"? How
can I count that?
We're using Java 8.
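For what it's worth, one way to see whether the growth is on the heap or in
native/direct memory is the JVM's Native Memory Tracking (plain JDK tooling,
nothing Flink-specific; the flag would go through env.java.opts, and <tm-pid>
is the TaskManager's process id). Note it only shows JVM-tracked memory, not
malloc()s done by native libraries:

  # start the TM JVM with NMT enabled (adds a small overhead)
  -XX:NativeMemoryTracking=summary

  # then, while the job runs, sample the breakdown (heap, Internal/direct, threads, ...)
  jcmd <tm-pid> VM.native_memory summary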
Thanks a lot,
Flavio
On 6 J
Hi!
I would actually be surprised if this is an issue in core Flink.
- The MaxDirectMemory parameter is pretty meaningless: it really is just a maximum
and has no impact on how much is actually allocated.
- In most of the cases reported so far, the leak was in a library that
was used in the
Hi Flavio,
can you post all the memory configuration parameters of your workers?
Did you investigate whether the direct or heap memory grew?
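By "memory configuration parameters" I mean roughly everything a grep like this
returns (exact key names vary a bit between Flink versions), plus anything you
set via env.java.opts:

  grep -E '^taskmanager\.(heap|memory|network)' conf/flink-conf.yaml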
Thanks, Fabian
2017-05-29 20:53 GMT+02:00 Flavio Pompermaier :
> Hi to all,
> I'm still trying to understand what's going on in our production Flink
>
Hi to all,
I'm still trying to understand what's going on in our production Flink cluster.
The facts are:
1. The Flink cluster runs on 5 VMWare VMs managed by ESXi
2. On a specific job we have, without limiting the direct memory to 5g,
the TM gets killed by the OS almost immediately because the memo
FYI: taskmanager.sh sets this parameter but also states the following:
# Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will
be used
TM_MAX_OFFHEAP_SIZE="8388607T"
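If I read the script correctly, that value is simply passed through to the TM's
JVM command line, roughly like this (paraphrased, not a verbatim quote of
taskmanager.sh):

  # the off-heap bound becomes the direct-memory cap of the TaskManager JVM
  export JVM_ARGS="${JVM_ARGS} -XX:MaxDirectMemorySize=${TM_MAX_OFFHEAP_SIZE}"

which would explain where the huge -XX:MaxDirectMemorySize value comes from.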
Nico
On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
> Hi Flavio,
>
> Is this running on
Hi Flavio,
Is this running on YARN or bare metal? Did you manage to find out where this
insanely large parameter is coming from?
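You can double-check what the running TaskManager JVM actually received with
standard JDK/Linux tools (<tm-pid> being the TaskManager's process id):

  # show the full command line, including any -XX:MaxDirectMemorySize
  ps -o command= -p <tm-pid>

  # or query the effective flag value directly
  jinfo -flag MaxDirectMemorySize <tm-pid>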
Best,
Aljoscha
> On 25. May 2017, at 19:36, Flavio Pompermaier wrote:
>
> Hi to all,
> I think we found the root cause of all the problems. Looking at dmesg there
Hi to all,
I think we found the root cause of all the problems. Looking at dmesg there
was a "crazy" total-vm size associated with the OOM error, a LOT bigger
than the TaskManager's available memory.
In our case, the TM had a max heap of 14 GB while the dmesg error was
reporting a required amount
I can confirm that after giving less memory to the Flink TM the job was
able to run successfully.
After almost 2 weeks of pain, we summarize here our experience with Flink in
virtualized environments (such as VMware ESXi):
1. Disable the virtualization "feature" that transfers a VM from a (heavy
Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill
process 13574 (java)".
This is really strange because the JVM of the TM is very calm.
Moreover, there are 7 GB of memory available (out of 32) but somehow the OS
decides to start swapping and, when it runs out of available swap
Hi Greg,
I carefully monitored all the TM memory with jstat -gcutil and there's no full
GC, only young GCs.
The initial situation on the dying TM is:
  S0      S1      E      O      M      CCS    YGC   YGCT   FGC  FGCT   GCT
  0.00  100.00  33.57  88.74  98.42  97.17    159  2.508     1  0.255  2.763
0.00 1
Hi Flavio,
Flink handles interrupts so the only silent killer I am aware of is Linux's
OOM killer. Are you seeing such a message in dmesg?
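Something along these lines on the affected machine should show it if the kernel
killed the process (the exact wording differs between kernel versions):

  dmesg | grep -iE 'out of memory|killed process'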
Greg
On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier
wrote:
> Hi to all,
> I'd like to know whether memory swapping could cause a taskmanager crash.
>
Hi to all,
I'd like to know whether memory swapping could cause a taskmanager crash.
In my cluster of virtual machines I'm seeing this strange behavior in my
Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
machine) dies unexpectedly without any log about the error.
Is that