Hi!
The Netty memory usage usually stays much lower than 2 * network memory
(that is the theoretical bound).
Netty needs memory for two buffers on the sender and on the receiver side,
per TCP connection.
Since Flink usually multiplexes many channels (which need network buffers)
through the same TCP connection, the actual Netty footprint ends up far
below that theoretical bound.
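To make that concrete, a rough back-of-the-envelope sketch (the buffer
size, buffer count, and cluster size below are illustrative assumptions,
not numbers from this thread):

# theoretical bound: 2 * network memory
#   e.g. 2048 network buffers * 32 KB segment size = 64 MB network memory
#   => ceiling of roughly 128 MB of Netty memory per TaskManager
# actual usage: 2 buffers per TCP connection on each side
#   e.g. 63 remote TaskManagers, one multiplexed connection each:
#   2 * 63 * 32 KB ~= 4 MB per side

So the Netty footprint is usually a small fraction of the configured
network memory.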
The TaskManagers were nixed by the OOM killer.

[63896.699500] Out of memory: Kill process 12892 (java) score 910 or sacrifice child
[63896.702018] Killed process 12892 (java) total-vm:47398740kB, anon-rss:28487812kB, file-rss:8kB

The cluster is composed of AWS c4.8xlarge instances, which have 60 GB of
memory split across two NUMA nodes (see the numactl output further down in
the thread).
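For reference, one way to watch the resident size of the TaskManager JVMs
against that 60 GB (plain ps, nothing Flink-specific; the head count is
arbitrary):

$ ps -C java -o pid,rss,vsz,args --sort=-rss | head -n 5

The anon-rss above (~28 GB) is the resident set the OOM killer charged to
that one JVM; with two TaskManagers of that size on a 60 GB host, the
kernel has very little headroom left.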
The logging of the TaskManager stops 3 seconds before the JobManager
detects that the connection to the TaskManager has failed. If the clocks
are remotely in sync and the TaskManager is still running, then we should
also see logging statements for the time after the connection failed.

So, is the TaskManager JVM still running after the JM detected that the TM
has gone?

If not, can you check the kernel log (dmesg) to see whether the Linux OOM
killer stopped the process? (If it's a kill, the JVM might not be able to
log anything anymore.)
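For reference, the check can be as simple as this (standard Linux tools,
nothing Flink-specific):

$ dmesg | grep -iE "out of memory|killed process"
# on systemd machines the kernel ring buffer is also available via:
$ journalctl -k | grep -i "killed process"

If the TaskManager's PID shows up there, the JVM was killed from outside
and could not have logged anything further.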
On Thu, Oct 29, 2015 at 9:27 PM, Stephan Ewen wrote:
Thanks for sharing the logs, Greg!
Okay, so the TaskManager does not crash, but the remote failure detector of
Akka marks the connection between JobManager and TaskManager as broken.

The TaskManager is not doing much GC, so it is not a long JVM freeze that
causes heartbeats to time out...

I am wondering: could it be a problem that there are two TaskManagers
running per machine?
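In case the failure detector itself turns out to be too aggressive for this
setup, the relevant timeouts can be relaxed in flink-conf.yaml. The keys
below are the ones I believe apply here; please double-check them against
the docs for your version, and treat the values as illustrative only:

akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 60 s
akka.ask.timeout: 60 s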
> On 29 Oct 2015, at 19:04, Greg Hogan wrote:
I have memory logging enabled. Tail of TaskManager log on 10.0.88.140:
17:35:26,415 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
17:35:27,415 INFO  org.apac
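For anyone reproducing this, the memory/GC logging shown above is switched
on in flink-conf.yaml; the key names below are from memory, so double-check
them against your Flink version:

taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 5000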
What does the log of the failed TaskManager 10.0.88.140 say?
On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan wrote:
I removed the use of numactl but left in starting two TaskManagers and am
still seeing TaskManagers crash.
From the JobManager log:

17:36:06,412 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, add
Hi Greg!
Interesting... When you say the TaskManagers are dropping, are the
TaskManager processes crashing, or are they losing the connection to the
JobManager?
Greetings,
Stephan
On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan wrote:
I recently discovered that AWS uses NUMA for its largest nodes. An example
c4.8xlarge:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
node 0 size: 29813 MB
node 0 free: 24537 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34
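For reference, pinning one TaskManager to each node looks roughly like this
(a sketch only; the taskmanager.sh invocation is an assumption about the
setup, not quoted from it):

$ numactl --cpunodebind=0 --membind=0 ./bin/taskmanager.sh start
$ numactl --cpunodebind=1 --membind=1 ./bin/taskmanager.sh start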
Hi Greg,
Thanks for reporting. You wrote you didn't see any output in the .out files
of the task managers. What about the .log files of these instances?
Where and when did you produce the thread dump you included?
Thanks,
Max
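For reference, the TaskManager .log files sit next to the .out files in the
log/ directory on each instance; the exact file name pattern varies between
setups, so a glob is easiest:

$ tail -n 100 log/flink-*-taskmanager-*.log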
On Thu, Oct 29, 2015 at 1:46 PM, Greg Hogan wrote:
I am testing again on a 64-node cluster (the JobManager is running fine
after reducing some operators' parallelism and fixing the string conversion
performance).

I am seeing TaskManagers drop like flies every other job or so. I am not
seeing any output in the .out log files corresponding to the crashes.