Hey,

Is that the whole Task Manager log? Have you checked for memory issues on both the Task Managers and the Job Manager, such as out-of-memory errors or long GC pauses, as I suggested in the first email?
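As a rough first pass over the log files, something like the sketch below could help. It assumes the timestamp format seen in the Task Manager log quoted in this thread (`2018-12-10 12:20:20,386`); the function name `scan_log` and the 10-second threshold are made up for illustration. A gap between log lines is only a hint: it may be a long GC pause, but it may also just be an idle process, so GC logging is still the reliable way to confirm.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted Task Manager log,
# e.g. "2018-12-10 12:20:20,386".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def scan_log(lines, max_gap_seconds=10.0):
    """Return (oom_lines, gaps) for an iterable of log lines.

    oom_lines: lines that mention OutOfMemoryError.
    gaps: (previous_ts, current_ts, seconds) for consecutive timestamped
          lines that are more than max_gap_seconds apart (hypothetical
          threshold, tune it for your setup).
    """
    oom_lines = []
    gaps = []
    previous = None
    for line in lines:
        if "OutOfMemoryError" in line:
            oom_lines.append(line.rstrip("\n"))
        match = TIMESTAMP.match(line)
        if not match:
            # Stack-trace or continuation line without a timestamp.
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None:
            seconds = (current - previous).total_seconds()
            if seconds > max_gap_seconds:
                gaps.append((previous, current, seconds))
        previous = current
    return oom_lines, gaps
```

You would pass it an open log file, for example `scan_log(open(path))` for each file in the Flink `log` directory, and look at both returned lists.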
After you rule out memory issues, you could capture a couple of thread dumps (`kill -3 JVM_PID` or `jstack JVM_PID`) and check whether any thread is stuck in your code.

Another potential issue: are you sure that you have a healthy network between the nodes? No packet loss, low ping, etc.?

Piotrek

> On 10 Dec 2018, at 17:44, Alieh <sae...@informatik.uni-leipzig.de> wrote:
>
> Hello,
>
> this is the task manage log but it does not change after I run the program. I think the Flink planner has problem with my program. It can not even start the job.
>
> Best,
>
> Alieh
>
> 018-12-10 12:20:20,386 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
> 2018-12-10 12:20:20,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
> 2018-12-10 12:20:20,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: alieh
> 2018-12-10 12:20:20,609 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform...
> using builtin-java classes where applicable
> 2018-12-10 12:20:20,768 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: alieh
> 2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
> 2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
> 2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /usr/lib/jvm/java-8-oracle
> 2018-12-10 12:20:20,774 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Hadoop version: 2.4.1
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml
> 2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
> 2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
> 2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /home/alieh/flink-1.6.0/conf
> 2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
> 2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
> 2018-12-10 12:20:20,777 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2018-12-10 12:20:20,785 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 1048576.
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
> 2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
> 2018-12-10 12:20:20,804 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
> 2018-12-10 12:20:20,912 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to alieh (auth:SIMPLE)
> 2018-12-10 12:20:21,131 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
> 2018-12-10 12:20:21,135 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
> 2018-12-10 12:20:21,136 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
> 2018-12-10 12:20:21,145 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address localhost/127.0.0.1:6123.
> 2018-12-10 12:20:21,204 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'alieh-P67A-D3-B3' (127.0.1.1) for communication.
> 2018-12-10 12:20:21,208 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Starting AkkaRpcService at alieh-p67a-d3-b3:0.
> 2018-12-10 12:20:21,805 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> 2018-12-10 12:20:21,898 INFO akka.remote.Remoting - Starting remoting
> 2018-12-10 12:20:22,091 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@alieh-p67a-d3-b3:44267]
> 2018-12-10 12:20:22,117 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
> 2018-12-10 12:20:22,124 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /tmp/blobStore-32ec7a05-737e-4b46-b716-3a0831683c47
> 2018-12-10 12:20:22,127 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-4b33c843-b7d3-45dc-814f-850e8c6be21a
> 2018-12-10 12:20:22,136 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: alieh-P67A-D3-B3/127.0.1.1, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
> 2018-12-10 12:20:22,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 450 GB, usable 91 GB (20.22% usable)
> 2018-12-10 12:20:22,211 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
> 2018-12-10 12:20:22,256 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
> 2018-12-10 12:20:22,256 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
> 2018-12-10 12:20:22,257 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
> 2018-12-10 12:20:22,289 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 31 ms).
> 2018-12-10 12:20:22,325 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 35 ms). Listening on SocketAddress /127.0.1.1:46127.
> 2018-12-10 12:20:22,326 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
> 2018-12-10 12:20:22,329 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-4f10dc60-3805-4c50-85a1-497c99dfb20c for spill files.
> 2018-12-10 12:20:22,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
> 2018-12-10 12:20:22,394 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
> 2018-12-10 12:20:22,406 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
> 2018-12-10 12:20:22,407 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@localhost:6123/user/resourcemanager(00000000000000000000000000000000).
> 2018-12-10 12:20:22,409 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-058052c5-36cc-432f-88eb-8acf7dc5f1f1
> 2018-12-10 12:20:22,743 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved ResourceManager address, beginning registration
> 2018-12-10 12:20:22,743 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 1 (timeout=100ms)
> 2018-12-10 12:20:22,814 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful registration at resource manager akka.tcp://flink@localhost:6123/user/resourcemanager under registration id ba9dd638db7ebccde63a3e0df420a990.
>
> On 12/10/2018 12:14 PM, Piotr Nowojski wrote:
>> Hi,
>>
>> Have you checked task managers logs?
>>
>> Piotrek
>>
>>> On 8 Dec 2018, at 12:23, Alieh <sae...@informatik.uni-leipzig.de> wrote:
>>>
>>> Hello Piotrek,
>>>
>>> thank you for your answer. I installed a Flink on a local cluster and used the GUI in order to monitor the task managers. It seems the program does not start at all. The whole time just the job manager is struggling... For very very toy examples, after a long time (during this time I see the job manager logs as I mentioned before), the job is started and can be executed in 2 seconds.
>>>
>>> Best,
>>>
>>> Alieh
>>>
>>> On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
>>>> Hi,
>>>>
>>>> Please investigate logs/standard output/error from the task manager that has failed (the logs that you showed are from job manager). Probably there is some obvious error/exception explaining why has it failed. Most common reasons:
>>>> - out of memory
>>>> - long GC pause
>>>> - seg fault or other error from some native library
>>>> - task manager killed via for example SIGKILL
>>>>
>>>> Piotrek
>>>>
>>>>> On 6 Dec 2018, at 17:34, Alieh <sae...@informatik.uni-leipzig.de> wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> I have an algorithm x () which contains several joins and usage of 3 times of gelly ConnectedComponents. The problem is that if I call x() inside a script more than three times, I receive the messages listed below in the log and the program is somehow stopped. It happens even if I run it with a toy example of a graph with less that 10 vertices. Do you have any clue what is the problem?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Alieh
>>>>>
>>>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
>>>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
>>>>> 129150 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from e80ec35f3d0a04a68000ecbdc555f98b.
>>>>> 129150 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
>>>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle timeout.
>>>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.
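PS: for the thread-dump check mentioned at the top of this mail, a small script along these lines might save some manual reading. It is only a sketch: it assumes the usual HotSpot dump layout (a quoted thread name on the header line, a `java.lang.Thread.State:` line, then `at ...` frames), and the function names and the `org.example.` package prefix are hypothetical placeholders for your own code. A thread that shows the same user-code frame in every dump is a candidate for being stuck, nothing more.

```python
import re

# Header line of a thread entry, e.g.: "main" #1 prio=5 ... (name is quoted).
HEADER = re.compile(r'^"(?P<name>[^"]*)"')
STATE = re.compile(r"java\.lang\.Thread\.State: (?P<state>\w+)")
FRAME = re.compile(r"^\s+at (?P<frame>.+)$")


def threads_in_package(dump_text, package_prefix):
    """Return (thread_name, state, frame) for each thread whose stack has a
    frame starting with package_prefix; frame is the topmost such frame."""
    threads = []  # each entry: [name, state, first matching frame]
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            threads.append([header.group("name"), None, None])
            continue
        if not threads:
            continue  # preamble before the first thread entry
        current = threads[-1]
        state = STATE.search(line)
        if state:
            current[1] = state.group("state")
            continue
        frame = FRAME.match(line)
        if (frame and current[2] is None
                and frame.group("frame").startswith(package_prefix)):
            current[2] = frame.group("frame")
    return [tuple(t) for t in threads if t[2] is not None]


def stuck_candidates(dumps, package_prefix):
    """Return sorted (thread_name, frame) pairs present in every dump."""
    seen = None
    for dump in dumps:
        current = {(name, frame)
                   for name, _state, frame in threads_in_package(dump, package_prefix)}
        seen = current if seen is None else seen & current
    return sorted(seen or set())
```

Take two or three dumps a few seconds apart, read each into a string, and call `stuck_candidates([dump1, dump2], "org.example.")` with your own package prefix.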