Hey,

Is that the whole Task Manager log? Have you checked for memory issues on both 
the Task Managers and the Job Manager, such as out of memory errors or long GC 
pauses, as I suggested in the first email?
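
As a rough sketch of the memory check, assuming Python is available on the 
machine: the helper below (`find_memory_errors` is a name made up for this 
example, it is not part of Flink) scans log lines for the `OutOfMemoryError` 
messages that the JVM prints. Long GC pauses usually only show up when GC 
logging is enabled, so this does not cover them.

```python
# Substrings the HotSpot JVM prints on memory exhaustion.
# Extend this list as needed; it is not exhaustive.
MEMORY_PATTERNS = (
    "java.lang.OutOfMemoryError",
    "GC overhead limit exceeded",
)


def find_memory_errors(lines):
    """Return (line_number, text) pairs for lines mentioning a memory error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern in line for pattern in MEMORY_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_memory_errors(handle)
```

Calling `scan_file(...)` on each Task Manager and Job Manager log (and on the 
`.out` files next to them) should be enough to see whether any of them hit an 
`OutOfMemoryError`.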

After you rule out memory issues, you could capture a couple of thread dumps 
(`kill -3 JVM_PID` or `jstack JVM_PID`) and check whether any thread is stuck 
in your code.
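
If reading the dumps by hand gets tedious, something like the following 
sketch might help. It assumes the dumps were saved as text in the usual 
`jstack` layout (a quoted thread name on the header line, then tab-indented 
`at ...` frames), and the package prefix passed in is a placeholder for 
wherever your own classes live. The function names are made up for this 
example. A thread whose topmost frame inside your package is identical in 
every dump is a candidate for being stuck.

```python
import re

# Header line of a thread in a jstack-style dump: "thread-name" #12 prio=5 ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Stack frame line: <whitespace>at some.package.Class.method(File.java:123)
FRAME = re.compile(r"^\s+at (?P<frame>.+?)\s*$")


def frames_by_thread(dump_text):
    """Map each thread name to its stack frames, topmost first.

    If two threads share a name, the later one overwrites the earlier one.
    """
    threads = {}
    current = None
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            threads[current] = []
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            threads[current].append(frame.group("frame"))
    return threads


def stuck_in_package(dumps, package_prefix):
    """Return {thread: frame} for threads whose topmost frame inside
    `package_prefix` is the same in every dump."""
    candidates = None
    for dump in dumps:
        seen = {}
        for name, frames in frames_by_thread(dump).items():
            hit = next((f for f in frames if f.startswith(package_prefix)), None)
            if hit is not None:
                seen[name] = hit
        if candidates is None:
            candidates = seen
        else:
            candidates = {n: f for n, f in candidates.items() if seen.get(n) == f}
    return candidates or {}
```

Taking the dumps a few seconds apart matters here: a single dump only shows 
where a thread is, not that it stays there.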

Another potential issue: are you sure that you have a healthy network between 
the nodes? No packet loss, low ping, etc.?

Piotrek

> On 10 Dec 2018, at 17:44, Alieh <sae...@informatik.uni-leipzig.de> wrote:
> 
> Hello,
> 
> this is the task manager log, but it does not change after I run the program.  
> I think the Flink planner has a problem with my program. It cannot even start 
> the job.
> 
> Best,
> 
> Alieh
> 
> 
> 
> 2018-12-10 12:20:20,386 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - 
> --------------------------------------------------------------------------------
> 2018-12-10 12:20:20,387 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Starting 
> TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
> 2018-12-10 12:20:20,387 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  OS current 
> user: alieh
> 2018-12-10 12:20:20,609 WARN  org.apache.hadoop.util.NativeCodeLoader         
>               - Unable to load native-hadoop library for your platform... 
> using builtin-java classes where applicable
> 2018-12-10 12:20:20,768 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Current 
> Hadoop/Kerberos user: alieh
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JVM: Java 
> HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Maximum heap 
> size: 922 MiBytes
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JAVA_HOME: 
> /usr/lib/jvm/java-8-oracle
> 2018-12-10 12:20:20,774 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Hadoop 
> version: 2.4.1
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JVM Options:
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> -XX:+UseG1GC
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Xms922M
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Xmx922M
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> -XX:MaxDirectMemorySize=8388607T
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> -Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> -Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties 
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> -Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml 
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Program 
> Arguments:
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> --configDir
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     
> /home/alieh/flink-1.6.0/conf
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Classpath: 
> /home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - 
> --------------------------------------------------------------------------------
> 2018-12-10 12:20:20,777 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2018-12-10 12:20:20,785 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Maximum 
> number of open file descriptors is 1048576.
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: jobmanager.rpc.address, localhost
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: jobmanager.rpc.port, 6123
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: jobmanager.heap.size, 1024m
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: taskmanager.heap.size, 1024m
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: taskmanager.numberOfTaskSlots, 1
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: parallelism.default, 1
> 2018-12-10 12:20:20,804 INFO  
> org.apache.flink.configuration.GlobalConfiguration            - Loading 
> configuration property: rest.port, 8081
> 2018-12-10 12:20:20,912 INFO  
> org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user 
> set to alieh (auth:SIMPLE)
> 2018-12-10 12:20:21,131 WARN  org.apache.flink.configuration.Configuration    
>               - Config uses deprecated configuration key 
> 'jobmanager.rpc.address' instead of proper key 'rest.address'
> 2018-12-10 12:20:21,135 INFO  
> org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to 
> select the network interface and address to use by connecting to the leading 
> JobManager.
> 2018-12-10 12:20:21,136 INFO  
> org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager 
> will try to connect for 10000 milliseconds before falling back to heuristics
> 2018-12-10 12:20:21,145 INFO  org.apache.flink.runtime.net.ConnectionUtils    
>               - Retrieved new target address localhost/127.0.0.1:6123.
> 2018-12-10 12:20:21,204 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - TaskManager 
> will use hostname/address 'alieh-P67A-D3-B3' (127.0.1.1) for communication.
> 2018-12-10 12:20:21,208 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Starting 
> AkkaRpcService at alieh-p67a-d3-b3:0.
> 2018-12-10 12:20:21,805 INFO  akka.event.slf4j.Slf4jLogger                    
>               - Slf4jLogger started
> 2018-12-10 12:20:21,898 INFO  akka.remote.Remoting                            
>               - Starting remoting
> 2018-12-10 12:20:22,091 INFO  akka.remote.Remoting                            
>               - Remoting started; listening on addresses 
> :[akka.tcp://flink@alieh-p67a-d3-b3:44267]
> 2018-12-10 12:20:22,117 INFO  
> org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics 
> reporter configured, no metrics will be exposed/reported.
> 2018-12-10 12:20:22,124 INFO  
> org.apache.flink.runtime.blob.PermanentBlobCache              - Created BLOB 
> cache storage directory /tmp/blobStore-32ec7a05-737e-4b46-b716-3a0831683c47
> 2018-12-10 12:20:22,127 INFO  
> org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB 
> cache storage directory /tmp/blobStore-4b33c843-b7d3-45dc-814f-850e8c6be21a
> 2018-12-10 12:20:22,136 INFO  
> org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig 
> [server address: alieh-P67A-D3-B3/127.0.1.1, server port: 0, ssl enabled: 
> false, memory segment size (bytes): 32768, transport type: NIO, number of 
> server threads: 1 (manual), number of client threads: 1 (manual), server 
> connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, 
> send/receive buffer size (bytes): 0 (use Netty's default)]
> 2018-12-10 12:20:22,166 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Temporary 
> file directory '/tmp': total 450 GB, usable 91 GB (20.22% usable)
> 2018-12-10 12:20:22,211 INFO  
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 102 
> MB for network buffer pool (number of memory segments: 3278, bytes per 
> segment: 32768).
> 2018-12-10 12:20:22,256 INFO  
> org.apache.flink.runtime.query.QueryableStateUtils            - Could not 
> load Queryable State Client Proxy. Probable reason: 
> flink-queryable-state-runtime is not in the classpath. To enable Queryable 
> State, please move the flink-queryable-state-runtime jar from the opt to the 
> lib folder.
> 2018-12-10 12:20:22,256 INFO  
> org.apache.flink.runtime.query.QueryableStateUtils            - Could not 
> load Queryable State Server. Probable reason: flink-queryable-state-runtime 
> is not in the classpath. To enable Queryable State, please move the 
> flink-queryable-state-runtime jar from the opt to the lib folder.
> 2018-12-10 12:20:22,257 INFO  
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Starting the 
> network environment and its components.
> 2018-12-10 12:20:22,289 INFO  
> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful 
> initialization (took 31 ms).
> 2018-12-10 12:20:22,325 INFO  
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful 
> initialization (took 35 ms). Listening on SocketAddress /127.0.1.1:46127.
> 2018-12-10 12:20:22,326 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Limiting 
> managed memory to 0.7 of the currently free heap space (640 MB), memory will 
> be allocated lazily.
> 2018-12-10 12:20:22,329 INFO  
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager 
> uses directory /tmp/flink-io-4f10dc60-3805-4c50-85a1-497c99dfb20c for spill 
> files.
> 2018-12-10 12:20:22,387 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages 
> have a max timeout of 10000 ms
> 2018-12-10 12:20:22,394 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC 
> endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at 
> akka://flink/user/taskmanager_0 .
> 2018-12-10 12:20:22,406 INFO  
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job 
> leader service.
> 2018-12-10 12:20:22,407 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to 
> ResourceManager 
> akka.tcp://flink@localhost:6123/user/resourcemanager(00000000000000000000000000000000).
> 2018-12-10 12:20:22,409 INFO  org.apache.flink.runtime.filecache.FileCache    
>               - User file cache uses directory 
> /tmp/flink-dist-cache-058052c5-36cc-432f-88eb-8acf7dc5f1f1
> 2018-12-10 12:20:22,743 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved 
> ResourceManager address, beginning registration
> 2018-12-10 12:20:22,743 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration 
> at ResourceManager attempt 1 (timeout=100ms)
> 2018-12-10 12:20:22,814 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful 
> registration at resource manager 
> akka.tcp://flink@localhost:6123/user/resourcemanager under registration id 
> ba9dd638db7ebccde63a3e0df420a990.
> 
> On 12/10/2018 12:14 PM, Piotr Nowojski wrote:
>> Hi,
>> 
>> Have you checked the task manager logs?
>> 
>> Piotrek
>> 
>>> On 8 Dec 2018, at 12:23, Alieh <sae...@informatik.uni-leipzig.de> wrote:
>>> 
>>> Hello Piotrek,
>>> 
>>> thank you for your answer. I installed Flink on a local cluster and used 
>>> the GUI to monitor the task managers. It seems the program does not start 
>>> at all. The whole time, only the job manager is struggling... For very 
>>> small toy examples, after a long time (during which I see the job manager 
>>> logs I mentioned before), the job is started and can be executed in 
>>> 2 seconds.
>>> 
>>> Best,
>>> 
>>> Alieh
>>> 
>>> 
>>> On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
>>>> Hi,
>>>> 
>>>> Please investigate the logs/standard output/error from the task manager 
>>>> that has failed (the logs that you showed are from the job manager). There 
>>>> is probably some obvious error/exception explaining why it has failed. The 
>>>> most common reasons:
>>>> - out of memory
>>>> - long GC pause
>>>> - seg fault or other error from some native library
>>>> - task manager killed via, for example, SIGKILL
>>>> 
>>>> Piotrek
>>>> 
>>>>> On 6 Dec 2018, at 17:34, Alieh <sae...@informatik.uni-leipzig.de> wrote:
>>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I have an algorithm x() which contains several joins and uses Gelly 
>>>>> ConnectedComponents three times. The problem is that if I call x() 
>>>>> inside a script more than three times, I receive the messages listed 
>>>>> below in the log and the program somehow stops. It happens even if I 
>>>>> run it with a toy example of a graph with fewer than 10 vertices. Do you 
>>>>> have any clue what the problem is?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Alieh
>>>>> 
>>>>> 
>>>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>>>> Trigger heartbeat request.
>>>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>>>> Trigger heartbeat request.
>>>>> 129150 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>>>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat 
>>>>> request from e80ec35f3d0a04a68000ecbdc555f98b.
>>>>> 129150 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>>>> Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>>>> Received new slot report from TaskManager 
>>>>> 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - 
>>>>> Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
>>>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - 
>>>>> Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded 
>>>>> the idle timeout.
>>>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>>>> Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.
>>>>> 
>>> 
>> 
> 
