On 23 Jun 2015, at 13:53, Stephan Ewen <se...@apache.org> wrote:

> Currently, Flink does not cache anything across runs, except JAR files on the
> workers.
>
> The reason the first run is slower may be:
> - Because in the first run, code is distributed in the cluster. In
>   subsequent runs, the JAR files need not be redistributed.
> - Because the JIT takes a bit to kick in and compile code in the first run.
>   In subsequent runs, the code is already JIT-ted.
>
> The system should not freeze after 100 runs. Can you tell us a bit more of
> what you see? Can you identify which process hangs and send us a stack-trace
> of that one? Then we could look into this...

If you have access to the task manager instances, you can run `jps` to get the
PID of the task manager and then run `jstack PID`:

    $ jps
    16242 Jps
    89107 TaskManager
    $ jstack 89107
    [stack trace]

It would be great if you could share this output after the task managers freeze.

Can you also provide some information on your setup (what job? how many task
managers? etc.) so that I can try to reproduce this?
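
In case it helps, the `jps` and `jstack` steps above could be combined into a small script. This is only a sketch: it assumes `jps` and `jstack` from the JDK are on the PATH, that the task manager shows up in `jps` under the name `TaskManager` as in the example output, and the function names and output file names are made up for illustration.

```shell
#!/bin/sh

# Read `jps` output on stdin and print the PID of every line whose
# process name is exactly "TaskManager" (name taken from the example above).
taskmanager_pids() {
    awk '$2 == "TaskManager" { print $1 }'
}

# Write one stack trace file per task manager JVM found on this machine.
# The file name pattern is a hypothetical choice, not a Flink convention.
dump_taskmanagers() {
    for pid in $(jps | taskmanager_pids); do
        jstack "$pid" > "taskmanager-$pid.jstack.txt"
    done
}
```

After sourcing the file on a worker, calling `dump_taskmanagers` once the job hangs should leave one `taskmanager-<PID>.jstack.txt` file per task manager in the current directory, which you could then attach to your reply.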