I think more runtime information would help figure out where the problem is: 1) how many parallel subtasks are actually running, 2) the metrics for each operator, 3) the JVM profiling information, etc.
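As a rough illustration of collecting items 1) and 2), here is a minimal Python sketch that reads the job description from the JobManager REST API and lists each operator's parallelism and record counts. It is only a sketch under assumptions: the address `http://localhost:8081` (Flink's default REST port), the helper names `fetch_job_detail` and `summarize_vertices`, and the JSON field names (`vertices`, `parallelism`, `metrics`, `read-records`, `write-records`) are not taken from this thread and should be checked against the REST API documentation of the Flink version in use.

```python
import json
from urllib.request import urlopen

# Assumption: the JobManager REST endpoint is reachable at this address.
BASE_URL = "http://localhost:8081"


def fetch_job_detail(job_id, base_url=BASE_URL):
    """Fetch the JSON description of one job from the JobManager REST API."""
    with urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def summarize_vertices(job_detail):
    """Return (operator name, parallelism, records in, records out) per vertex.

    Fields missing from the response are reported as None instead of failing.
    """
    rows = []
    for vertex in job_detail.get("vertices", []):
        metrics = vertex.get("metrics", {})
        rows.append((
            vertex.get("name"),
            vertex.get("parallelism"),
            metrics.get("read-records"),
            metrics.get("write-records"),
        ))
    return rows
```

Calling `summarize_vertices(fetch_job_detail(job_id))` with the ID of the running job would show whether every operator really runs with the configured parallelism and which operator handles the most records.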
*Best Regards,*
*Zhenghua Gao*

On Wed, Oct 30, 2019 at 8:25 PM Habib Mostafaei <ha...@inet.tu-berlin.de> wrote:

> Thanks Gao for the reply. I used the parallelism parameter with different
> values like 6 and 8 but still the execution time is not comparable with a
> single threaded python script. What would be the reasonable value for the
> parallelism?
>
> Best,
>
> Habib
>
> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>
> The reason might be the parallelism of your task is only 1, that's too
> low.
> See [1] to specify proper parallelism for your job, and the execution
> time should be reduced significantly.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
>
> *Best Regards,*
> *Zhenghua Gao*
>
>
> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
> wrote:
>
>> Hi all,
>>
>> I am running Flink on a standalone cluster and getting very long
>> execution time for the streaming queries like WordCount for a fixed text
>> file. My VM runs on a Debian 10 with 16 cpu cores and 32GB of RAM. I
>> have a text file with size of 2GB. When I run the Flink on a standalone
>> cluster, i.e., one JobManager and one taskManager with 25GB of heapsize,
>> it took around two hours to finish counting this file while a simple
>> python script can do it in around 7 minutes. Just wondering what is
>> wrong with my setup. I ran the experiments on a cluster with six
>> taskManagers, but I still get very long execution time like 25 minutes
>> or so. I tried to increase the JVM heap size to have lower execution
>> time but it did not help. I attached the log file and the Flink
>> configuration file to this email.
>>
>> Best,
>>
>> Habib
>>
>> --
> Habib Mostafaei, Ph.D.
> Postdoctoral researcher
> TU Berlin,
> FG INET, MAR 4.003
> Marchstraße 23, 10587 Berlin