Hi, I ran the streaming WordCount with a 2GB text file (copied /usr/share/dict/words 400 times) last weekend and did not reproduce your result (it took 16 minutes in my case). But I found some clues that may help you:
The streaming WordCount job writes every intermediate result to your output file (if specified) or to taskmanager.out. That output is large (about 4GB in my case), and it is what drives the disk writes so high. A minimal sketch of why this happens is at the bottom of this mail.

*Best Regards,*
*Zhenghua Gao*

On Fri, Nov 1, 2019 at 4:40 PM Habib Mostafaei <ha...@inet.tu-berlin.de> wrote:

> I used the streaming WordCount provided by Flink, and the file contains text
> like "This is some text...". I just copied it several times.
>
> Best,
>
> Habib
>
> On 11/1/2019 6:03 AM, Zhenghua Gao wrote:
>
> 2019-10-30 15:59:52,122 INFO org.apache.flink.runtime.taskmanager.Task
> - Split Reader: Custom File Source -> Flat Map (1/1)
> (6a17c410c3e36f524bb774d2dffed4a4) switched from DEPLOYING to RUNNING.
>
> 2019-10-30 17:45:10,943 INFO org.apache.flink.runtime.taskmanager.Task
> - Split Reader: Custom File Source -> Flat Map (1/1)
> (6a17c410c3e36f524bb774d2dffed4a4) switched from RUNNING to FINISHED.
>
> It is surprising that the source task took about 105 minutes (15:59 to 17:45)
> to read a 2GB file.
>
> Could you give me your code snippets and some sample lines of the 2GB file?
>
> I will try to reproduce your scenario and dig into the root causes.
>
> *Best Regards,*
> *Zhenghua Gao*
>
>
> On Thu, Oct 31, 2019 at 9:05 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
> wrote:
>
>> I enclosed all logs from the run, and for this run I used a parallelism of
>> one. However, for the other runs I checked and found that all parallel
>> workers were working properly. Is there a simple way to get profiling
>> information in Flink?
>>
>> Best,
>>
>> Habib
>>
>> On 10/31/2019 2:54 AM, Zhenghua Gao wrote:
>>
>> I think more runtime information would help figure out where the problem
>> is:
>> 1) how many parallel subtasks are actually working
>> 2) the metrics for each operator
>> 3) the JVM profiling information, etc.
>>
>> *Best Regards,*
>> *Zhenghua Gao*
>>
>>
>> On Wed, Oct 30, 2019 at 8:25 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
>> wrote:
>>
>>> Thanks Gao for the reply. I used the parallelism parameter with
>>> different values like 6 and 8, but the execution time is still not
>>> comparable with a single-threaded Python script. What would be a
>>> reasonable value for the parallelism?
>>>
>>> Best,
>>>
>>> Habib
>>>
>>> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>>>
>>> The reason might be that the parallelism of your task is only 1, which is
>>> too low. See [1] to specify a proper parallelism for your job, and the
>>> execution time should drop significantly.
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
>>>
>>> *Best Regards,*
>>> *Zhenghua Gao*
>>>
>>>
>>> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am running Flink on a standalone cluster and getting very long
>>>> execution times for streaming queries like WordCount on a fixed text
>>>> file. My VM runs Debian 10 with 16 CPU cores and 32GB of RAM. I
>>>> have a text file of about 2GB. When I run Flink on a standalone
>>>> cluster, i.e., one JobManager and one TaskManager with 25GB of
>>>> heap size, it takes around two hours to finish counting this file, while
>>>> a simple Python script can do it in around 7 minutes. I am just wondering
>>>> what is wrong with my setup. I also ran the experiments on a cluster with
>>>> six TaskManagers, but I still get very long execution times, around 25
>>>> minutes or so. I tried to increase the JVM heap size to lower the
>>>> execution time, but it did not help. I attached the log file and the
>>>> Flink configuration file to this email.
>>>>
>>>> Best,
>>>>
>>>> Habib
>>>>
>>>>
>>
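---

For reference, here is a minimal sketch along the lines of the bundled streaming WordCount (the class names StreamingWordCountSketch/Tokenizer and the file paths are placeholders I made up, not your actual job; the API calls are standard DataStream API). The point is the keyed running sum: every incoming word makes the sum operator emit an updated (word, count) record, so the sink receives roughly one record per input word, which is why the output file can grow to several GB for a 2GB input.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCountSketch {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path: the 2GB text file.
        DataStream<String> text = env.readTextFile("/path/to/input.txt");

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())
                .keyBy(0)   // key by the word
                .sum(1);    // running count per word

        // Every input word produces an *updated* (word, count) record here,
        // so the sink writes roughly one line per input word -- far more data
        // than the final aggregates, which explains the ~4GB output file.
        counts.writeAsText("/path/to/output.txt");
        // counts.print();  // without an explicit sink this goes to taskmanager.out

        env.execute("Streaming WordCount sketch");
    }

    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}

If you only need the final counts, the batch (DataSet) WordCount example emits each word once, which also keeps the output small and takes the disk writes out of the measurement.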
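And regarding the parallelism suggestion quoted above: besides passing -p to the flink run command, the parallelism can be set on the execution environment in code. A tiny sketch (the value 8 is only an example and should not exceed the task slots available in your cluster):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Default parallelism for every operator of this job that does not set
        // its own value. The parallelism page [1] quoted above also covers the
        // client- and cluster-level options (e.g. parallelism.default in
        // flink-conf.yaml).
        env.setParallelism(8);

        // Tiny placeholder pipeline so the job has something to execute; in the
        // real job this would be the WordCount pipeline from the sketch above.
        env.fromElements("hello", "world", "hello").print();

        env.execute("Parallelism sketch");
    }
}

Note that with a single TaskManager you also need enough task slots (taskmanager.numberOfTaskSlots) to actually run that many parallel subtasks.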