Hi, I ran the streaming WordCount with a 2GB text file (copied /usr/share/dict/words 400 times) last weekend and did not reproduce your result (it took 16 minutes in my case). But I found some clues that may help you:
The streaming WordCount job writes every intermediate result to your output file (if specified) or to taskmanager.out. That output is large (about 4GB in my case), and it is what drives the disk writes so high. A minimal sketch of why this happens is at the bottom of this mail.

*Best Regards,*
*Zhenghua Gao*

On Fri, Nov 1, 2019 at 4:40 PM Habib Mostafaei <ha...@inet.tu-berlin.de> wrote:

> I used the streaming WordCount provided by Flink, and the file contains text
> like "This is some text...". I just copied it several times.
>
> Best,
>
> Habib
>
> On 11/1/2019 6:03 AM, Zhenghua Gao wrote:
>
> 2019-10-30 15:59:52,122 INFO org.apache.flink.runtime.taskmanager.Task
> - Split Reader: Custom File Source -> Flat Map (1/1)
> (6a17c410c3e36f524bb774d2dffed4a4) switched from DEPLOYING to RUNNING.
>
> 2019-10-30 17:45:10,943 INFO org.apache.flink.runtime.taskmanager.Task
> - Split Reader: Custom File Source -> Flat Map (1/1)
> (6a17c410c3e36f524bb774d2dffed4a4) switched from RUNNING to FINISHED.
>
> It is surprising that the source task took about 105 minutes (15:59 to 17:45)
> to read a 2GB file.
>
> Could you give me your code snippets and some sample lines of the 2GB file?
>
> I will try to reproduce your scenario and dig into the root causes.
>
> *Best Regards,*
> *Zhenghua Gao*
>
>
> On Thu, Oct 31, 2019 at 9:05 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
> wrote:
>
>> I enclosed all logs from the run, and for this run I used a parallelism of
>> one. However, for the other runs I checked and found that all parallel
>> workers were working properly. Is there a simple way to get profiling
>> information in Flink?
>>
>> Best,
>>
>> Habib
>>
>> On 10/31/2019 2:54 AM, Zhenghua Gao wrote:
>>
>> I think more runtime information would help figure out where the problem
>> is:
>> 1) how many parallel subtasks are actually working
>> 2) the metrics for each operator
>> 3) the JVM profiling information, etc.
>>
>> *Best Regards,*
>> *Zhenghua Gao*
>>
>>
>> On Wed, Oct 30, 2019 at 8:25 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
>> wrote:
>>
>>> Thanks Gao for the reply. I used the parallelism parameter with
>>> different values like 6 and 8, but the execution time is still not
>>> comparable with a single-threaded Python script. What would be a
>>> reasonable value for the parallelism?
>>>
>>> Best,
>>>
>>> Habib
>>>
>>> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>>>
>>> The reason might be that the parallelism of your task is only 1, which is
>>> too low. See [1] to specify a proper parallelism for your job, and the
>>> execution time should drop significantly.
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
>>>
>>> *Best Regards,*
>>> *Zhenghua Gao*
>>>
>>>
>>> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am running Flink on a standalone cluster and getting very long
>>>> execution times for streaming queries like WordCount on a fixed text
>>>> file. My VM runs Debian 10 with 16 CPU cores and 32GB of RAM. I
>>>> have a text file of about 2GB. When I run Flink on a standalone
>>>> cluster, i.e., one JobManager and one TaskManager with 25GB of
>>>> heap size, it takes around two hours to finish counting this file, while
>>>> a simple Python script can do it in around 7 minutes. I am just wondering
>>>> what is wrong with my setup. I also ran the experiments on a cluster with
>>>> six TaskManagers, but I still get very long execution times, around 25
>>>> minutes or so. I tried to increase the JVM heap size to lower the
>>>> execution time, but it did not help. I attached the log file and the
>>>> Flink configuration file to this email.
>>>>
>>>> Best,
>>>>
>>>> Habib
>>>>
>>>>
>>
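---

For reference, here is a minimal sketch along the lines of the bundled streaming WordCount (the class names StreamingWordCountSketch/Tokenizer and the file paths are placeholders I made up, not your actual job; the API calls are standard DataStream API). The point is the keyed running sum: every incoming word makes the sum operator emit an updated (word, count) record, so the sink receives roughly one record per input word, which is why the output file can grow to several GB for a 2GB input.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCountSketch {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path: the 2GB text file.
        DataStream<String> text = env.readTextFile("/path/to/input.txt");

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())
                .keyBy(0)   // key by the word
                .sum(1);    // running count per word

        // Every input word produces an *updated* (word, count) record here,
        // so the sink writes roughly one line per input word -- far more data
        // than the final aggregates, which explains the ~4GB output file.
        counts.writeAsText("/path/to/output.txt");
        // counts.print();  // without an explicit sink this goes to taskmanager.out

        env.execute("Streaming WordCount sketch");
    }

    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}

If you only need the final counts, the batch (DataSet) WordCount example emits each word once, which also keeps the output small and takes the disk writes out of the measurement.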
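And regarding the parallelism suggestion quoted above: besides passing -p to the flink run command, the parallelism can be set on the execution environment in code. A tiny sketch (the value 8 is only an example and should not exceed the task slots available in your cluster):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Default parallelism for every operator of this job that does not set
        // its own value. The parallelism page [1] quoted above also covers the
        // client- and cluster-level options (e.g. parallelism.default in
        // flink-conf.yaml).
        env.setParallelism(8);

        // Tiny placeholder pipeline so the job has something to execute; in the
        // real job this would be the WordCount pipeline from the sketch above.
        env.fromElements("hello", "world", "hello").print();

        env.execute("Parallelism sketch");
    }
}

Note that with a single TaskManager you also need enough task slots (taskmanager.numberOfTaskSlots) to actually run that many parallel subtasks.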