Re: low performance in running queries

Piotr Nowojski Wed, 30 Oct 2019 07:29:21 -0700

Hi,

I would also suggest attaching a code profiler to the process during those 
2 hours and gathering some results. It might answer some questions about what 
is taking so long.
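
For a quick first look before setting up a real profiler (e.g. async-profiler 
or Java Flight Recorder), a rough sampling approach may already help: take 
periodic thread dumps of the TaskManager JVM with the JDK's `jstack` tool and 
count which method sits on top of each RUNNABLE thread. The sketch below is 
only an illustration under assumptions: the helper names, the 20 samples, and 
the 1 second interval are made up for this example, and the PID has to be 
looked up by hand (e.g. with `jps`).

```python
import re
import subprocess
import sys
import time
from collections import Counter

# Matches a stack frame line such as "\tat java.io.FileInputStream.read(...)".
FRAME_RE = re.compile(r"^\s+at\s+([^\s(]+)\(")


def top_frames(dump: str) -> Counter:
    """Count the topmost frame of every RUNNABLE thread in one jstack dump."""
    counts = Counter()
    runnable = False
    for line in dump.splitlines():
        if line.startswith('"'):
            # Thread header line, e.g. "main" #1 prio=5 ...; reset the state.
            runnable = False
        elif "java.lang.Thread.State: RUNNABLE" in line:
            runnable = True
        elif runnable:
            match = FRAME_RE.match(line)
            if match:
                counts[match.group(1)] += 1
                runnable = False  # only the topmost frame is counted
    return counts


def sample(pid: int, samples: int = 20, interval_s: float = 1.0) -> Counter:
    """Take several jstack dumps of the given JVM and aggregate the top frames.

    The sample count and interval are arbitrary defaults, not recommendations.
    """
    total = Counter()
    for _ in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        total += top_frames(result.stdout)
        time.sleep(interval_s)
    return total


if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        for frame, hits in sample(int(sys.argv[1])).most_common(15):
            print(f"{hits:6d}  {frame}")
    else:
        print("usage: python sample_stacks.py <taskmanager-pid>")
```

Methods that show up in most samples are the likely hot spots. This is much 
coarser than a real profiler, so treat the output as a hint only.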


Piotrek

> On 30 Oct 2019, at 15:11, Chris Miller <[email protected]> wrote:
> 
> I haven't run any benchmarks with Flink or even used it enough to directly 
> help with your question, however I suspect that the following article might 
> be relevant:
> 
> http://dsrg.pdos.csail.mit.edu/2016/06/26/scalability-cost/ 
> 
> Given the computation you're performing is trivial, it's possible that the 
> additional overhead of serialisation, interprocess communication, state 
> management, etc. that distributed systems like Flink require is dominating 
> the runtime here. 2 hours (or even 25 minutes) still seems too long to me, 
> however, so hopefully it really is just a configuration issue of some sort. 
> Either way, if you do figure this out or anyone with good knowledge of the 
> article above in relation to Flink is able to give their thoughts, I'd be 
> very interested in hearing more.
> 
> Regards,
> Chris
> 
> 
> ------ Original Message ------
> From: "Habib Mostafaei" <[email protected]>
> To: "Zhenghua Gao" <[email protected]>
> Cc: "user" <[email protected]>; "Georgios Smaragdakis" <[email protected]>; 
> "Niklas Semmler" <[email protected]>
> Sent: 30/10/2019 12:25:28
> Subject: Re: low performance in running queries
> 
>> Thanks, Gao, for the reply. I tried the parallelism parameter with different 
>> values such as 6 and 8, but the execution time is still not comparable to 
>> that of a single-threaded Python script. What would be a reasonable value 
>> for the parallelism?
>> 
>> Best,
>> 
>> Habib
>> 
>> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>>> The reason might be that the parallelism of your task is only 1, which is 
>>> too low. See [1] to specify a proper parallelism for your job, and the 
>>> execution time should be reduced significantly.
>>> 
>>> [1] 
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html 
>>> 
>>> Best Regards,
>>> Zhenghua Gao
>>> 
>>> 
>>> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <[email protected]> wrote:
>>> Hi all,
>>> 
>>> I am running Flink on a standalone cluster and getting very long 
>>> execution times for streaming queries like WordCount on a fixed text 
>>> file. My VM runs Debian 10 with 16 CPU cores and 32GB of RAM. I 
>>> have a text file of 2GB. When I run Flink on a standalone 
>>> cluster, i.e., one JobManager and one TaskManager with a 25GB heap, 
>>> it takes around two hours to finish counting this file, while a simple 
>>> Python script can do it in around 7 minutes. I am wondering what is 
>>> wrong with my setup. I also ran the experiments on a cluster with six 
>>> TaskManagers, but I still get a very long execution time of 25 minutes 
>>> or so. I tried increasing the JVM heap size to lower the execution 
>>> time, but it did not help. I attached the log file and the Flink 
>>> configuration file to this email.
>>> 
>>> Best,
>>> 
>>> Habib
>>> 
>> -- 
>> Habib Mostafaei, Ph.D.
>> Postdoctoral researcher
>> TU Berlin,
>> FG INET, MAR 4.003
>> Marchstraße 23, 10587 Berlin
