I used the streaming WordCount example provided by Flink, and the file contains text like "This is some text...", which I just copied several times.
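
For reference, a minimal sketch of a file-based streaming WordCount along these lines (not the exact bundled example; the class name and input path are placeholders):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read the (roughly 2GB) text file; the path is a placeholder.
        DataStream<String> text = env.readTextFile("/path/to/input.txt");

        DataStream<Tuple2<String, Integer>> counts = text
            // split each line into (word, 1) pairs
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            // group by the word and sum the counts per word
            .keyBy(0)
            .sum(1);

        counts.print();
        env.execute("Streaming WordCount");
    }
}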

Best,

Habib

On 11/1/2019 6:03 AM, Zhenghua Gao wrote:
2019-10-30 15:59:52,122 INFO  org.apache.flink.runtime.taskmanager.Task - Split Reader: Custom File Source -> Flat Map (1/1) (6a17c410c3e36f524bb774d2dffed4a4) switched from DEPLOYING to RUNNING.
2019-10-30 17:45:10,943 INFO  org.apache.flink.runtime.taskmanager.Task - Split Reader: Custom File Source -> Flat Map (1/1) (6a17c410c3e36f524bb774d2dffed4a4) switched from RUNNING to FINISHED.
It's surprising that the source task needed about 105 minutes (15:59 to 17:45) to read a 2GB file.
Could you share your code snippets and some sample lines of the 2GB file?
I will try to reproduce your scenario and dig into the root cause.
Best Regards,
Zhenghua Gao


On Thu, Oct 31, 2019 at 9:05 PM Habib Mostafaei <ha...@inet.tu-berlin.de> wrote:

    I enclosed all logs from the run; for this run I used parallelism
    one. However, for the other runs I checked, all parallel workers
    were working properly. Is there a simple way to get profiling
    information in Flink?

    Best,

    Habib

    On 10/31/2019 2:54 AM, Zhenghua Gao wrote:
    I think more runtime information would help figure out where the
    problem is:
    1) how many parallel subtasks are actually working,
    2) the metrics for each operator,
    3) the JVM profiling information, etc.

    Best Regards,
    Zhenghua Gao


    On Wed, Oct 30, 2019 at 8:25 PM Habib Mostafaei
    <ha...@inet.tu-berlin.de> wrote:

        Thanks, Gao, for the reply. I used the parallelism parameter
        with different values like 6 and 8, but the execution time is
        still not comparable with a single-threaded Python script.
        What would be a reasonable value for the parallelism?

        Best,

        Habib

        On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
        The reason might be that the parallelism of your task is only 1,
        which is too low. See [1] for how to specify a proper parallelism
        for your job; the execution time should be reduced significantly.

        [1]
        https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
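
        For example (an illustrative sketch only; the value 8 and the
        input path are placeholders, not a recommendation for your cluster):

        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

        public class ParallelismSketch {
            public static void main(String[] args) throws Exception {
                StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                // Job-wide default parallelism; it should not exceed the
                // total number of task slots available in the cluster.
                env.setParallelism(8);

                // A parallelism can also be set per operator, e.g.
                //   someStream.flatMap(...).setParallelism(8),
                // at submission time with "./bin/flink run -p 8 <jar>",
                // or cluster-wide via parallelism.default in flink-conf.yaml.

                env.readTextFile("/path/to/input.txt").print();
                env.execute("Parallelism sketch");
            }
        }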

        Best Regards,
        Zhenghua Gao


        On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei
        <ha...@inet.tu-berlin.de> wrote:

            Hi all,

            I am running Flink on a standalone cluster and getting very
            long execution times for streaming queries like WordCount on
            a fixed text file. My VM runs Debian 10 with 16 CPU cores and
            32GB of RAM, and I have a text file of size 2GB. When I run
            Flink on a standalone cluster, i.e., one JobManager and one
            TaskManager with 25GB of heap size, it takes around two hours
            to finish counting this file, while a simple Python script can
            do it in around 7 minutes. I am just wondering what is wrong
            with my setup. I also ran the experiments on a cluster with
            six TaskManagers, but I still get a very long execution time,
            around 25 minutes or so. I tried to increase the JVM heap size
            to lower the execution time, but it did not help. I have
            attached the log file and the Flink configuration file to
            this email.
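
            For reference, a setup like the one above corresponds roughly
            to flink-conf.yaml entries such as the following (illustrative
            values only; the attached configuration file is authoritative):

            # Sketch of the relevant flink-conf.yaml entries (illustrative).
            jobmanager.heap.size: 1024m
            taskmanager.heap.size: 25g
            # numberOfTaskSlots defaults to 1; with a single TaskManager
            # this caps the job at parallelism 1 unless more slots are
            # configured.
            taskmanager.numberOfTaskSlots: 1
            parallelism.default: 1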

            Best,

            Habib

