Did you by chance use the RemoteEnvironment and pass in 6123 as the port? If so, try using 8081 instead, which is the REST port.
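
Roughly like this (the host and jar path below are placeholders, not your actual values):

    // Connect to the standalone cluster through its REST endpoint (8081 by default),
    // not the 6123 Akka RPC port.
    ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
            "jobmanager-host",       // placeholder: machine running the JobManager
            8081,                    // REST port (8081 unless you changed rest.port)
            "target/your-job.jar");  // placeholder: jar containing your job classes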

On 06.09.2018 18:24, Miguel Coimbra wrote:
Hello Chesnay,

Thanks for the information.

I decided to move straight to launching a standalone cluster.
I'm now having another problem when trying to submit a job through my Java program after launching the standalone cluster.

I configured the cluster (flink-1.6.0/conf/flink-conf.yaml) to use 2 TaskManager instances and assigned port ranges for most Flink cluster entities (to avoid port collisions with more than 1 TaskManager):

query.server.ports: 30000-35000
query.proxy.ports: 35001-40000
taskmanager.rpc.port: 45001-50000
taskmanager.data.port: 50001-55000
blob.server.port: 55001-60000

I'm launching in Linux with:

./start-cluster.sh

Starting cluster.
Starting standalonesession daemon on host xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.
[INFO] 1 instance(s) of taskexecutor are already running on xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.


However, my Java program ends up hanging as soon as I perform an execute() call (for example by calling count() on a DataSet).

Checking the JobManager log, I find the following exception whenever my Java program calls execute() over the ExecutionEnvironment (either using Maven on the terminal or from IntelliJ IDEA):

WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/127.0.0.1:47774] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1347375960 - discarded

I verified that the problem happens on a count(), so I don't think it is a matter of the JobManager/TaskManagers trying to exchange excessively big messages.

While searching, I tried to make sure my program compiles with the same library versions as those in this cluster version of Flink.


I downloaded the Apache Flink 1.6 binaries to launch the cluster:

https://www.apache.org/dyn/closer.lua/flink/flink-1.6.0/flink-1.6.0-bin-scala_2.11.tgz

I then checked the library versions used in the pom.xml of the release-1.6 branch of the Flink repository:

https://github.com/apache/flink/blob/release-1.6/pom.xml

On my project's pom.xml, I have the following:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <flink.version>1.6.0</flink.version>
    <slf4j.version>1.7.7</slf4j.version>
    <log4j.version>1.2.17</log4j.version>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <akka.version>2.4.20</akka.version>
    <junit.version>4.12</junit.version>
    <junit.jupiter.version>5.0.0</junit.jupiter.version>
    <junit.vintage.version>${junit.version}.1</junit.vintage.version>
    <junit.platform.version>1.0.1</junit.platform.version>
    <aspectj.version>1.9.1</aspectj.version>
</properties>

My project's dependency versions match those of the Flink 1.6 repository (for libraries such as akka). However, I'm having difficulty understanding what else may be causing this problem.
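
For reference, the Flink dependencies themselves are declared against these properties, roughly like this (trimmed to the core ones):

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>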

Thanks for your attention.

Best,

On Wed, 5 Sep 2018 at 20:18, Chesnay Schepler <ches...@apache.org> wrote:

    No, the cluster isn't shared. For each job a separate cluster is
    spun up when calling execute(), at the end of which it is shut down.

    For explicit creation and shutdown of a cluster, I would suggest
    executing your jobs as a test that contains a MiniClusterResource.
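
    Roughly along these lines (just a sketch; the exact configuration API
    for the mini cluster differs a bit across Flink versions, and the test
    needs the flink-test-utils dependency):

        // JUnit test that explicitly starts and stops a local Flink mini cluster
        public class MyJobTest {

            // Shared mini cluster for all tests in this class; the builder-style
            // configuration shown here may be named slightly differently in 1.6.
            @ClassRule
            public static final MiniClusterResource FLINK = new MiniClusterResource(
                    new MiniClusterResourceConfiguration.Builder()
                            .setNumberTaskManagers(1)
                            .setNumberSlotsPerTaskManager(2)
                            .build());

            @Test
            public void runJob() throws Exception {
                // getExecutionEnvironment() is wired to the mini cluster while it runs
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
                // build your job on env here; count()/collect()/execute() will run it
            }
        }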

    On 05.09.2018 20:59, Miguel Coimbra wrote:
    Thanks for the reply.

    However, I think my case differs because I am running a sequence
    of independent Flink jobs on the same environment instance.
    I only create the LocalExecutionEnvironment once.

    The web manager shows the job ID changing correctly every time a
    new job is executed.

    Since it is the same execution environment (and therefore the same
    cluster instance, I imagine), those completed jobs should show up
    as well, no?

    On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <ches...@apache.org> wrote:

        When you create an environment that way, the cluster is
        shut down once the job completes.
        The WebUI can _appear_ to still be working since all the files
        and data about the job are cached in the browser.

        On 05.09.2018 17:39, Miguel Coimbra wrote:
        Hello,

        I'm having difficulty reading the status (such as time taken
        for each dataflow operator in a job) of jobs that have
        completed.

        First, when I click on "Completed jobs" on the web interface
        (by default at 8081), no job shows up.
        I can see jobs listed as "Running", but once they finish they
        never appear in the "Completed jobs" section as I would
        expect.

        Note that I am running locally on 8081 (the web UI is up; I
        checked that it is reachable in the browser).
        None of these links worked for checking jobs that have
        already finished, such as the job ID
        618fac9da6ea458f5091a9c40e54cbcc that had been running:

        http://127.0.0.1:8081/jobs/618fac9da6ea458f5091a9c40e54cbcc
        http://127.0.0.1:8081/completed-jobs/618fac9da6ea458f5091a9c40e54cbcc

        I'm running with a LocalExecutionEnvironment created with the
        method:
        ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
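
        Roughly, the setup looks like this (the DataSet below is just a
        stand-in for my actual job):

            Configuration conf = new Configuration();
            ExecutionEnvironment env =
                    ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
            DataSet<Long> data = env.fromElements(1L, 2L, 3L);
            System.out.println(data.count()); // count() triggers job execution
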
        I hope anyone may be able to help.

        Best,




