That is the question I hope to be able to answer with the logs. Let's see what they say.
Cheers,
Till

On Wed, Nov 8, 2017 at 7:24 PM, Chan, Regina <regina.c...@gs.com> wrote:

Thanks for the responses!

I'm currently using 1.2.0 – going to bump it up once I have things stabilized. I haven't defined any slot sharing groups, but I do think I've probably got my job configured sub-optimally. I've refactored my code so that I can submit subsets of the flow at a time, and it seems to work. The break between the JobManager being able to acknowledge the job and not seems to hover somewhere between 10-20 flows.

I guess what doesn't make too much sense to me is: if the user code is uploaded once to the JobManager and downloaded once by each TaskManager, what exactly is the JobManager doing that's keeping it busy? It's the same code across the TaskManagers.

I'll get you the logs shortly.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Wednesday, November 08, 2017 10:17 AM
To: Chan, Regina [Tech]
Cc: Chesnay Schepler; user@flink.apache.org
Subject: Re: Job Manager Configuration

Quick question, Regina: which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

Hi Regina,

the user code is uploaded once to the `JobManager` and then downloaded by each `TaskManager` once, when it first receives the command to execute the first task of your job.

As Chesnay said, there is no fundamental limitation on the size of a Flink job. However, it might be the case that you have configured your job sub-optimally. You said that you have 300 parallel flows. Depending on whether you've defined separate slot sharing groups for them or not, it might be the case that parallel subtasks of all 300 flows share the same slot (if you haven't changed the slot sharing group). Depending on what you calculate, this can be inefficient because the individual tasks don't get much computation time. Moreover, all tasks will allocate some objects on the heap, which can lead to more GC. Therefore, it might make sense to group some of the flows together and run them in batches, each batch starting after the previous one has completed. But this is hard to say without knowing the details of your job and getting a glimpse at the JobManager logs.

Concerning the exception you're seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response. Only the creation of the ExecutionGraph and making the JobGraph highly available (in case of an HA setup) are executed before the JobManager acknowledges the job submission. Only if this acknowledgement message is not received in time on the client side is the SubmissionTimeoutException thrown. Therefore, I assume that somehow the JobManager is too busy or kept from sending the acknowledgement message.

Cheers,

Till
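For illustration, a minimal sketch of what separate slot sharing groups might look like with the DataStream API (the class name, sources, operators, and sinks below are placeholders, not the actual job):

    // Illustrative only: each flow gets its own slot sharing group so that its subtasks
    // are not co-located in the same slot as every other flow's subtasks.
    // Note: each group then needs its own slots, so total slot demand grows with the group count.
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingGroupsSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            for (int i = 0; i < 300; i++) {
                String group = "flow-" + i;                  // one slot sharing group per flow

                env.fromElements("a", "b", "c")              // placeholder input
                   .map(new MapFunction<String, String>() {  // placeholder operator
                       @Override
                       public String map(String value) {
                           return value.toUpperCase();
                       }
                   })
                   .slotSharingGroup(group)                  // downstream operators inherit this group
                   .print();                                 // placeholder sink
            }

            env.execute("300 flows with separate slot sharing groups");
        }
    }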
On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:

Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots. I'm perfectly fine with it queuing up and running when it has the resources to.

From: Chesnay Schepler [mailto:ches...@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan; it just depends on the available resources.

The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task Flink copies the user code of *all* flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:

Asking an additional question: what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below, but I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic to better configure the JobManager?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302
(212) 902-5697
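For illustration, a hedged sketch of the batching approach suggested above: submitting the flows as a sequence of smaller jobs via separate execute() calls (DataSet API assumed; the class name, batch size, and flow wiring are placeholders, not the actual job):

    // Illustrative only: split the ~300 flows into smaller jobs submitted one after another,
    // so each submission produces a smaller JobGraph for the JobManager to acknowledge and deploy.
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.io.DiscardingOutputFormat;

    public class BatchedSubmissionSketch {

        private static final int TOTAL_FLOWS = 300;  // illustrative
        private static final int BATCH_SIZE  = 20;   // illustrative

        public static void main(String[] args) throws Exception {
            for (int start = 0; start < TOTAL_FLOWS; start += BATCH_SIZE) {
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
                int end = Math.min(start + BATCH_SIZE, TOTAL_FLOWS);

                for (int flow = start; flow < end; flow++) {
                    env.fromElements(1, 2, 3)                     // placeholder inputs
                       .map(new MapFunction<Integer, Integer>() { // placeholder operator
                           @Override
                           public Integer map(Integer value) {
                               return value * 2;
                           }
                       })
                       .output(new DiscardingOutputFormat<Integer>()); // placeholder sink
                }

                // One smaller job per batch; the next batch is submitted only after this one finishes.
                env.execute("flows " + start + " to " + (end - 1));
            }
        }
    }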