That is the question I hope to be able to answer with the logs. Let's see what they say.
Cheers,
Till

On Wed, Nov 8, 2017 at 7:24 PM, Chan, Regina <regina.c...@gs.com> wrote:

Thanks for the responses!

I'm currently using 1.2.0 – going to bump it up once I have things stabilized. I haven't defined any slot sharing groups, but I do think I've probably got my job configured sub-optimally. I've refactored my code so that I can submit subsets of the flow at a time, and it seems to work. The break between the JobManager being able to acknowledge the job and not seems to hover somewhere between 10-20 flows.

I guess what doesn't make too much sense to me is: if the user code is uploaded once to the JobManager and downloaded once by each TaskManager, what exactly is the JobManager doing that's keeping it busy? It's the same code across the TaskManagers.

I'll get you the logs shortly.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Wednesday, November 08, 2017 10:17 AM
To: Chan, Regina [Tech]
Cc: Chesnay Schepler; user@flink.apache.org
Subject: Re: Job Manager Configuration

Quick question, Regina: which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

Hi Regina,

the user code is uploaded once to the `JobManager` and then downloaded by each `TaskManager` once, when it first receives the command to execute the first task of your job.

As Chesnay said, there is no fundamental limitation on the size of a Flink job. However, it might be the case that you have configured your job sub-optimally. You said that you have 300 parallel flows. Depending on whether you've defined separate slot sharing groups for them or not, it might be the case that parallel subtasks of all 300 flows share the same slot (if you haven't changed the slot sharing group). Depending on what you calculate, this can be inefficient because the individual tasks don't get much computation time. Moreover, all tasks will allocate some objects on the heap, which can lead to more GC. Therefore, it might make sense to group some of the flows together and run them in batches, each batch starting after the previous one has completed. But this is hard to say without knowing the details of your job and getting a glimpse at the JobManager logs.

Concerning the exception you're seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response. Only the creation of the ExecutionGraph and making the JobGraph highly available (in case of an HA setup) are executed before the JobManager acknowledges the job submission. Only if this acknowledgement message is not received in time on the client side is the SubmissionTimeoutException thrown. Therefore, I assume that somehow the JobManager is too busy or kept from sending the acknowledgement message.

Cheers,

Till
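For illustration, a minimal sketch of what separate slot sharing groups might look like with the DataStream API (the class name, sources, operators, and sinks below are placeholders, not the actual job):

    // Illustrative only: each flow gets its own slot sharing group so that its subtasks
    // are not co-located in the same slot as every other flow's subtasks.
    // Note: each group then needs its own slots, so total slot demand grows with the group count.
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingGroupsSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            for (int i = 0; i < 300; i++) {
                String group = "flow-" + i;                  // one slot sharing group per flow

                env.fromElements("a", "b", "c")              // placeholder input
                   .map(new MapFunction<String, String>() {  // placeholder operator
                       @Override
                       public String map(String value) {
                           return value.toUpperCase();
                       }
                   })
                   .slotSharingGroup(group)                  // downstream operators inherit this group
                   .print();                                 // placeholder sink
            }

            env.execute("300 flows with separate slot sharing groups");
        }
    }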
On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:

Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots. I'm perfectly fine with it queuing up and running when it has the resources to.

From: Chesnay Schepler [mailto:ches...@apache.org]
Sent: Wednesday, November 01, 2017 7:09 AM
To: user@flink.apache.org
Subject: Re: Job Manager Configuration

AFAIK there is no theoretical limit on the size of the plan; it just depends on the available resources.

The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task Flink copies the user code of *all* flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches, or even running each flow independently.

On 31.10.2017 16:25, Chan, Regina wrote:

Asking an additional question: what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side.

Thanks,
Regina

From: Chan, Regina [Tech]
Sent: Monday, October 30, 2017 3:22 PM
To: 'user@flink.apache.org'
Subject: Job Manager Configuration

Flink Users,

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below, but I've already set a 30-minute timeout and a 6 GB heap on the JobManager. Is there a heuristic to better configure the JobManager?

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

Regina Chan
Goldman Sachs – Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302
(212) 902-5697
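For illustration, a hedged sketch of the batching approach suggested above: submitting the flows as a sequence of smaller jobs via separate execute() calls (DataSet API assumed; the class name, batch size, and flow wiring are placeholders, not the actual job):

    // Illustrative only: split the ~300 flows into smaller jobs submitted one after another,
    // so each submission produces a smaller JobGraph for the JobManager to acknowledge and deploy.
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.io.DiscardingOutputFormat;

    public class BatchedSubmissionSketch {

        private static final int TOTAL_FLOWS = 300;  // illustrative
        private static final int BATCH_SIZE  = 20;   // illustrative

        public static void main(String[] args) throws Exception {
            for (int start = 0; start < TOTAL_FLOWS; start += BATCH_SIZE) {
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
                int end = Math.min(start + BATCH_SIZE, TOTAL_FLOWS);

                for (int flow = start; flow < end; flow++) {
                    env.fromElements(1, 2, 3)                     // placeholder inputs
                       .map(new MapFunction<Integer, Integer>() { // placeholder operator
                           @Override
                           public Integer map(Integer value) {
                               return value * 2;
                           }
                       })
                       .output(new DiscardingOutputFormat<Integer>()); // placeholder sink
                }

                // One smaller job per batch; the next batch is submitted only after this one finishes.
                env.execute("flows " + start + " to " + (end - 1));
            }
        }
    }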