Quick question, Regina: which version of Flink are you running?

Cheers,
Till

On Tue, Nov 7, 2017 at 4:38 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

> Hi Regina,
>
> the user code is uploaded once to the `JobManager` and then downloaded
> by each `TaskManager` once, when it first receives the command to execute
> the first task of your job.
>
> As Chesnay said, there is no fundamental limitation to the size of the
> Flink job. However, it might be the case that you have configured your job
> sub-optimally. You said that you have 300 parallel flows. Depending on
> whether you've defined separate slot sharing groups for them or not, it
> might be the case that parallel subtasks of all 300 parallel jobs share the
> same slot (if you haven't changed the slot sharing group). Depending on
> what you calculate, this can be inefficient because the individual tasks
> don't get much computation time. Moreover, all tasks will allocate some
> objects on the heap, which can lead to more GC. Therefore, it might make
> sense to group some of the jobs together and run these jobs in batches,
> each batch starting after the previous one has completed. But this is hard
> to say without knowing the details of your job and getting a glimpse at the
> JobManager logs.
>
> Concerning the exception you're seeing, it would also be helpful to see
> the logs of the client and the JobManager. Actually, the scheduling of the
> job is independent of the response. Only the creation of the ExecutionGraph
> and making the JobGraph highly available (in case of an HA setup) are
> executed before the JobManager acknowledges the job submission. The
> SubmissionTimeoutException is thrown only if this acknowledgement is not
> received in time on the client side. Therefore, I assume that the
> JobManager is somehow too busy to send the acknowledgement, or is kept
> from sending it.
>
> Cheers,
> Till
>
> On Thu, Nov 2, 2017 at 7:18 PM, Chan, Regina <regina.c...@gs.com> wrote:
>
>> Does it copy per TaskManager or per operator? I only gave it 10
>> TaskManagers with 2 slots.
>> I’m perfectly fine with it queuing up and
>> running when it has the resources to.
>>
>> *From:* Chesnay Schepler [mailto:ches...@apache.org]
>> *Sent:* Wednesday, November 01, 2017 7:09 AM
>> *To:* user@flink.apache.org
>> *Subject:* Re: Job Manager Configuration
>>
>> AFAIK there is no theoretical limit on the size of the plan; it just
>> depends on the available resources.
>>
>> The job submission times out since it takes too long to deploy all the
>> operators that the job defines. With 300 flows, each with 6 operators,
>> you're looking at potentially (1800 * parallelism) tasks that have to be
>> deployed. For each task Flink copies the user code of *all* flows to the
>> executing TaskManager, which the network may just not be able to handle
>> in time.
>>
>> I suggest splitting your job into smaller batches or even running each
>> of them independently.
>>
>> On 31.10.2017 16:25, Chan, Regina wrote:
>>
>> Asking an additional question: what is the largest plan that the
>> JobManager can handle? Is there a limit? My flows don’t need to run in
>> parallel and can run independently. I wanted them to run in one single job
>> because it’s part of one logical commit on my side.
>>
>> Thanks,
>>
>> Regina
>>
>> *From:* Chan, Regina [Tech]
>> *Sent:* Monday, October 30, 2017 3:22 PM
>> *To:* 'user@flink.apache.org'
>> *Subject:* Job Manager Configuration
>>
>> Flink Users,
>>
>> I have about 300 parallel flows in one job, each with 2 inputs, 3
>> operators, and 1 sink, which makes for a large job. I keep getting the
>> timeout exception below, even though I’ve already set the timeout to 30
>> minutes and given the JobManager a 6GB heap. Is there a heuristic to
>> better configure the JobManager?
>>
>> Caused by:
>> org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException:
>> Job submission to the JobManager timed out.
>> You may increase
>> 'akka.client.timeout' in case the JobManager needs more time to configure
>> and confirm the job submission.
>>
>> *Regina Chan*
>>
>> *Goldman Sachs* *–* Enterprise Platforms, Data Architecture
>>
>> *30 Hudson Street, 37th floor | Jersey City, NY 07302*
>> (212) 902-5697
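
To put rough numbers on the sizing arithmetic from the thread (300 flows with 6 operators each, so potentially 1800 * parallelism tasks) and on the suggestion to split the flows into smaller batches, here is a minimal plain-Java sketch. It does not call any Flink API. The class name `FlowBatching`, the helper names, the parallelism of 20 (taken from 10 TaskManagers with 2 slots each) and the batch size of 50 are all assumptions made for illustration, not values anyone in the thread recommended.

```java
import java.util.ArrayList;
import java.util.List;

public class FlowBatching {

    /** Upper bound on tasks to deploy: flows * operatorsPerFlow * parallelism. */
    static long estimateTasks(int flows, int operatorsPerFlow, int parallelism) {
        return (long) flows * operatorsPerFlow * parallelism;
    }

    /** Splits flow ids 0..flows-1 into consecutive batches of at most batchSize ids. */
    static List<List<Integer>> splitIntoBatches(int flows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < flows; start += batchSize) {
            int end = Math.min(start + batchSize, flows);
            List<Integer> batch = new ArrayList<>();
            for (int id = start; id < end; id++) {
                batch.add(id);
            }
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        int flows = 300;           // from the thread
        int operatorsPerFlow = 6;  // 2 inputs + 3 operators + 1 sink
        int parallelism = 20;      // assumption: 10 TaskManagers * 2 slots
        int batchSize = 50;        // assumption: arbitrary illustrative batch size

        System.out.println("Tasks in one big job: "
                + estimateTasks(flows, operatorsPerFlow, parallelism));

        List<List<Integer>> batches = splitIntoBatches(flows, batchSize);
        System.out.println("Batches: " + batches.size());
        System.out.println("Tasks per batch job: "
                + estimateTasks(batches.get(0).size(), operatorsPerFlow, parallelism));
    }
}
```

Each batch would then be submitted as its own job once the previous one has finished; how that fits with treating all flows as one logical commit is outside what this sketch covers.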