RE: [DISCUSS] Semantic and implementation of per-job mode

2019-10-31 Thread Chan, Regina
Yeah, just chiming in on this conversation as well. We heavily use multiple job graphs to get isolation around retry logic and resource allocation across the job graphs. Putting all these parallel flows into a single graph would mean sharing of TaskManagers across what was meant to be truly independ

RE: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-25 Thread Chan, Regina
4 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e22_1571837093169_78279_01_000947 - Remaining pending container requests: 0 2019-10-25 09:55:51,514 INFO org.apache.flink.yarn.YarnResourceManager - From: Chan, Regina [Engineering] Sent: Wednesday, Octo
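For anyone following the pending-request count in logs like the one quoted above, here is a minimal Python sketch that pulls the container ID and the "Remaining pending container requests" value out of each matching line. It matches only the message text visible in the snippet; the rest of the log layout is not assumed, and the names `RECEIVED`, `pending_counts`, and `sample` are hypothetical helpers, not part of Flink.

```python
import re

# Matches only the message text shown in the snippet above. The surrounding
# layout (timestamp, level, logger name) is deliberately not assumed.
RECEIVED = re.compile(
    r"Received new container: (?P<container>container_\S+)"
    r" - Remaining pending container requests: (?P<pending>\d+)"
)


def pending_counts(lines):
    """Yield (container_id, pending_requests) for each matching log line."""
    for line in lines:
        match = RECEIVED.search(line)
        if match:
            yield match.group("container"), int(match.group("pending"))


# Sample line built from the message text quoted in the thread snippet.
sample = [
    "INFO org.apache.flink.yarn.YarnResourceManager - Received new container: "
    "container_e22_1571837093169_78279_01_000947 - "
    "Remaining pending container requests: 0",
]
result = list(pending_counts(sample))
```

Running this over a full YarnResourceManager log would give a sequence of counts that can be compared against the number of containers actually requested.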

RE: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Chan, Regina
AM To: Yang Wang Cc: Chan, Regina [Engineering] ; user@flink.apache.org Subject: Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged Hi Regina, When using the FLIP-6 mode, you can control how long it takes for an

RE: Lost JobManager

2018-05-08 Thread Chan, Regina
There’s no collect() explicitly from me. It has a cogroup operator before writing to DataSink. From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: Monday, May 07, 2018 6:31 AM To: Chan, Regina [Tech] Cc: user@flink.apache.org; Newport, Billy [Tech] Subject: Re: Lost JobManager Hi Regina, I

RE: Fat jar fails deployment (streaming job too large)

2018-04-30 Thread Chan, Regina
Any updates on this one? I'm seeing similar issues with 1.3.3 and the batch API. The main difference is that I have even more operators, ~850, mostly maps and filters with one cogroup. I don't really want to set an akka.client.timeout of more than 10 minutes, seeing that it still fails with

RE: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

2017-12-11 Thread Chan, Regina
ser/delp/.flink/application_1510733430616_2098853/log4j.properties From: Chan, Regina [Tech] Sent: Tuesday, December 12, 2017 1:56 AM To: 'user@flink.apache.org' Subject: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device Hi, I'm currently su

ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

2017-12-11 Thread Chan, Regina
Hi, I'm currently submitting 50 separate jobs to a setup with 50 TMs, 1 slot each. Each job has a parallelism of 1. There's plenty of space left in my cluster and on that node. It's not clear to me what's happening. Any pointers? On the client side, when I try to execute, I see the following: org.apache.flink.

Flink 1.2.0->1.3.2 TaskManager reporting to JobManager

2017-11-27 Thread Chan, Regina
Hi, As I moved from Flink 1.2.0 to 1.3.2, I noticed that the TaskManager may have all tasks in FINISHED but then take about 2-3 minutes before the Job execution switches to FINISHED. What is it doing that's taking this long? This was a parallelism = 1 case... Regina Chan Goldman Sachs - Enter

Avoiding Dynamic Classloading

2017-11-20 Thread Chan, Regina
Hi, I was reading that I should avoid using dynamic classloading and so copy the job's jar into the /lib directory (RE: below) 1. How can I confirm that the jar was copied over? I only see the following below: 2017-11-20 15:36:52,724 INFO org.apache.flink.yarn.Utils

RE: Job Manager Configuration

2017-11-18 Thread Chan, Regina
To: Chan, Regina [Tech] Cc: user@flink.apache.org Subject: Re: Job Manager Configuration I have an IO-dominated batch job with 471 distinct tasks (3786 tasks with parallelism) running on 8 nodes with 12 GiB of memory and 4 CPUs each. I haven’t had any problems adding additional tasks except for 1

RE: Job Manager Configuration

2017-11-08 Thread Chan, Regina
JobManager doing that’s keeping it busy? It’s the same code across the TaskManagers. I’ll get you the logs shortly. From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: Wednesday, November 08, 2017 10:17 AM To: Chan, Regina [Tech] Cc: Chesnay Schepler; user@flink.apache.org Subject: Re: Job

RE: Job Manager Configuration

2017-11-02 Thread Chan, Regina
ependently. On 31.10.2017 16:25, Chan, Regina wrote: Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one

RE: Job Manager Configuration

2017-10-31 Thread Chan, Regina
Asking an additional question, what is the largest plan that the JobManager can handle? Is there a limit? My flows don't need to run in parallel and can run independently. I wanted them to run in one single job because it's part of one logical commit on my side. Thanks, Regina

Job Manager Configuration

2017-10-30 Thread Chan, Regina
Flink Users, I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the below timeout exception even though I've already set a 30-minute timeout with a 6GB heap on the JobManager. Is there a heuristic to better configure

Impersonation support in Flink

2017-10-23 Thread Chan, Regina
Hi folks, Is Flink able to do impersonation using UserGroupInformation? How do we make all the tasks run with this in a way that we wouldn't have to do it per task? UserGroupInformation ugi = UserGroupInformation.createProxyUser( proxyUser, UserGroupInformation.getLoginUser()); PrivilegedEx

Flink Yarn Session failures

2017-08-28 Thread Chan, Regina
Hi, I was trying to understand why it takes about 9 minutes between the last attempt to start a container and when it finally gets the SIGTERM to kill the YarnApplicationMasterRunner. Client: Calc Engine: 2017-08-28 12:39:23,596 INFO org.apache.flink.yarn.YarnClusterClient