Re: Operator latency metric not working in 1.9.1

2020-03-03 Thread Gary Yao
Hi, There is a release note for Flink 1.7 that could be relevant for you [1] Granularity of latency metrics The default granularity for latency metrics has been modified. To restore the previous behavior users have to explicitly set the granularity to subtask. Best, Gary [1] https://ci.

Re: Unable to recover from savepoint and checkpoint

2020-03-03 Thread Gary Yao
Hi Puneet, Can you describe how you validated that the state is not restored properly? Specifically, how did you introduce faults to the cluster? Best, Gary On Tue, Mar 3, 2020 at 11:08 AM Puneet Kinra < puneet.ki...@customercentria.com> wrote: > Sorry for the missed information > > On recovery

Re: Alink and Flink ML

2020-03-03 Thread Gary Yao
Hi Flavio, I am looping in Becket (cc'ed) who might be able to answer your question. Best, Gary On Tue, Mar 3, 2020 at 12:19 PM Flavio Pompermaier wrote: > Hi to all, > since Alink has been open sourced, is there any good reason to keep both > Flink ML and Alink? > From what I understood Alink

Re: java.util.concurrent.ExecutionException

2020-03-03 Thread Gary Yao
Hi, Can you post the complete stacktrace? Best, Gary On Tue, Mar 3, 2020 at 1:08 PM kant kodali wrote: > Hi All, > > I am just trying to read edges which has the following format in Kafka > > 1,2 > 1,3 > 1,5 > > using the Table API and then converting to DataStream of Edge Objects and > printi

Re: java.util.concurrent.ExecutionException

2020-03-03 Thread Gary Yao
ly(CaseStatements.scala:21) >> >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> >> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> >

Re: Failure detection and Heartbeats

2020-03-11 Thread Gary Yao
Hi Morgan, > I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic. It is probably best to look into the implementation (see my answers below). >

Re: Automatically Clearing Temporary Directories

2020-03-12 Thread Gary Yao
Hi David, > Would it be safe to automatically clear the temporary storage every time when a TaskManager is started? > (Note: the temporary volumes in use are dedicated to the TaskManager and not shared :-) Yes, it is safe in your case. Best, Gary On Tue, Mar 10, 2020 at 6:39 PM David Maddison w

Re: Testing RichAsyncFunction with TestHarness

2020-03-30 Thread Gary Yao
> > Additionally even though I add all necessary dependencies defiend in [1] I > cannot see ProcessFunctionTestHarnesses class. > That class was added in Flink 1.10 [1]. [1] https://github.com/apache/flink/blame/f765ad09ae2b2aa478c887b988e11e92a8b730bd/flink-streaming-java/src/test/java/org/apach

Re: Run several jobs in parallel in same EMR cluster?

2020-03-30 Thread Gary Yao
Can you try to set config option taskmanager.numberOfTaskSlots to 2? By default the TMs only offer one slot [1] independent from the number of CPU cores. Best, Gary [1] https://github.com/apache/flink/blob/da3082764117841d885f41c645961f8993a331a0/flink-core/src/main/java/org/apache/flink/configur

Re: Two questions about Async

2020-04-22 Thread Gary Yao
> Bytes Sent but Records Sent is always 0 Sounds like a bug. However, I am unable to reproduce this using the AsyncIOExample [1]. Can you provide a minimal working example? > Is there an Async Sink? Or do I just rewrite my Sink as an AsyncFunction followed by a dummy sink? You will have to impl

Re: Running 2 instances of LocalExecutionEnvironments with same consumer group.id

2020-04-22 Thread Gary Yao
Hi Suraj, This question has been asked before: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-kafka-consumers-don-t-honor-group-id-td21054.html Best, Gary On Wed, Apr 22, 2020 at 6:08 PM Suraj Puvvada wrote: > > Hello, > > I have two JVMs that run LocalExecution

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-12 Thread Gary Yao
Hi Nick, Are you able to upgrade to Flink 1.9? Beginning with Flink 1.9 you can use KafkaSerializationSchema to produce a ProducerRecord [1][2]. Best, Gary [1] https://issues.apache.org/jira/browse/FLINK-11693 [2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/api/java/org/apache/f

Re: Flink Metrics in kubernetes

2020-05-12 Thread Gary Yao
Hi Averell, If you are seeing the log message from [1] and Scheduled#report() is not called, the thread in the "Flink-MetricRegistry" thread pool might be blocked. You can use the jstack utility to see on which task the thread pool is blocked. Best, Gary [1] https://github.com/apache/flink/blob

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-12 Thread Gary Yao
to me so quickly. > > Best, > Nick > > On Tue, May 12, 2020 at 3:37 AM Gary Yao wrote: >> >> Hi Nick, >> >> Are you able to upgrade to Flink 1.9? Beginning with Flink 1.9 you can use >> KafkaSerializationSchema to produce a ProducerRecord [1][2]. >>

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-14 Thread Gary Yao
ctory in flink the artifact is flink-core-1.7.2.jar . I need to pack flink-core-1.9.0.jar from application and use child first class loading to use newer version of flink-core. If I have it as provided scope, sure it will work in IntelliJ but not outside of it . > > Best, > Nick > >

Re: Question on Job Restart strategy

2020-05-26 Thread Gary Yao
Hi Bhaskar, > Why the reset counter is not zero after streaming job restart is successful? The short answer is that the fixed delay restart strategy is not implemented like that (see [1] if you are using Flink 1.10 or above). There are also other systems that behave similarly, e.g., Apache Hadoop
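
A toy model of the behavior described in this answer, assuming the strategy simply counts failures over the lifetime of the job and never resets the counter after a successful restart. The class and function names are hypothetical and are not Flink's API; see the implementation referenced as [1] for the real logic.

```python
class FixedDelayRestartModel:
    """Toy model of a fixed-delay restart strategy (hypothetical, not Flink's API)."""

    def __init__(self, max_attempts: int):
        self.max_attempts = max_attempts
        self.failures = 0  # counted over the lifetime of the job, never reset

    def notify_failure(self) -> None:
        # Called on every failure, whether or not earlier restarts succeeded.
        self.failures += 1

    def can_restart(self) -> bool:
        return self.failures <= self.max_attempts


def simulate(max_attempts: int, number_of_failures: int):
    """Return the restart decision taken after each failure."""
    model = FixedDelayRestartModel(max_attempts)
    decisions = []
    for _ in range(number_of_failures):
        model.notify_failure()
        decisions.append(model.can_restart())
    return decisions
```

With two allowed attempts, the third failure is refused in this model even if the job ran fine between failures.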

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-14 Thread Gary Yao
Congratulations Andrey, well deserved! Best, Gary On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > Congratulations Andrey! > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong wrote: > >> Congratulations Andrey! >> >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok wrote: >> >> > Congratulations Andre

Re: How do I start a Flink application on my Flink+Mesos cluster?

2019-09-11 Thread Gary Yao
Hi Felipe, I am glad that you were able to fix the problem yourself. > But I suppose that Mesos will allocate Slots and Task Managers dynamically. > Is that right? Yes, that is the case since Flink 1.5 [1]. > Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal or > less the

Re: error in Elasticsearch

2019-09-11 Thread Gary Yao
Hi, You are not supposed to change that part of the exercise code. You have to pass the path to the input file as a program argument (e.g., --input /path/to/file). See [1] and [2] on how to configure program arguments in IntelliJ. Best, Gary [1] https://www.jetbrains.com/help/idea/run-debug-conf

Re: error in Elasticsearch

2019-09-11 Thread Gary Yao
Program arguments should be set to "--input /home/alaa/nycTaxiRides.gz" (without the quotes). On Wed, Sep 11, 2019 at 10:39 AM alaa wrote: > Hallo > > I put arguments but the same error appear .. what should i do ? > > > < > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/fi

Re: Multiple Job Managers in Flink HA Setup

2019-09-25 Thread Gary Yao
Hi Steve, > I also tried attaching a shared NFS folder between the two machines and > tried to set their web.tmpdir property to the shared folder, however it > appears that each job manager creates a seperate job inside that directory. You can create a fixed upload directory via the config option

[ANNOUNCE] Progress of Apache Flink 1.10 #2

2019-11-01 Thread Gary Yao
Hi community, Because we have approximately one month of development time left until the targeted Flink 1.10 feature freeze, we thought now would be a good time to give another progress update. Below we have included a list of the ongoing efforts that have made progress since our last release prog

Re: Localenvironment jobcluster ha High availability

2019-12-11 Thread Gary Yao
Hi Eric, What you say should be possible because your job will be executed in a MiniCluster [1] which has HA support. I have not tried this out myself, and I am not aware that people are doing this in production. However, there are integration tests that use MiniCluster + ZooKeeper [2]. Best, Gar

Re: REST rescale with Flink on YARN

2020-01-28 Thread Gary Yao
Hi, You can use yarn application -status to find the host and port that the server is listening on (AM host & RPC Port). If you need to access that information programmatically, take a look at the YarnClient [1]. Best, Gary [1] https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/
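
The lookup could be scripted roughly as follows. This is a minimal sketch, assuming the report printed by yarn application -status contains lines of the form "AM Host : ..." and "RPC Port : ..."; the helper name parse_am_endpoint and the sample report are hypothetical, so compare against the output of your Hadoop version first.

```python
import re


def parse_am_endpoint(report: str):
    """Extract (host, port) from the text of a `yarn application -status` report.

    Assumes lines of the form "AM Host : <host>" and "RPC Port : <port>".
    Returns None if either field is missing.
    """
    host = re.search(r"^\s*AM Host\s*:\s*(\S+)\s*$", report, re.MULTILINE)
    port = re.search(r"^\s*RPC Port\s*:\s*(\d+)\s*$", report, re.MULTILINE)
    if host is None or port is None:
        return None
    return host.group(1), int(port.group(1))


# Illustrative report excerpt with made-up values.
SAMPLE_REPORT = """Application Report :
\tApplication-Id : application_0000000000000_0001
\tRPC Port : 34567
\tAM Host : worker-1.example.com
"""
```

The report text itself can be captured with subprocess.run(["yarn", "application", "-status", app_id], capture_output=True, text=True) on a machine that has the YARN client installed.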

Re: yarn session: one JVM per task

2020-02-25 Thread Gary Yao
Hi David, Before with the both n and -s it was not the case. > What do you mean by before? At least in 1.8 "-s" could be used to specify the number of slots per TM. how can I be sure that my Sink that uses this lib is in one JVM ? > Is it enough that no other parallel instance of your sink run

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

2018-06-26 Thread Gary Yao
Hi Vishal, Could it be that you are not using the 1.5.0 client? The stacktrace you posted does not reference valid lines of code in the release-1.5.0-rc6 tag. If you have a HA setup, the host and port of the leading JM will be looked up from ZooKeeper before job submission. Therefore, the flink-c

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

2018-06-26 Thread Gary Yao
ea07 >> >> 2018-06-26 13:32:01 INFO ZooKeeper:684 - Session: 0x35add547801ea07 >> closed >> >> 2018-06-26 13:32:01 INFO ClientCnxn:519 - EventThread shut down for >> session: 0x35add547801ea07 >> >> 2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was t

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

2018-06-28 Thread Gary Yao
Hi Vishal, The znode /flink_test/da_15/leader/rest_server_lock should exist as long as your Flink 1.5 cluster is running. In 1.4 this znode will not be created. Are you sure that the znode does not exist? Unfortunately you only attached the output of "ls /flink_test/da_15". Can you share the comp

Re: CoreOptions.TMP_DIRS bug

2018-07-04 Thread Gary Yao
Hi Oleksandr, I think your conclusions are correct. Thank you for looking into it. You can open a JIRA ticket describing the issue. Best, Gary On Wed, Jul 4, 2018 at 9:30 AM, Oleksandr Nitavskyi wrote: > Hello guys, > > We have discovered minor issue with Flink 1.5 on YARN particularly which >

Re: Flink resource manager unable to connect to mesos after restart

2018-07-18 Thread Gary Yao
Hi, If you are able to re-produce this reliably, can you post the jobmanager logs? Best, Gary On Wed, Jul 18, 2018 at 10:33 AM, Renjie Liu wrote: > Hi, all: > > I'm testing flink 1.5.0 and find that flink mesos resource manager unable > to connect to mesos after restart. Have you seen this hap

Re: Cannot configure akka.ask.timeout

2018-07-19 Thread Gary Yao
Hi Lukas, It seems that when using MiniCluster, the config key akka.ask.timeout is not respected. Instead, a hardcoded timeout of 10s is used [1]. Since all communication is local, it would be interesting to see in detail what your job looks like that it exceeds the timeout. The key akka.ask.ti

Re: 1.5.1

2018-07-22 Thread Gary Yao
Hi, The first exception should only be logged on info level. It's expected to see this exception when a TaskManager unregisters from the ResourceManager. Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1]. The default timeout is 50s, which should be a generous value. It

Re: Flink resource manager unable to connect to mesos after restart

2018-07-26 Thread Gary Yao
just need to kill job manager and restart >> it. >> >> Attached is jobmanager's log, but I don't find anything valuable since it >> just keeps reporting unable to connect to mesos master. >> >> On Thu, Jul 19, 2018 at 4:55 AM Gary Yao wrote: >> >

Re: Delay in REST/UI readiness during JM recovery

2018-08-01 Thread Gary Yao
Hi Joey, If the other components (e.g., Dispatcher, ResourceManager) are able to finish the leader election in a timely manner, I currently do not see a reason why it should take the REST server 20 - 45 minutes. You can check the contents of znode /flink/.../leader/rest_server_lock to see if ther

Re: HA setting for per-job YARN session

2018-08-05 Thread Gary Yao
Hi, The per-job YARN mode also supports HA. Even if the YARN ResourceManager only brings up one JobManager at a time, you require ZooKeeper for storing metadata, such as which jobs are supposed to be running. In fact, the distributed components will still take part in a leader election with only o

Re: Flink Forwards 2018 videos

2018-08-06 Thread Gary Yao
Hi Elias, If you are using an ad blocker, can you turn it off and try it again? I have forwarded your email internally at data Artisans, and we are working on a proper solution. Best, Gary On Sun, Aug 5, 2018 at 8:30 PM, Elias Levy wrote: > It appears the Flink Forwards 2018 videos are FUBAR.

Re: connection failed when running flink in a cluster

2018-08-06 Thread Gary Yao
Hi, nc exits after the first connection is closed. Are you re-running the nc command every time the job finishes? The stacktrace you copied does not indicate that a TaskManager cannot connect to the JobManager. I can only see that the SocketTextStreamFunction (from the SocketWindowWordCount job?)

Re: connection failed when running flink in a cluster

2018-08-06 Thread Gary Yao
nal dependencies that I didn`t >>> include in my deploy. Do you have any clue? >>> >>> I am following the original quickstart (https://ci.apache.org/ >>> projects/flink/flink-docs-master/quickstart/setup_quickstart.html) >>> >>> Kind Regards, >&

Re: Could not build the program from JAR file.

2018-08-06 Thread Gary Yao
Hi Florian, You write that Flink 1.4.2 works but what version is not working for you? Best, Gary On Tue, Aug 7, 2018 at 8:25 AM, Florian Simond wrote: > Hi all, > > > I'm trying to run the wordCount example on my YARN cluster and this is not > working.. I get the error message specified in ti

Re: Could not build the program from JAR file.

2018-08-07 Thread Gary Yao
(CliFrontend. > java:1101) > Caused by: java.io.FileNotFoundException: JAR file does not exist: -yn > at org.apache.flink.client.cli.CliFrontend.buildProgram( > CliFrontend.java:828) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend. > java:205) >

Re: Could not build the program from JAR file.

2018-08-07 Thread Gary Yao
4.2, that's why I missed that > part... > > > Do you have any pointer about the dynamic number of TaskManagers ? I'm > curious to know how it works. Is it possible to still fix it ? > > > Thank you, > > Florian > > > -

Re: Could not build the program from JAR file.

2018-08-07 Thread Gary Yao
everal jobs, it could be harder to make sure they do not > interfere with each other... > > > -- > *From:* Gary Yao > *Sent:* Tuesday, August 7, 2018 12:27 > > *To:* Florian Simond > *Cc:* vino yang; user@flink.apache.org > *Subject:* Re: Coul

Re: Could not build the program from JAR file.

2018-08-07 Thread Gary Yao
k longer to start (10 minutes) > > And completed on this line: > > 2018-08-07 14:31:11,852 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: > Custom Source -> Sink: Unnamed (1/1) (655509c673d8ae19aac195276ad2c3e6) > switched from DEPLOYING to RUNNING. > > > > Thanks a lot for yo

Re: Could not build the program from JAR file.

2018-08-08 Thread Gary Yao
to be correct after all, I see " > 1a9b648" there too: https://github.com/apache/flink/releases > > > But I don't know why it's written version 0.1... > -- > *From:* Gary Yao > *Sent:* Tuesday, August 7, 2018 19:30 > *To:* Florian

Re: Flink REST api for cancel with savepoint on yarn

2018-08-14 Thread Gary Yao
Hi Vipul, We are aware of YARN-2031. There are some ideas on how to work around it, which are tracked here: https://issues.apache.org/jira/browse/FLINK-9478 At the moment you have the following options: 1. Find out the master's address from ZooKeeper [1] and issue the HTTP request against t

Re: 1.5.1

2018-08-14 Thread Gary Yao
g them equitably. >> TMs selectively are under more stress than in a pure RR distribution I >> think. We may have to lower the slots on each TM to define a good upper >> bound. You are correct 50s is a pretty generous value. >> >> On Sun, Jul 22, 2018 at 6:55 AM,

Re: 1.5.1

2018-08-15 Thread Gary Yao
his was repeated until all restart attempts had been used (we've set it > to 50), and then the job finally failed. > > I would like to know also how to prevent Flink from going into such bad > state. At least it should exit immediately instead of retrying in such a > situation. A

Re: Flink not rolling log files

2018-08-17 Thread Gary Yao
Hello Navneet Kumar Pandey, org.apache.log4j.rolling.RollingFileAppender is part of Apache Extras Companion for Apache log4j [1]. Is that library in your classpath? Are there hints in taskmanager.err? Can you run: cat /usr/lib/flink/conf/log4j.properties on the EMR master node and show the

Re: Flink not rolling log files

2018-08-18 Thread Gary Yao
ink-queryable-state-runtime_2.11-1.4.2.jar log4j-1.2.17.jar > flink-dist_2.11-1.4.2.jar flink-python_2.11-1.4.2.jar > flink-shaded-hadoop2-uber-1.4.2.jar slf4j-log4j12-1.7.7.jar > > On Fri, Aug 17, 2018 at 5:15 PM, Gary Yao wrote: > >> Hello Navneet Kumar P

Re: getting error while running flink job on yarn cluster

2018-08-21 Thread Gary Yao
Hi Yubraj Singh, Can you try submitting with HADOOP_CLASSPATH=`hadoop classpath` set? [1] For example: HADOOP_CLASSPATH=`hadoop classpath` bin/flink run [...] Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html#configuring-flink-with-hadoop-classpat

Re: akka.ask.timeout setting not honored

2018-08-31 Thread Gary Yao
Hi Greg, Can you describe the steps to reproduce the problem, or can you attach the full jobmanager logs? Because JobExecutionResultHandler appears in your log, I assume that you are starting a job cluster on YARN. Without seeing the complete logs, I cannot be sure what exactly happens. For now, y

Re: akka.ask.timeout setting not honored

2018-08-31 Thread Gary Yao
setting >> and reply again if the problem comes up again. Thanks again for your quick >> response! >> >> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao wrote: >> >>> Hi Greg, >>> >>> Can you describe the steps to reproduce the problem, or can yo

Re: Flink on Yarn, restart job will not destroy original task manager

2018-09-03 Thread Gary Yao
Hi James, What version of Flink are you running? In 1.5.0, tasks can spread out due to changes that were introduced to support "local recovery". There is a mitigation in 1.5.1 that prevents task spread out but local recovery must be disabled [2]. Best, Gary [1] https://issues.apache.org/jira/bro

Re: Flink on Yarn, restart job will not destroy original task manager

2018-09-03 Thread Gary Yao
In 1.5.0, I’ve not enable local > recovery. > > > > So whether I need manual disable local recovery via flink.conf? > > > > Regards > > > > James > > > > *From: *"James (Jian Wu) [FDS Data Platform]" > *Date: *Monday, September 3, 20

Re: AskTimeoutException when canceling job with savepoint on flink 1.6.0

2018-09-05 Thread Gary Yao
Hi Jelmer, I saw that you have already found the JIRA issue tracking this problem [1] but I will still answer on the mailing list for transparency. The timeout for "cancel with savepoint" should be RpcUtils.INF_TIMEOUT. Unfortunately Flink is currently not respecting this timeout. A pull request

Re: flink list and flink run commands timeout

2018-09-05 Thread Gary Yao
Hi Jason, From the stacktrace it seems that you are using the 1.4.0 client to list jobs on a 1.5.x cluster. This will not work. You have to use the 1.5.x client. Best, Gary On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania wrote: > Hello, > > Thanks for the response. I had already tried setting the

Re: Setting Flink Monitoring API Port on YARN Cluster

2018-09-06 Thread Gary Yao
Hi Austin, The config options rest.port, jobmanager.web.port, etc. are intentionally ignored on YARN. The port should be chosen randomly to avoid conflicts with other containers [1]. I do not see a way how you can set a fixed port at the moment but there is a related ticket for that [2]. The Flink

Re: Cancel flink job occur exception

2018-09-08 Thread Gary Yao
Hi all, The question is being handled on the dev mailing list [1]. Best, Gary [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮) wrote: > Hi all, > I submit a flink job through yar

Re: Question about akka configuration for FLIP-6

2018-09-09 Thread Gary Yao
Hi Tony, You are right that with FLIP-6 Akka is abstracted away. If you want custom heartbeat settings, you can configure the options below [1]: heartbeat.interval heartbeat.timeout The config option taskmanager.exit-on-fatal-akka-error is also not relevant anymore. The closest I can think

Re: Question about akka configuration for FLIP-6

2018-09-09 Thread Gary Yao
nation-via-akka On Mon, Sep 10, 2018 at 7:38 AM, Gary Yao wrote: > Hi Tony, > > You are right that with FLIP-6 Akka is abstracted away. If you want custom > heartbeat settings, you can configure the options below [1]: > > heartbeat.interval > heartbeat.timeout > >

Re: Question about akka configuration for FLIP-6

2018-09-09 Thread Gary Yao
LIP-6 mode. > > 1. akka.transport.heartbeat.interval > 2. akka.transport.heartbeat.pause > > It seems they are different from HeartbeatServices and possibly still > valid. > > Best, > tison. > > > On Mon, Sep 10, 2018 at 1:50 PM, Gary Yao wrote: > >> I should add that

Re: JAXB Classloading errors when using PMML Library (AWS EMR Flink 1.4.2)

2018-09-11 Thread Gary Yao
Hi, Do you also have pmml-model-moxy as a dependency in your job? Using mvn dependency:tree, I do not see that pmml-evaluator has a compile time dependency on jaxb-api. The jaxb-api dependency actually comes from pmml- model-moxy. The exclusion should be therefore defined on pmml-model-moxy. You

Re: How to get Cluster metrics in FLIP-6 mode

2018-09-11 Thread Gary Yao
Hi Tony, You are right that these metrics are missing. There is already a ticket for that [1]. At the moment you can obtain this information from the REST API (/overview) [2]. Since FLIP-6, the JM is no longer responsible for these metrics but for backwards compatibility we can leave them in the

Re: Create a file in parquet format

2018-09-11 Thread Gary Yao
Hi Jose, You can find an example here: https://github.com/apache/flink/blob/1a94c2094b8045a717a92e232f9891b23120e0f2/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java#L58 Best, Gary On Tue, Sep 11, 2018 at 11:59 AM, jose farfan

Re: Utilising EMR's master node

2018-09-17 Thread Gary Yao
Hi Averell, According to the AWS documentation [1], the master node only runs the YARN ResourceManager and the HDFS NameNode. Containers can only be launched on nodes that are running the YARN NodeManager [2]. Therefore, if you want TMs or JMs to be launched on your EMR master node, you have to st

Re: Utilising EMR's master node

2018-09-18 Thread Gary Yao
Hi Averell, Flink compares the number of user selected vcores to the vcores configured in the yarn-site.xml of the submitting node, i.e., in your case the master node. If there are not enough configured vcores, the client throws an exception. This behavior is not ideal and I found an old JIRA tick

Re: "405 HTTP method POST is not supported by this URL" is returned when POST a REST url on Flink on yarn

2018-09-25 Thread Gary Yao
Hi Henry, The URL below looks like the one from the YARN proxy (note that "proxy" appears in the URL): http://storage4.test.lan:8089/proxy/application_1532081306755_0124/jobs/269608367e0f548f30d98aa4efa2211e/savepoints You can use yarn application -status to find the host and port of the

Re: Utilising EMR's master node

2018-09-26 Thread Gary Yao
Hi Averell, There is no general answer to your question. If you are running more TMs, you get better isolation between different Flink jobs because one TM is backed by one JVM [1]. However, every TM brings additional overhead (heartbeating, running more threads, etc.) [1]. It also depends on the

Re: Flink 1.5.2 - excessive ammount of container requests, Received new/Returning excess container "flood"

2018-10-06 Thread Gary Yao
Hi Borys, To debug how many containers Flink is requesting, you can look out for the log statement below [1]: Requesting new TaskExecutor container with resources [...] If you need help debugging, can you attach the full JM logs (preferably on DEBUG level)? Would it be possible for you to te

Re: Utilising EMR's master node

2018-10-06 Thread Gary Yao
Hi Averell, It is up to the YARN scheduler on which hosts the containers are started. What Flink version are you using? I assume you are using 1.4 or earlier because you are specifying a fixed number of TMs. If you launch Flink with -yn 2, you should be only seeing 2 TMs in total (not 4). Are you

Re: Job manager logs for previous YARN attempts

2018-10-10 Thread Gary Yao
Hi Pawel, As far as I know, the application attempt is incremented if the application master fails and a new one is brought up. Therefore, what you are seeing should not happen. I have just deployed on AWS EMR 5.17.0 (Hadoop 2.8.4) and killed the container running the application master – the cont

Re: Flink 1.5.2 - excessive ammount of container requests, Received new/Returning excess container "flood"

2018-10-10 Thread Gary Yao
Hi Borys, I remember that another user reported a similar issue recently [1] – attached to the ticket you can find his log file. If I recall correctly, we concluded that YARN returned the containers very quickly. At the time, Flink's debug level logs were inconclusive because we did not log the re

Re: User jar is present in the flink job manager's class path

2018-10-11 Thread Gary Yao
Hi, Could it be that you are submitting the job in attached mode, i.e., without -d parameter? In the "job cluster attached mode", we actually start a Flink session cluster (and stop it again from the CLI) [1]. Therefore, in attached mode, the config option "yarn.per-job-cluster.include-user-jar" i

Re: Never gets into ProcessWindowFunction.process()

2018-11-09 Thread Gary Yao
Hi, If the job is actually running and consuming from Kinesis, the log you posted is unrelated to your problem. To understand why the process function is not invoked, we would need to see more of your code, or you would need to provide an executable example. The log only shows that all offered slo

Re: Never gets into ProcessWindowFunction.process()

2018-11-09 Thread Gary Yao
Hi, You are using event time but are you assigning watermarks [1]? I do not see it in the code. Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html#event-time-for-consumed-records On Fri, Nov 9, 2018 at 6:58 PM Vijay Balakrishnan wrote: > Hi, > An
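
Why missing watermarks keep event-time windows from firing can be illustrated with a toy model. This is plain Python and not Flink's API: the function name and parameters are hypothetical, the watermark is assumed to trail the largest timestamp seen by a fixed bound, and a tumbling window is assumed to fire once the watermark reaches its end.

```python
def fired_windows(timestamps, window_size, max_out_of_orderness,
                  emit_watermarks=True):
    """Return the start times of tumbling windows that fire, in firing order.

    Toy model only: a window [start, start + window_size) fires once the
    watermark reaches its end. Without watermarks, nothing ever fires.
    """
    watermark = float("-inf")
    max_ts = float("-inf")
    pending = set()  # window starts that hold data but have not fired yet
    fired = []
    for ts in timestamps:
        start = ts - ts % window_size
        if start + window_size > watermark:  # drop elements of closed windows
            pending.add(start)
        if emit_watermarks:
            max_ts = max(max_ts, ts)
            watermark = max_ts - max_out_of_orderness
        for candidate in sorted(pending):
            if watermark >= candidate + window_size:
                fired.append(candidate)
                pending.remove(candidate)
    return fired
```

In this model, the same input produces results with watermarks and no results without them, which matches the symptom of a window function that is never invoked.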

Re: Per job cluster doesn't shut down after the job is canceled

2018-11-09 Thread Gary Yao
Hi Paul, Can you share the complete logs, or at least the logs after invoking the cancel command? If you want to debug it yourself, check if MiniDispatcher#jobReachedGloballyTerminalState [1] is invoked, and see how the jobTerminationFuture is used. Best, Gary [1] https://github.com/apache/flin

Re: Flink Web UI does not show specific exception messages when job submission fails

2018-11-09 Thread Gary Yao
Hi, We only propagate the exception message but not the complete stacktrace [1]. Can you create a ticket for that? Best, Gary [1] https://github.com/apache/flink/blob/091cff3299aed4bb143619324f6ec8165348d3ae/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractRestHandler.ja

Re: Any examples on invoke the Flink REST API post method ?

2018-11-12 Thread Gary Yao
Hi Henry, What you see in the API documentation is a schema definition and not a sample request. The request body should be: { "target-directory": "hdfs:///flinkDsl", "cancel-job": false } Let me know if that helps. Best, Gary On Mon, Nov 12, 2018 at 7:15 AM vino yang
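
A minimal sketch of sending that body from Python with only the standard library. The JobManager address and the job ID below are hypothetical placeholders, and the helper name build_savepoint_request is made up for illustration; the path follows the /jobs/<job id>/savepoints form.

```python
import json
import urllib.request


def build_savepoint_request(base_url: str, job_id: str,
                            target_directory: str, cancel_job: bool = False):
    """Build (but do not send) a POST request that triggers a savepoint."""
    body = {"target-directory": target_directory, "cancel-job": cancel_job}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs/{job_id}/savepoints",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_savepoint_request(
    "http://jobmanager.example.com:8081",   # hypothetical REST address
    "00000000000000000000000000000000",     # hypothetical job ID
    "hdfs:///flinkDsl",
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the cluster.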

Re: Per job cluster doesn't shut down after the job is canceled

2018-11-20 Thread Gary Yao
commands. > > And also thank you for the pointer, I’ll keep tracking this problem. > > Best, > Paul Lam > > > On Nov 10, 2018, at 02:10, Gary Yao wrote: > > Hi Paul, > > Can you share the complete logs, or at least the logs after invoking the > cancel command? >

Re: Flink issue while setting up in Kubernetes

2018-12-10 Thread Gary Yao
Hi Abhi Thakur, We need more information to help you. What docker images are you using? Can you share the kubernetes resource definitions? Can you share the complete logs of the JM and TMs? Did you follow the steps outlined in the Flink documentation [1]? Best, Gary [1] https://ci.apache.org/pro

Re: How to shut down Flink Web Dashboard in detached Yarn session?

2018-12-31 Thread Gary Yao
Hi, You can use the YARN client to list all applications on your YARN cluster: yarn application -list If this does not show any running applications, the Flink cluster must have somehow terminated. If you have YARN's log aggregation enabled, you should be able to view the Flink logs by runni

Re: [Flink 1.7.0] always got 10s ask timeout exception when submitting job with checkpoint via REST

2019-01-10 Thread Gary Yao
Hi all, I think increasing the default value of the config option web.timeout [1] is what you are looking for. Best, Gary [1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/RestHandlerConfiguration.java#L76 [2] https://github.com/apa

Re: Building Flink from source according to vendor-specific version but causes protobuf conflict

2019-01-11 Thread Gary Yao
Hi Wei, Did you build Flink with maven 3.2.5 as recommended in the documentation [1]? Also, did you use the -Pvendor-repos flag to add the cloudera repository when building? Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink [2] https://

Re: Get the savepointPath of a particular savepoint

2019-01-14 Thread Gary Yao
Hi, The API still returns the location of a completed savepoint. See the example in the Javadoc [1]. Best, Gary [1] https://github.com/apache/flink/blob/1325599153b162fc85679589cab0c2691bf398f1/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/savepoints/SavepointHandlers.jav

Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
Hi Piotr, What was the version you were using before 1.7.1? How do you deploy your cluster, e.g., YARN, standalone? Can you attach full TM and JM logs? Best, Gary On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek wrote: > Hello, > we have scenario with running Data Processing jobs that generate

Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
ld you like to have log entries with DEBUG enabled or > INFO would be enough? > > Thanks, > Piotr > > On Fri, Jan 18, 2019 at 15:14, Gary Yao wrote: > >> Hi Piotr, >> >> What was the version you were using before 1.7.1? >> How do you deploy your cluster, e.g.,

Re: ElasticSearch RestClient throws NoSuchMethodError due to shade mechanism

2019-01-18 Thread Gary Yao
Hi Henry, Can you share your pom.xml and the full stacktrace with us? It is expected behavior that org.elasticsearch.client.RestClientBuilder is not shaded. That class comes from the elasticsearch Java client, and we only shade its transitive dependencies. Could it be that you have a dependency in

Re: No resource available error while testing HA

2019-01-23 Thread Gary Yao
Hi Averell, What Flink version are you using? Can you attach the full logs from JM and TMs? Since Flink 1.5, the -n parameter (number of taskmanagers) should be omitted unless you are in legacy mode [1]. > As per that screenshot, it looks like there are 2 tasks manager still > running (one on eac

Re: No resource available error while testing HA

2019-01-24 Thread Gary Yao
Hi Averell, > Then I have another question: when JM cannot start/connect to the JM on .88, > why didn't it try on .82 where resource are still available? When you are deploying on YARN, the TM container placement is decided by the YARN scheduler and not by Flink. Without seeing the complete logs,

Re: No resource available error while testing HA

2019-01-29 Thread Gary Yao
Hi Averell, > Is there any way to avoid this? As if I run this as an AWS EMR job, the job > would be considered failed, while it is actually be restored automatically by > YARN after 10 minutes). You are writing that it takes YARN 10 minutes to restart the application master (AM). However, in my

Re: Issue setting up Flink in Kubernetes

2019-01-29 Thread Gary Yao
Hi Tim, There is an end-to-end test in the Flink repository that starts a job cluster in Kubernetes (minikube) [1]. If that does not help you, can you answer the questions below? What docker images are you using? Can you share the kubernetes resource definitions? Can you share the complete logs o

Re: No resource available error while testing HA

2019-02-06 Thread Gary Yao
Hi Averell, That log file does not look complete. I do not see any INFO level log messages such as [1]. Best, Gary [1] https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L544 On Fri, Feb 1, 2019 a

Re: Flink Standalone cluster - logging problem

2019-02-11 Thread Gary Yao
Hi, Can you define what you mean by "job logs"? For code that is run on the cluster, i.e., JM or TM, you should add your config to log4j.properties. The log4j-cli.properties file is only used by the Flink CLI process. Best, Gary On Mon, Feb 11, 2019 at 7:39 AM simpleusr wrote: > Hi Chesnay, >

Re: No resource available error while testing HA

2019-02-11 Thread Gary Yao
Hi Averell, Logback has this feature [1] but it is not enabled out of the box. You will have to enable the JMX agent by setting the com.sun.management.jmxremote system property [2][3]. I have not tried this out, though. Best, Gary [1] https://logback.qos.ch/manual/jmxConfig.html [2] https://docs.or

Re: Flink Standalone cluster - logging problem

2019-02-11 Thread Gary Yao
Hi, Are you logging from your own operator implementations, and you expect these log messages to end up in a file prefixed with XYZ-? If that is the case, modifying log4j-cli.properties will not be expedient as I wrote earlier. You should modify the log4j.properties on all hosts that are running

Re: Flink 1.6 Yarn Session behavior

2019-02-14 Thread Gary Yao
Hi Jins George, This has been asked before [1]. The bottom line is that you currently cannot pre-allocate TMs and distribute your tasks evenly. You might be able to achieve a better distribution across hosts by configuring fewer slots in your TMs. Best, Gary [1] http://apache-flink-user-mailing-

Re: No resource available error while testing HA

2019-02-14 Thread Gary Yao
Hi Averell, The TM containers fetch the Flink binaries and config files from HDFS (or another DFS if configured) [1]. I think you should be able to change the log level by patching the logback configuration in HDFS, and kill all Flink containers on all hosts. If you are running an HA setup, your c

Re: Flink 1.6 Yarn Session behavior

2019-02-17 Thread Gary Yao
machine(8 core), I have 4 nodes, > that will end up 28 taskmanagers and 1 job manager. I was wondering if this > can bring additional burden on jobmanager? Is it recommended? > > Thanks, > > Jins George > On 2/14/19 8:49 AM, Gary Yao wrote: > > Hi Jins George, > >

Re: Each yarn container only use 1 vcore even if taskmanager.numberOfTaskSlots is set

2019-02-17 Thread Gary Yao
Hi Henry, If I understand you correctly, you want YARN to allocate 4 vcores per TM container. You can achieve this by enabling the FairScheduler in YARN [1][2]. Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#yarn-containers-vcores [2] https://hadoop.ap

Re: Submitting job to Flink on yarn timesout on flip-6 1.5.x

2019-02-21 Thread Gary Yao
Hi, Beginning with Flink 1.7, you cannot use the legacy mode anymore [1][2]. I am currently working on removing references to the legacy mode in the documentation [3]. Is there any reason you cannot use the "new mode"? Best, Gary [1] https://flink.apache.org/news/2018/11/30/release-1.7.0.html [
