Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

2018-06-26 Thread jelmer
3659443bfR163> should be removed in notifyOfRemovedMetric Can you confirm this ? --Jelmer On Fri, 15 Jun 2018 at 18:01, Chesnay Schepler wrote: > I remember that another user reported something similar, but he wasn't > using the PrometheusReporter. see > http://apache-flink-user-m

Re: TaskIOMetricGroup metrics not unregistered in prometheus on job failure ?

2018-06-27 Thread jelmer
. > For histograms we will have to modify our HistogramSummaryProxy class to > allow removing individual histograms. > > I've filed FLINK-9665 <https://issues.apache.org/jira/browse/FLINK-9665>. > > On 26.06.2018 17:28, jelmer wrote: > > Hi Chesnay, sorry for the lat

AskTimeoutException when canceling job with savepoint on flink 1.6.0

2018-09-05 Thread jelmer
I am trying to upgrade a job from flink 1.4.2 to 1.6.0 When we do a deploy we cancel the job with a savepoint then deploy the new version of the job from that savepoint. Because our jobs tend to have a lot of state it often takes multiple minutes for our savepoints to complete. On flink 1.4.2 we

Flink 1.5 + resource elasticity resulting in overloaded workers

2018-12-19 Thread jelmer
Hi, We recently upgraded to flink 1.6 and seem to be suffering from the issue described in this email http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-1-5-job-distribution-over-cluster-nodes-tt23364.html Our workers have 8 slots and some workers are fully loaded and as a conse

Re: Flink 1.5 + resource elasticity resulting in overloaded workers

2018-12-21 Thread jelmer
his is not something you would want to do by default ? It's not going to be perfect but at least you don't end always end up in a better spot than you end up now in a standalone setup On Wed, 19 Dec 2018 at 20:26, jelmer wrote: > Hi, We recently upgraded to flink 1.6 and seem to be suf

queryable state and maintaining all time counts

2017-12-29 Thread jelmer
Hi, I've been going through various talks on flink's support for queryable state. Like this talk by Jamie Grier at 2016's Flink forward : https://www.youtube.com/watch?v=uuv-lnOrD0o I see how you can easily use this to produce time series data. Eg calculate the number of events per hour. But I

BucketingSink broken in flink 1.4.0 ?

2018-01-10 Thread jelmer
Hi I am trying to convert some jobs from flink 1.3.2 to flink 1.4.0 But i am running into the issue that the bucketing sink will always try and connect to hdfs://localhost:12345/ instead of the hfds url i have specified in the constructor If i look at the code at https://github.com/apache/flink/

Failed to serialize remote message [class org.apache.flink.runtime.messages.JobManagerMessages$JobFound

2018-01-16 Thread jelmer
HI, We recently upgraded our test environment to from flink 1.3.2 to flink 1.4.0. We are using a high availability setup on the job manager. And now often when I go to the job details in the web ui the call will timeout and the following error will pop up in the job manager log akka.remote.Mess

Re: Failed to serialize remote message [class org.apache.flink.runtime.messages.JobManagerMessages$JobFound

2018-01-16 Thread jelmer
t;> not implement the Serializable interface. Without HA, when browsing the >> web interface, this message is (probably) not serialized since it is >> only served to you via HTML. For HA, this may come from another >> JobManager than the Web interface you are browsing. >>

Starting a job that does not use checkpointing from a savepoint is broken ?

2018-01-18 Thread jelmer
I ran into a rather annoying issue today while upgrading a flink jobs from flink 1.3.2 to 1.4.0 This particular job does not use checkpointing not state. I followed the instructions at https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/upgrading.html First created a savepoint, upgr

Re: Starting a job that does not use checkpointing from a savepoint is broken ?

2018-01-18 Thread jelmer
jelmer 07:49 (0 minutes ago) to Eron Hey Eron, Thanks, you stated the issue better and more compact than I could I will not debate the wisdom of not using checkpoints but when migrating jobs you may not be aware if a job has checkpointing enabled, if you are not the author, and if you follow

How to make savepoints more robust in the face of refactorings ?

2018-01-28 Thread jelmer
Changing the class operators are nested in can break compatibility with existing savepoints. The following piece of code demonstrates this https://gist.github.com/jelmerk/e55abeb0876539975cd32ad0ced8141a If I change Operators in this file to Operators2 i will not be able to recover from a savepo

Re: How to make savepoints more robust in the face of refactorings ?

2018-01-29 Thread jelmer
> demonstrated code a anonymous-classed serializer is generated for some type. > From what I see, there shouldn’t be any anonymous-class serializers for > the code. Is the code you provided a “simplified” version of the actual > code in which you observed the restore error? > > Chee

Re: How to make savepoints more robust in the face of refactorings ?

2018-01-30 Thread jelmer
version of the actual > code in which you observed the restore error? > > Cheers, > Gordon > > > On 28 January 2018 at 6:00:32 PM, jelmer (jkupe...@gmail.com) wrote: > > Changing the class operators are nested in can break compatibility with > exis

Task manager not able to rejoin job manager after network hicup

2018-02-23 Thread jelmer
We've observed on our flink 1.4.0 setup that if for some reason the networking between the task manager and the job manager gets disrupted then the task manager is never able to reconnect. You'll end up with messages like this getting printed to the log repeatedly Trying to register at JobManager

Re: Task manager not able to rejoin job manager after network hicup

2018-02-23 Thread jelmer
wrote: > @Till Is this the expected behaviour or do you suspect something could be > going wrong? > > > On 23. Feb 2018, at 08:59, jelmer wrote: > > We've observed on our flink 1.4.0 setup that if for some reason the > networking between the task manager and the job mana

Re: Task manager not able to rejoin job manager after network hicup

2018-02-24 Thread jelmer
you can repro this > easily perhaps you can get to it faster. I will find the thread and resend. > > Thanks, > > -- Ashish > > On Fri, Feb 23, 2018 at 9:56 AM, jelmer > wrote: > We found out there's a taskmanager.exit-on-fatal-akka-error property that > will restart

Outputting the content of in flight session windows

2018-04-18 Thread jelmer
I defined a session window and I would like to write the contents of the window to storage before the window closes Initially I was doing this by setting a CountTrigger.of(1) on the session window. But this leads to very frequent writes. To remedy this i switched to a ContinuousProcessingTimeTrig

pre-initializing global window state

2018-05-07 Thread jelmer
Hi I am looking for some advice on how to solve the following problem I'd like to keep track of the all time last n events received for a user. An event on average takes up 500 bytes and here will be ten's of millions of users for which we need to keep this information. The list of events will be

Re: pre-initializing global window state

2018-05-07 Thread jelmer
log etc makes a lot of sense in many scenario's, but unfortunately this is not one of them. :-( Another alternative I thought of for this problem is to sidestep the window abstraction and fall back to a processing function with timers On 7 May 2018 at 17:31, Ken Krugler wrote: &g