Hey all,
I need to make sense of this behavior. Any help would be appreciated.
Here’s an example of a set of Flink checkpoint metrics I don’t understand.
This is the first operator in a job and as you can see the end-to-end time
for the checkpoint is long, but it’s not explained by either sync,
>
> Hope that helps you understand what is going on.
>
> Best,
> Stephan
>
>
> On Thu, Sep 12, 2019 at 1:25 AM Seth Wiesman wrote:
>
> > Great timing, I just debugged this on Monday. E2e time is checkpoint
> > coordinator to checkpoint coordinator, so it
Here's the second screenshot I forgot to include:
https://pasteboard.co/IxhNIhc.png
On Fri, Sep 13, 2019 at 4:34 PM Jamie Grier wrote:
> Alright, here's another case where this is very pronounced. Here's a link
> to a couple of screenshots showing the overall stats for a s
t case, right? How
could it take 29 minutes to consume this data in the sink?
Anyway, I'd appreciate any feedback or insights.
Thanks :)
-Jamie
On Fri, Sep 13, 2019 at 12:11 PM Jamie Grier wrote:
> Thanks Seth and Stephan,
>
> Yup, I had intended to upload an image. Here it
also still be
> alignment buffers, which need to be processed while the next checkpoint has
> already started.
>
> Cheers,
>
> Konstantin
>
>
>
> On Sat, Sep 14, 2019 at 1:35 AM Jamie Grier
> wrote:
>
> > Here's the second screenshot I forgot to
This is great, Gyula! A colleague here at Lyft has also done some work
around bootstrapping DataStream programs, and we've talked a bit about
doing this by running DataSet programs.
On Fri, Aug 17, 2018 at 3:28 AM, Gyula Fóra wrote:
> Hi All!
>
> I want to share with you a little project we
I'll add to what Thomas already said.. The larger issue driving this is
that when reading from a source with many parallel partitions, especially
when reading lots of historical data (or recovering from downtime and there
is a backlog to read), it's quite common for there to develop an event-time
Okay, so I think there is a lot of agreement here about (a) This is a real
issue for people, and (b) an ideal long-term approach to solving it.
As Aljoscha and Elias said a full solution to this would be to also
redesign the source interface such that individual partitions are exposed
in the API a
r thoughts?
On Wed, Oct 10, 2018 at 6:58 AM Jamie Grier wrote:
> Okay, so I think there is a lot of agreement here about (a) This is a real
> issue for people, and (b) an ideal long-term approach to solving it.
>
> As Aljoscha and Elias said a full solution to this would be to also
Here's a doc I started describing some changes we would like to make
starting with the Kinesis Source.. It describes a refactoring of that code
specifically and also hopefully a pattern and some reusable code we can use
in the other sources as well. The end goal would be best-effort event-time
syn
ogress as the payloads. The JobManager will accumulate all the
> > > reported progress and then give responses for different source tasks.
> > > We can define a protocol for indicating the fast source task to sleep
> > > for a specific time, for example. To do so
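For illustration, here is a purely hypothetical sketch of what the
report/response messages in such a protocol could look like; none of these
types exist in Flink, and all names are made up:

    import java.io.Serializable;

    // Hypothetical message a source subtask sends to the JobManager.
    final class ReportProgress implements Serializable {
        final int subtaskIndex;    // which source subtask is reporting
        final long localWatermark; // its current event-time progress

        ReportProgress(int subtaskIndex, long localWatermark) {
            this.subtaskIndex = subtaskIndex;
            this.localWatermark = localWatermark;
        }
    }

    // Hypothetical response: the JobManager aggregates all reports and tells
    // subtasks that are too far ahead how long to pause.
    final class ProgressResponse implements Serializable {
        final long globalMinWatermark; // min watermark across all source subtasks
        final long sleepMillis;        // > 0 asks this subtask to sleep

        ProgressResponse(long globalMinWatermark, long sleepMillis) {
            this.globalMinWatermark = globalMinWatermark;
            this.sleepMillis = sleepMillis;
        }
    }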
Thanks Aljoscha for getting this effort going!
There's been plenty of discussion here already and I'll add my big +1 to
making this interface very simple to implement for a new
Source/SplitReader. Writing a new production quality connector for Flink
is very difficult today and requires a lot of d
One unfortunate problem with the current back-pressure detection mechanism
is that it doesn't work well with all of our sources. The problem is that
some sources (Kinesis for sure) emit elements from threads Flink knows
nothing about and therefore those stack traces aren't sampled. The result
is
.. Maybe we should add a way to register those threads such that they are
also sampled. Thoughts?
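To make that concrete, a purely hypothetical registration hook could look
like the following; no such API exists in Flink today:

    // Hypothetical: connectors would register every thread that emits records
    // so the back-pressure sampler can include it in stack-trace sampling.
    public interface SourceThreadRegistry {
        void registerEmitterThread(Thread thread);
        void unregisterEmitterThread(Thread thread);
    }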
On Thu, Jan 3, 2019 at 10:25 AM Jamie Grier wrote:
> One unfortunate problem with the current back-pressure detection mechanism
> is that it doesn't work well with all of our so
+1 to try the bot solution and see how it goes.
On Fri, Jan 11, 2019 at 6:54 AM jincheng sun
wrote:
> +1 for the bot solution!
> and I think Timo's suggestion is very useful!
> Thanks,
> Jincheng
>
>
> On Fri, Jan 11, 2019 at 10:44 PM Timo Walther wrote:
>
> > Thanks for bringing up this discussion again. +1 for
I'm not sure if this is required. It's quite convenient to be able to just
grab a single tarball and you've got everything you need.
I just did this for the latest binary release and it was 273MB and took
about 25 seconds to download. Of course I know connection speeds vary
quite a bit but I don
I had the same reaction initially as some of the others on this thread --
which is "Use Kafka quotas".. I agree that in general a service should
protect itself with its own rate limiting rather than building it into
clients like the FlinkKafkaConsumer.
However, there are a few reasons we need to
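For reference, here is a minimal sketch of what client-side throttling can
look like, assuming Guava's RateLimiter and wrapping a DeserializationSchema;
the RateLimitedSchema name is made up and this is not the FlinkKafkaConsumer's
actual API:

    import com.google.common.util.concurrent.RateLimiter;
    import org.apache.flink.api.common.serialization.DeserializationSchema;
    import org.apache.flink.api.common.typeinfo.TypeInformation;

    import java.io.IOException;

    // Illustrative only: gate each record's bytes through a rate limiter so a
    // source subtask reads at most `bytesPerSecond` from Kafka.
    public class RateLimitedSchema<T> implements DeserializationSchema<T> {
        private final DeserializationSchema<T> inner;
        private final double bytesPerSecond;
        private transient RateLimiter limiter; // not serializable, create lazily

        public RateLimitedSchema(DeserializationSchema<T> inner, double bytesPerSecond) {
            this.inner = inner;
            this.bytesPerSecond = bytesPerSecond;
        }

        @Override
        public T deserialize(byte[] message) throws IOException {
            if (limiter == null) {
                limiter = RateLimiter.create(bytesPerSecond);
            }
            limiter.acquire(Math.max(1, message.length)); // blocks until permits free
            return inner.deserialize(message);
        }

        @Override
        public boolean isEndOfStream(T nextElement) {
            return inner.isEndOfStream(nextElement);
        }

        @Override
        public TypeInformation<T> getProducedType() {
            return inner.getProducedType();
        }
    }

Note that each parallel subtask gets its own limiter, so the effective
job-wide rate is parallelism times bytesPerSecond.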
This is awesome, Stephan! Thanks for doing this.
-Jamie
On Tue, Feb 26, 2019 at 9:29 AM Stephan Ewen wrote:
> Here is the pull request with a draft of the roadmap:
> https://github.com/apache/flink-web/pull/178
>
> Best,
> Stephan
>
> On Fri, Feb 22, 2019 at 5:18 AM Hequn Cheng wrote:
>
>> H
I think, if I understood this correctly, this design is going in the
wrong direction. The problem with Flink logging, when you are running
multiple jobs in the same TMs, is not just about separating out the
business-level logging into separate files. The Flink framework itself
logs many thing
We've run into an issue that limits the max parallelism of jobs we can run
and what it seems to boil down to is that the JobManager becomes
unresponsive while essentially spending all of its time discarding
checkpoints from S3. This results in sluggish UI, sporadic
AkkaAskTimeouts, heartbeat miss
urious why you get heartbeat misses and akka timeouts during deletes.
> Are some parts of the deletes happening synchronously in the actor thread?
>
> On Wed, Mar 6, 2019 at 3:40 PM Jamie Grier
> wrote:
>
> > We've run into an issue that limits the max parallelism of jobs w
nk.apache.org
> > Subject: Re: JobManager scale limitation - Slow S3 checkpoint deletes
> >
> > Nice!
> >
> > Perhaps for file systems without TTL/expiration support (AFAIK includes
> > HDFS), cleanup could be performed in the task managers?
> >
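As a sketch of the off-the-main-thread cleanup idea discussed above, assuming
the AWS SDK v1 client; the class and method names here are made up:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest;

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative only: batch checkpoint-file deletes and run them on a
    // dedicated pool so they never block the JobManager's main/actor thread.
    public class AsyncCheckpointDiscarder {
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        private final ExecutorService ioExecutor = Executors.newFixedThreadPool(4);

        public void discard(String bucket, List<String> keys) {
            ioExecutor.submit(() -> {
                // S3 allows up to 1000 keys per DeleteObjects request.
                for (int i = 0; i < keys.size(); i += 1000) {
                    List<String> batch =
                            keys.subList(i, Math.min(i + 1000, keys.size()));
                    s3.deleteObjects(new DeleteObjectsRequest(bucket)
                            .withKeys(batch.toArray(new String[0])));
                }
            });
        }
    }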
tion and compiling it.
> I’m new to this, but I’d like to make a small contribution. I will send
> another email to the list explaining how I’ll try to help.
>
> Thanks!!
>
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
> >
> > The intended effects would be:
> > - reduce friction in the PR process created by basic oversights such as
> > checkstyle violations or missing tests
> > - provide a helping hand for new contributors
> >
> > I tried to condense the suggestion on the mailing list to make it not too
> > long and intimidating but at the same time cover the most important
> > points.
> >
> > Looking forward to your input.
> > Best regards
> > martin
> >
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
ed to select JDK and point to local jdk install path?
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
Hey all,
I've noticed a few times now when trying to help users implement particular
things in the Flink API that it can be complicated to map what they know
they are trying to do onto higher-level Flink concepts such as windowing or
Connect/CoFlatMap/ValueState, etc.
At some point it just become
> > > we could teach this to the users first, but currently I think it's not
> > > easy enough to be a good starting point.
> > >
> > > The user needs to understand a lot about the system if they don't want to
> > > hurt other parts of the pipeline. For instance, w
this be done?
>
The typical way to do this is to consume that configuration as a stream and
hold the configuration internally in the state of a particular user
function.
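A minimal sketch of that pattern using the broadcast state API from newer
Flink releases (in older versions the same idea works with connect plus a
CoFlatMap); the Config and Event types here are hypothetical stand-ins:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class ConfigStreamExample {

        // Hypothetical stand-ins for a job's real config/event types.
        public static class Config implements java.io.Serializable { public String value; }
        public static class Event implements java.io.Serializable { public String payload; }

        public static DataStream<Event> withDynamicConfig(
                DataStream<Event> events, DataStream<Config> configUpdates) {

            final MapStateDescriptor<String, Config> descriptor =
                    new MapStateDescriptor<>("config", String.class, Config.class);

            // Broadcast every config update to all parallel instances.
            BroadcastStream<Config> broadcastConfig = configUpdates.broadcast(descriptor);

            return events
                    .connect(broadcastConfig)
                    .process(new BroadcastProcessFunction<Event, Config, Event>() {

                        @Override
                        public void processElement(
                                Event event, ReadOnlyContext ctx, Collector<Event> out)
                                throws Exception {
                            // Read the most recently broadcast configuration.
                            Config current = ctx.getBroadcastState(descriptor).get("current");
                            // ... apply `current` to `event` here ...
                            out.collect(event);
                        }

                        @Override
                        public void processBroadcastElement(
                                Config config, Context ctx, Collector<Event> out)
                                throws Exception {
                            // Keep only the latest configuration in broadcast state.
                            ctx.getBroadcastState(descriptor).put("current", config);
                        }
                    });
        }
    }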
>
> Thanks
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
. Please advise on a better approach to handle these kinds of
> scenarios and how other applications are handling it. Thanks.
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
ering the
> > callback, and the parameter would be passed to the onTimer method
> > - Allow users to pass custom callback functions when registering the
> > timers, but this would mean we have to support some sort of context for
> > accessing the state (like the window context b
here would
> be any potential scaling limitations as the processing capacity increases.
>
> Thanks
> Govind
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
has thought about the problem deeply
already, has use cases of their own they've run into or has ideas for a
solution to this problem.
Thanks for reading..
-Jamie
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
ata = windowedResult.getSideOutput();
> >
> > Right now, the result of window operations is a
> > SingleOutputStreamOperator, same as it is for all DataStream operations.
> > Making the result type more specific, i.e. a WindowedOperator, would
> > allow us to a
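For reference, the side-output API that eventually shipped looks roughly like
this for late window data (the stream element types here are just
illustrative):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    public class LateDataExample {

        public static void lateDataSideOutput(DataStream<String> events) {
            // Anonymous subclass so Flink can capture the tag's element type.
            final OutputTag<String> lateTag = new OutputTag<String>("late-data") {};

            SingleOutputStreamOperator<String> windowed = events
                    .keyBy(value -> value)
                    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                    .sideOutputLateData(lateTag) // route late elements to the tag
                    .reduce((a, b) -> b);        // any window function works here

            // Late elements come out as an ordinary secondary stream.
            DataStream<String> lateData = windowed.getSideOutput(lateTag);
            lateData.print();
        }
    }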
g which users have no control over nor definition of
> > an extra UDF which essentially filters out all mainOutputs and only lets
> > sideOutput pass (like filterFunction)
> >
> > Thanks,
> > Chen
> >
> > > On Feb 24, 2017, at 1:17 PM, Jamie Grier
> > wrote:
ime passes, i.e. 0, 10, 15, 20. For this we need to keep track of
> > > the watermarks of A and B separately. But after we have a correct
> > > watermark for the bucket, all we need to care about is the bucket
> > > watermarks. So somewhere (most probably at the source) we would
't like this and
> >>>>> complains about breaking binary backwards compatibility. +Robert Metzger
> >>>>> Do you have an idea what we could do there?
> >>>>>
> >>>>> On Tue, 28 Feb 2017 at 12:39 Ufuk Celebi wrote:
> >>>>>
> >>>>>> On Tue, Feb 28, 2017 at 11:38 AM, Aljoscha Krettek <
> >>>> aljos...@apache.org>
> >>>>>> wrote:
> >>>>>>> I see the ProcessFunction as a bit of the generalised future of
> >>>>>>> FlatMap, so to me it makes sense to only allow side outputs on the
> >>>>>>> ProcessFunction but I'm open for anything. If we decide for this
> >>>>>>> I'm happy with an additional method on Collector.
> >>>>>>
> >>>>>> I think it's best to restrict this to ProcessFunction after all (given
> >>>>>> that we allow it for non-keyed streams, etc.). ;-)
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
> > > > > resulted in two design documents. I tried to consolidate those two and
> > > > > also added a section about implementation plans. This is the resulting
> > > > > FLIP:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> > > > >
> > > > >
> > > > > In terms of semantics I tried to go with the minimal viable solution.
> > > > > The part that needs discussing is how we want to implement this. I
> > > > > outlined three possible implementation plans in the FLIP but what it
> > > > > boils down to is that we need to introduce some way of getting several
> > > > > inputs into an operator/task.
> > > > >
> > > > >
> > > > > Please have a look at the doc and let us know what you think.
> > > > >
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Aljoscha
> > > > >
> > > > >
> > > > >
> > > > > [1]
> > > > > https://lists.apache.org/thread.html/797df0ba066151b77c7951fd7d603a8afd7023920d0607a0c6337db3@1462181294@%3Cdev.flink.apache.org%3E
> > > > >
> > > >
> > >
>
--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
Sorry for coming very late to this thread. I have not contributed much to
Flink publicly for quite some time but I have been involved with Flink,
daily, for years now and I'm keenly interested in where we take Flink SQL
going forward.
Thanks for the proposal!! I think it's definitely a step in t
y
> last. I don't think we need yet another new concept for this. I think
> that will just add to users' confusion and learning curve which is already
> substantial with Flink. We need to make things easier rather than harder.
>
> Also, just to clarify, and sorry if my pre
e, etc. It would be much
appreciated!
-Jamie Grier
On 2022/10/28 06:06:49 Shengkai Fang wrote:
> Hi.
>
> > Is there a possibility for us to get engaged and at least introduce
> initial changes to support authentication/authorization?
>
> Yes. You can write a FLIP about the desig
Big +1 on trying to come up with a common framework for partition-based,
replayable sources. There is so much common code that has to be written to
make a connector correct, and Gordon's bullet points are exactly those
pieces -- and it's not just Kinesis and Kafka. It's also true for
readin
I know this is a very simplistic idea but...
In general the issue Eron is describing occurs whenever two (or more)
parallel partitions are assigned to the same Flink sub-task and there is
a large time delta between them. This problem exists, though, largely because
we are not making any decisions abo
Is the `flink-connector-filesystem` connector supposed to work with the
latest hadoop-free Flink releases, say along with the `flink-s3-fs-presto`
filesystem implementation?
-Jamie
ovide a BucketingSink that uses the Flink
> FileSystem and also works well with eventually consistent file systems.
>
> --
> Aljoscha
>
> > On 23. Feb 2018, at 06:31, Jamie Grier wrote:
> >
> > Is the `flink-connector-filesystem` connector supposed to work with the
>
on a Hadoop-free cluster. Maybe...
>
> > On 23. Feb 2018, at 13:32, Jamie Grier wrote:
> >
> > Thanks, Aljoscha :)
> >
> > So is it possible to continue to use the new "native" filesystems along
> > with the BucketingSink by including the Hadoop depende
I think we need to modify the way we write checkpoints to S3 for high-scale
jobs (those with many total tasks). The issue is that we are writing all
the checkpoint data under a common key prefix. This is the worst-case
scenario for S3 performance, since the key is used as a partition key.
In the
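For the record, Flink later addressed exactly this with entropy injection for
the S3 filesystems: a marker in the checkpoint path is replaced with random
characters for the data files (and stripped for the metadata file), which
spreads writes across key prefixes. Roughly, in flink-conf.yaml (the bucket
and path below are placeholders):

    s3.entropy.key: _entropy_
    s3.entropy.length: 4
    state.checkpoints.dir: s3://my-bucket/checkpoints/_entropy_/my-job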
Jamie Grier created FLINK-10154:
---
Summary: Make sure we always read at least one record in
KinesisConnector
Key: FLINK-10154
URL: https://issues.apache.org/jira/browse/FLINK-10154
Project: Flink
Jamie Grier created FLINK-10484:
---
Summary: New latency tracking metrics format causes metrics
cardinality explosion
Key: FLINK-10484
URL: https://issues.apache.org/jira/browse/FLINK-10484
Project
Jamie Grier created FLINK-10886:
---
Summary: Event time synchronization across sources
Key: FLINK-10886
URL: https://issues.apache.org/jira/browse/FLINK-10886
Project: Flink
Issue Type
Jamie Grier created FLINK-10887:
---
Summary: Add source watermarking tracking to the JobMaster
Key: FLINK-10887
URL: https://issues.apache.org/jira/browse/FLINK-10887
Project: Flink
Issue Type
Jamie Grier created FLINK-10888:
---
Summary: Expose new global watermark RPC to sources
Key: FLINK-10888
URL: https://issues.apache.org/jira/browse/FLINK-10888
Project: Flink
Issue Type: Sub
Jamie Grier created FLINK-11617:
---
Summary: CLONE - Handle AmazonKinesisException gracefully in
Kinesis Streaming Connector
Key: FLINK-11617
URL: https://issues.apache.org/jira/browse/FLINK-11617
Jamie Grier created FLINK-3617:
--
Summary: NPE from CaseClassSerializer when dealing with null
Option field
Key: FLINK-3617
URL: https://issues.apache.org/jira/browse/FLINK-3617
Project: Flink
Jamie Grier created FLINK-3627:
--
Summary: Task stuck on lock in StreamSource when cancelling
Key: FLINK-3627
URL: https://issues.apache.org/jira/browse/FLINK-3627
Project: Flink
Issue Type: Bug
Jamie Grier created FLINK-3679:
--
Summary: DeserializationSchema should handle zero or more outputs
for every input
Key: FLINK-3679
URL: https://issues.apache.org/jira/browse/FLINK-3679
Project: Flink
Jamie Grier created FLINK-3680:
--
Summary: Remove or improve (not set) text in the Job Plan UI
Key: FLINK-3680
URL: https://issues.apache.org/jira/browse/FLINK-3680
Project: Flink
Issue Type
Jamie Grier created FLINK-4391:
--
Summary: Provide support for asynchronous operations over streams
Key: FLINK-4391
URL: https://issues.apache.org/jira/browse/FLINK-4391
Project: Flink
Issue
Jamie Grier created FLINK-4947:
--
Summary: Make all configuration possible via flink-conf.yaml and
CLI.
Key: FLINK-4947
URL: https://issues.apache.org/jira/browse/FLINK-4947
Project: Flink
Jamie Grier created FLINK-4948:
--
Summary: Consider using checksums or similar to detect bad
checkpoints
Key: FLINK-4948
URL: https://issues.apache.org/jira/browse/FLINK-4948
Project: Flink
Jamie Grier created FLINK-4980:
--
Summary: Include example source code in Flink binary distribution
Key: FLINK-4980
URL: https://issues.apache.org/jira/browse/FLINK-4980
Project: Flink
Issue
Jamie Grier created FLINK-5634:
--
Summary: Flink should not always redirect stdout to a file.
Key: FLINK-5634
URL: https://issues.apache.org/jira/browse/FLINK-5634
Project: Flink
Issue Type: Bug
Jamie Grier created FLINK-5635:
--
Summary: Improve Docker tooling to make it easier to build images
and launch Flink via Docker tools
Key: FLINK-5635
URL: https://issues.apache.org/jira/browse/FLINK-5635
Jamie Grier created FLINK-6199:
--
Summary: Single outstanding Async I/O operation per key
Key: FLINK-6199
URL: https://issues.apache.org/jira/browse/FLINK-6199
Project: Flink
Issue Type
Jamie Grier created FLINK-9061:
--
Summary: S3 checkpoint data not partitioned well -- causes errors
and poor performance
Key: FLINK-9061
URL: https://issues.apache.org/jira/browse/FLINK-9061
Project