Re: [DISCUSS] FLIP-67: Global partitions lifecycle

2019-10-20 Thread Zhu Zhu
Thanks Chesnay for proposing this FLIP! And sorry for the late response on
it.
The FLIP overall looks good to me, except for one question.

- If a cluster partition does not exist in the RM, how can users tell
whether it has not been produced yet or has already been released?
Users/InteractiveQuery may need this information to decide whether to
wait or to re-execute the producer job.
One way I can think of is to also check the producer job's state -- an
unavailable partition of a finished job means the partition was released.
But as the cluster partition is reported to the RM via the TM heartbeat,
there can be a bad case where the job is finished but the partition has
not yet been updated in the RM.
One solution for this bad case might be for the TM to notify the RM
instantly when partitions are promoted, as a supplement to the TM
heartbeat. This would also shorten the time a consumer job waits for a
cluster partition to become available, especially for a sequence of
short-lived jobs. It would, however, introduce a JM dependency on the RM
when a job finishes, which is unwanted.


Thanks,
Zhu Zhu

Chesnay Schepler wrote on Tue, Oct 15, 2019 at 6:48 PM:

> I have updated the FLIP.
>
> - adopted job-/cluster partitions naming scheme
> - outlined interface for new component living in the RM (currently
> called ThinShuffleMaster, but I'm not a fan of the name. Suggestions
> would be appreciated)
> - added a note that the ShuffleService changes are only necessary for
> external shuffle services, which could be omitted in a first version
>
> Unless there are objections I'll start a vote thread later today.
>
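As a rough illustration of the kind of component mentioned above -- every
name and signature below is a guess by way of example, not the FLIP's
actual interface (only ResultPartitionID is an existing Flink class):

    import java.util.Collection;
    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.runtime.io.network.partition.ResultPartitionID;

    // Hypothetical sketch of an RM-side component for cluster partitions;
    // the real interface is outlined in the FLIP and may differ throughout.
    interface ThinShuffleMaster {

        // Lists the cluster partitions currently known to the RM, as
        // reported by the TaskExecutors.
        CompletableFuture<Collection<ResultPartitionID>> listClusterPartitions();

        // Asks the shuffle service / TaskExecutors to release a cluster
        // partition, e.g. when a cached result is explicitly dropped.
        CompletableFuture<Void> releaseClusterPartition(ResultPartitionID partitionId);
    }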
> On 14/10/2019 06:28, Zhijiang wrote:
> > Thanks for these further considerations, Chesnay!
> >
> > I guess we might have some misunderstanding. Actually I was not
> > against the previous proposal Till suggested, and I think it is a
> > proper way to do it.
> >
> > And my previous proposal was not about excluding the ShuffleService
> > completely. The ShuffleService can be regarded as a factory for
> > creating the ShuffleMaster on the JM/RM side and the
> > ShuffleEnvironment on the TE side.
> >
> >  * For the ShuffleEnvironment on the TE side: I have no concerns. The
> > TE receives an RPC call for deleting local/global partitions and
> > handles them via the ShuffleEnvironment, similar to how local
> > partitions are handled now.
> >  * For the ShuffleMaster side: I saw some previous discussions on
> > multiple ShuffleMaster instances running in different components. I
> > was not against this approach in essence, but only wondered whether
> > it might make this feature more complex. So my proposal was only to
> > exclude the ShuffleMaster if possible, to make the implementation a
> > bit easier. I thought there could be a PartitionTracker-like
> > component in the RM for tracking/deleting global partitions, just as
> > we do now in the JM. The partition state would be reported from the
> > TE and maintained in the PartitionTracker of the RM, and the
> > PartitionTracker could trigger a global partition release via the TE
> > gateway directly, not via the ShuffleMaster (which is also stateless
> > now). Actually, in the existing PartitionTrackerImpl in the JM, the
> > RPC call TE#releasePartitions is in some cases also triggered without
> > going through the ShuffleMaster, which can be regarded as a shortcut.
> > Of course I am also in favour of always calling the actual partition
> > release via the ShuffleMaster; that form seems more elegant.
> >
> > I do not want this side note to block this feature or disturb your
> > previous conclusion. Moreover, Till's recent reply already dispels my
> > previous concern. :)
> >
> > Best,
> > Zhijiang
> >
> > --
> > From: Chesnay Schepler
> > Send Time: Mon, Oct 14, 2019 07:00
> > To: dev; Till Rohrmann; zhijiang
> > Subject: Re: [DISCUSS] FLIP-67: Global partitions lifecycle
> >
> > I'm quite torn on whether to exclude the ShuffleServices from the
> > proposal. I think I'm now on my third or fourth iteration for a
> > response, so I'll just send both so I can stop thinking for a bit
> > about whether to push for one or the other:
> >
> > Opinion A, aka "Nu Uh":
> >
> > I'm not in favor of excluding the shuffle master from this proposal;
> > I believe it raises interesting questions that should be discussed
> > beforehand; otherwise we may just end up developing ourselves into a
> > corner.
> > Unless there are good reasons for doing so I'd prefer to keep the
> > functionality across shuffle services consistent.
> > And man, my last sentence is giving me headaches (how can you
> > introduce inconsistencies across shuffle services if you don't even
> > touch them?..)
> >
> > Ultimately the RM only needs the ShuffleService for 2 things, which
> > are fairly straight-forward

[jira] [Created] (FLINK-14464) Introduce the 'AbstractUserClassPathJobGraphRetriever' which could construct the 'user code class path' from the "job" dir.

2019-10-20 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-14464:
-

 Summary: Introduce the 'AbstractUserClassPathJobGraphRetriever' 
which could construct the 'user code class path' from the "job" dir.
 Key: FLINK-14464
 URL: https://issues.apache.org/jira/browse/FLINK-14464
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.10.0
Reporter: Guowei Ma
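
A rough sketch of the idea behind this ticket, using only plain JDK APIs --
this is not the actual Flink implementation, just one way such a retriever
could derive the user code class path from a job directory:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Stream;

    abstract class AbstractUserClassPathJobGraphRetriever {

        private final List<URL> userClassPath;

        protected AbstractUserClassPathJobGraphRetriever(Path jobDir) throws IOException {
            if (jobDir == null) {
                this.userClassPath = Collections.emptyList();
                return;
            }
            // Every jar found below the "job" dir becomes one entry of the
            // user code class path.
            List<URL> urls = new ArrayList<>();
            try (Stream<Path> files = Files.walk(jobDir)) {
                files.filter(file -> file.toString().endsWith(".jar"))
                        .forEach(jar -> {
                            try {
                                urls.add(jar.toUri().toURL());
                            } catch (MalformedURLException e) {
                                throw new IllegalStateException(e);
                            }
                        });
            }
            this.userClassPath = Collections.unmodifiableList(urls);
        }

        /** Subclasses feed these URLs into the user code class loader. */
        protected List<URL> getUserClassPath() {
            return userClassPath;
        }
    }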








[jira] [Created] (FLINK-14465) Let StandaloneJobClusterEntrypoint use user code class loader

2019-10-20 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-14465:
-

 Summary: Let StandaloneJobClusterEntrypoint use user code class 
loader
 Key: FLINK-14465
 URL: https://issues.apache.org/jira/browse/FLINK-14465
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.10.0
Reporter: Guowei Ma
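
This ticket and the two below apply the same idea to different entrypoints.
The general mechanism, sketched under the assumption of a plain
parent-first URLClassLoader (Flink's real user code class loader
additionally supports child-first resolution patterns):

    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.List;

    final class UserCodeClassLoaderFactory {

        // Builds a class loader from the user class path URLs (e.g. the
        // jars found under the job dir) and installs it as the context
        // class loader before the job graph is retrieved/executed.
        static ClassLoader createAndInstall(List<URL> userClassPath, ClassLoader parent) {
            ClassLoader userCodeClassLoader =
                    new URLClassLoader(userClassPath.toArray(new URL[0]), parent);
            Thread.currentThread().setContextClassLoader(userCodeClassLoader);
            return userCodeClassLoader;
        }
    }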








[jira] [Created] (FLINK-14466) Let YarnJobClusterEntrypoint use user code class loader

2019-10-20 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-14466:
-

 Summary: Let YarnJobClusterEntrypoint use user code class loader
 Key: FLINK-14466
 URL: https://issues.apache.org/jira/browse/FLINK-14466
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.10.0
Reporter: Guowei Ma








[jira] [Created] (FLINK-14467) Let MesosJobClusterEntrypoint use user code class loader

2019-10-20 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-14467:
-

 Summary: Let MesosJobClusterEntrypoint use user code class loader
 Key: FLINK-14467
 URL: https://issues.apache.org/jira/browse/FLINK-14467
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.10.0
Reporter: Guowei Ma








[jira] [Created] (FLINK-14468) Update Kubernetes docs

2019-10-20 Thread Maximilian Bode (Jira)
Maximilian Bode created FLINK-14468:
---

 Summary: Update Kubernetes docs
 Key: FLINK-14468
 URL: https://issues.apache.org/jira/browse/FLINK-14468
 Project: Flink
  Issue Type: Task
  Components: Documentation
Reporter: Maximilian Bode


Two minor improvements to documented Kubernetes resource definitions:
 * avoid referencing deprecated extensions/v1beta1/Deployment
 * run unprivileged





Re: [ANNOUNCE] Apache Flink 1.9.1 released

2019-10-20 Thread Thomas Weise
Thanks Jark and everyone who contributed to the release!

I took 1.9.1 for a test drive, upgrading an existing 1.8.x deployment. This
brought in the shiny new UI :)

Everything looks good so far, except an error message that is flickering on
the job overview page: "Server response message: unable to load requested
file.."

Looking at what is being requested makes me gasp..


   http://somehost/jobs/047082c3386e250a1e0607360a003022/vertices/5995a65cc0e8d66f0ffa0d05c183c8f8/metrics?get=0.currentInputWatermark,1.currentInputWatermark,2.currentInputWatermark,3.currentInputWatermark,4.currentInputWatermark,5.currentInputWatermark,6.currentInputWatermark,7.currentInputWatermark,8.currentInputWatermark,9.currentInputWatermark,10.currentInputWatermark,11.currentInputWatermark,12.currentInputWatermark,13.currentInputWatermark,14.currentInputWatermark,15.currentInputWatermark,16.currentInputWatermark,17.currentInputWatermark,18.currentInputWatermark,19.currentInputWatermark,20.currentInputWatermark,21.currentInputWatermark,22.currentInputWatermark,23.currentInputWatermark,24.currentInputWatermark,25.currentInputWatermark,26.currentInputWatermark,27.currentInputWatermark,28.currentInputWatermark,29.currentInputWatermark,30.currentInputWatermark,<LOT OF STUFF>,255.currentInputWatermark

This just won't work for high parallelism jobs. The watermark display was
already broken with the old UI, thankfully now it's a bit more obvious. Is
this a known/previously reported issue for 1.9.x?

Thanks,
Thomas


On Sat, Oct 19, 2019 at 6:24 PM Hequn Cheng  wrote:

> Thanks a lot to Jark, Jincheng, and everyone that make this release
> possible.
>
> Best, Hequn
>
> On Sat, Oct 19, 2019 at 10:29 PM Zili Chen  wrote:
>
> > Thanks a lot for being release manager Jark. Great work!
> >
> > Best,
> > tison.
> >
> >
> > Till Rohrmann wrote on Sat, Oct 19, 2019 at 10:15 PM:
> >
> >> Thanks a lot for being our release manager Jark and thanks to everyone
> >> who has helped to make this release possible.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Sat, Oct 19, 2019 at 3:26 PM Jark Wu  wrote:
> >>
> >>>  Hi,
> >>>
> >>> The Apache Flink community is very happy to announce the release of
> >>> Apache Flink 1.9.1, which is the first bugfix release for the Apache
> Flink
> >>> 1.9 series.
> >>>
> >>> Apache Flink® is an open-source stream processing framework for
> >>> distributed, high-performing, always-available, and accurate data
> streaming
> >>> applications.
> >>>
> >>> The release is available for download at:
> >>> https://flink.apache.org/downloads.html
> >>>
> >>> Please check out the release blog post for an overview of the
> >>> improvements for this bugfix release:
> >>> https://flink.apache.org/news/2019/10/18/release-1.9.1.html
> >>>
> >>> The full release notes are available in Jira:
> >>> https://issues.apache.org/jira/projects/FLINK/versions/12346003
> >>>
> >>> We would like to thank all contributors of the Apache Flink community
> >>> who helped to verify this release and made this release possible!
> >>> Great thanks to @Jincheng for helping finalize this release.
> >>>
> >>> Regards,
> >>> Jark Wu
> >>>
> >>
>


[ANNOUNCE] Weekly Community Update 2019/42

2019-10-20 Thread Konstantin Knauf
Dear community,

Happy to share this week's community update, covering the release of
Flink 1.9.1, a couple of threads about our development process, the SQL
ecosystem, and more.

Flink Development
==

* [releases] *Apache Flink 1.9.1* was released. [1,2]

* [statefun] Stephan has started a separate discussion on whether to
maintain *Stateful Functions* in a separate repository or the Flink main
repository after its contribution to Apache Flink. The majority seems to
prefer a separate repository e.g. to enable quicker iterations on the new
code base and not to overwhelm new contributors to Stateful Functions. [3]

* [development process] The *NOTICE* file and the directory for licenses of
bundled dependencies for binary releases are now auto-generated during the
release process. [4]

* [development process] According to our *FLIP process* the introduction of
a new config option requires a FLIP (and vote). Aljoscha has started a
discussion to clarify this point, as this is currently not always the case.
It looks like the majority leans towards a vote for every configuration
change, but possibly making it more lightweight than a proper FLIP. [5]

* [development process] Xiyuan gave an update on *Flink's ARM support.*
Travis ARM support is now in alpha (an alternative to the previously
proposed OpenLab), and regardless of the CI system, Xiyuan points the
community to a list of PRs/Jiras that need to be resolved/reviewed. [6]

* [configuration] The discussion on FLIP-59 to make the execution
configuration (ExecutionConfig et al.) configurable via the Flink
Configuration has been revived a bit and focuses on alignment with FLIP-73
and naming of the different configurations now. [7]

* [sql] Based on feedback from the user community, Timo proposes to rename
the "ANY" data type to "OPAQUE", highlighting that a field of this type
does not hold an arbitrary type, but rather a data type that is unknown to
Flink. [8]

* [sql] Jark has started a discussion on FLIP-80 [9] about how to
de/serialize expressions in catalogs. [10]

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-9-1-released-tp34170.html
[2] https://flink.apache.org/news/2019/10/18/release-1.9.1.html
[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-tp34034.html
[4]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/NOTICE-Binary-licensing-is-now-auto-generated-tp34121.html
[5]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-policy-for-introducing-config-option-keys-tp34011p34045.html
[6]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ARM-support-Travis-ARM-CI-is-now-in-Alpha-Release-tp34039.html
[7]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-59-Enable-execution-configuration-from-Configuration-object-tp32359.html
[8]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Rename-the-SQL-ANY-type-to-OPAQUE-type-tp34162.html
[9]
https://docs.google.com/document/d/1LxPEzbPuEVWNixb1L_USv0gFgjRMgoZuMsAecS_XvdE/edit?usp=sharing
[10]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISUCSS-FLIP-80-Expression-String-Serializable-and-Deserializable-tp34146.html

Notable Bugs
==

* [FLINK-14429] [1.9.1] [1.8.2] When you run a batch job on YARN in
non-detached mode, it may be reported as SUCCEEDED when it actually
FAILED. [11]

[11] https://issues.apache.org/jira/browse/FLINK-14429

Events, Blog Posts, Misc
===

* This discussion on the dev@ mailing list might be interesting to follow
for people using the StreamingFileSink or BucketingSink with S3.  [12]

* There will be a full-day meetup with six talks in the Bangalore Kafka
Group on the 2nd of November, including at least three Flink talks by *Timo
Walter* (Ververica), *Shashank Agarwal* (Razorpay) and *Rasyid Hakim*
(GoJek). [13]

[12]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/performances-of-S3-writing-with-many-buckets-in-parallel-tp34021p34050.html
[13] https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/265285812/

Cheers,

Konstantin (@snntrable)

-- 

Konstantin Knauf | Solutions Architect

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Tony) Cheng


Re: [DISCUSS] FLIP-67: Global partitions lifecycle

2019-10-20 Thread Till Rohrmann
Hi Zhu Zhu,

the cluster partition does not need to be registered at the RM before it
can be used. The cluster partition descriptor will be reported to the
client as part of the job execution result. This information is used to
construct a JobGraph which can consume from a cluster partition. The
cluster partition descriptor contains all the information necessary to read
the partition. Hence, a job consuming this partition will simply deploy the
consumer on a TM and then read the cluster partition described by the
cluster partition descriptor. If the partition is no longer available, then
the job will fail and the client needs to handle the situation. If the
client knows how to reprocess the partition, then it would submit the
producing job.
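
A minimal sketch of this consume-or-reproduce flow -- the types below are
illustrative stand-ins, not Flink's real client API:

    // Stand-in for a client that can submit the producing job (returning
    // the cluster partition descriptor from the job execution result) and
    // a consumer job built from such a descriptor.
    interface PartitionClient {
        String runProducerJob();
        void runConsumerJob(String partitionDescriptor);
    }

    final class ConsumeOrReproduce {
        static void run(PartitionClient client, String descriptor) {
            try {
                client.runConsumerJob(descriptor);
            } catch (IllegalStateException partitionGone) {
                // The partition was already released: re-run the producing
                // job and retry the consumer with the fresh descriptor.
                client.runConsumerJob(client.runProducerJob());
            }
        }
    }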

Cheers,
Till

On Sun, Oct 20, 2019 at 12:23 PM Zhu Zhu  wrote:

> [...]

Re: [SURVEY] Dropping non Credit-based Flow Control

2019-10-20 Thread SHI Xiaogang
+1

Credit-based flow control has long been used in our production environment
as well. It works fine and there seems to be no reason to use the non
credit-based implementation.

Regards,
Xiaogang

Zhu Zhu wrote on Sat, Oct 19, 2019 at 3:01 PM:

> +1 to drop the non credit-based flow control.
> We have been using credit-based flow control in production for a long
> time, and it has worked well for all our cases.
> The non credit-based flow control code has been a burden when we are
> trying to change the network stack code for new features.
>
> Thanks,
> Zhu Zhu
>
>
> Biao Liu wrote on Thu, Oct 10, 2019 at 5:45 PM:
>
> > Thanks for starting this survey, Piotr.
> >
> > We have benefitted from credit-based flow control a lot. I can't
> > figure out a reason to use the non credit-based model.
> > I think we have kept the older code paths long enough (1.5 -> 1.9).
> > That's a big burden to maintain, especially since there is a lot of
> > duplicated code between the credit-based and non credit-based models.
> >
> > So +1 to do the cleanup.
> >
> > Thanks,
> > Biao /'bɪ.aʊ/
> >
> >
> >
> > On Thu, 10 Oct 2019 at 11:15, zhijiang wrote:
> >
> > > Thanks for bringing up this survey, Piotr.
> > >
> > > Actually I was already considering dropping the non credit-based
> > > code path in release-1.9, and now, motivated by [3], I think it is
> > > the proper time to do it.
> > > The credit-based mode has been the default since Flink 1.5 and has
> > > been verified to be stable and reliable across many versions. At
> > > Alibaba we always use the default credit-based mode in all products.
> > > Dropping the non credit-based code path removes a lot of maintenance
> > > overhead, so +1 from my side.
> > >
> > > Best,
> > > Zhijiang
> > > --
> > > From: Piotr Nowojski
> > > Send Time: Wed, Oct 2, 2019 17:01
> > > To: dev
> > > Subject: [SURVEY] Dropping non Credit-based Flow Control
> > >
> > > Hi,
> > >
> > > In Flink 1.5 we introduced Credit-based Flow Control [1] in the
> > > network stack. Back then we were aware of its potential downsides
> > > [2] and decided to keep the old model in the code base (configurable
> > > by setting `taskmanager.network.credit-model: false`). Now that we
> > > are about to modify the internals of the network stack again [3], it
> > > might be a good time to clean up the code and remove the older code
> > > paths.
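
For reference, a trivial sketch of how the legacy mode is selected today,
using Flink's generic Configuration API (the key string is taken verbatim
from the message above):

    import org.apache.flink.configuration.Configuration;

    final class LegacyNetworkMode {
        static Configuration disableCreditBasedFlowControl() {
            Configuration conf = new Configuration();
            // Selects the non credit-based code path discussed in this
            // survey; the proposal is to remove it along with this setting.
            conf.setBoolean("taskmanager.network.credit-model", false);
            return conf;
        }
    }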
> > >
> > > Is anyone still using the non-default, non credit-based model
> > > (`taskmanager.network.credit-model: false`)? If so, why?
> > >
> > > Piotrek
> > >
> > > [1] https://flink.apache.org/2019/06/05/flink-network-stack.html
> > > [2] https://flink.apache.org/2019/06/05/flink-network-stack.html#what-do-we-gain-where-is-the-catch
> > > [3] https://lists.apache.org/thread.html/a2b58b7b2b24b9bd4814b2aa51d2fc44b08a919eddbb5b1256be5b6a@%3Cdev.flink.apache.org%3E
> > >
> > >
> >
>


Re: [ANNOUNCE] Apache Flink 1.9.1 released

2019-10-20 Thread Yadong Xie
Hi Thomas Weise,

We have an issue related to this problem
(https://issues.apache.org/jira/browse/FLINK-14147), but I think you could
open another JIRA.
The watermark data comes from the metrics REST API, which is too heavy for
high-parallelism jobs; the API will almost surely break when the
parallelism is too high.
We could hide the error message as a temporary solution and refactor the
API later to solve this issue completely.
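
For comparison, a sketch of what a single aggregated request could look
like -- this assumes the REST API's aggregated subtask metrics endpoint and
its get/agg parameters behave as documented (host and ids are
placeholders):

    final class WatermarkQuery {
        // One aggregated request instead of one metric name per subtask.
        static String aggregatedWatermarkUrl(String host, String jobId, String vertexId) {
            return String.format(
                    "http://%s/jobs/%s/vertices/%s/subtasks/metrics"
                            + "?get=currentInputWatermark&agg=min",
                    host, jobId, vertexId);
        }
    }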

Thomas Weise wrote on Mon, Oct 21, 2019 at 1:31 AM:

> [...]


[jira] [Created] (FLINK-14469) Drop Python 2 support

2019-10-20 Thread Dian Fu (Jira)
Dian Fu created FLINK-14469:
---

 Summary: Drop Python 2 support
 Key: FLINK-14469
 URL: https://issues.apache.org/jira/browse/FLINK-14469
 Project: Flink
  Issue Type: Task
  Components: API / Python
Affects Versions: 1.10.0
Reporter: Dian Fu
 Fix For: 1.10.0


As discussed in the ML
(http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Drop-Python-2-support-for-1-10-tt33962.html),
the aim of this Jira is to drop Python 2 support. This includes the
following work:
1. Remove the Python 2.7 CI tests
2. Throw an exception if a Python version below 3.5 is used
3. Clean up the code that is specific to Python 2.x





Re: [DISCUSS] FLIP-67: Global partitions lifecycle

2019-10-20 Thread Zhu Zhu
Thanks Till for the explanation! That looks good to me.

Thanks,
Zhu Zhu

Till Rohrmann wrote on Mon, Oct 21, 2019 at 2:45 AM:

> [...]

[jira] [Created] (FLINK-14470) Watermark display not working with high parallelism job

2019-10-20 Thread Thomas Weise (Jira)
Thomas Weise created FLINK-14470:


 Summary: Watermark display not working with high parallelism job
 Key: FLINK-14470
 URL: https://issues.apache.org/jira/browse/FLINK-14470
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.9.1, 1.8.2
Reporter: Thomas Weise


Watermarks don't display in the UI when the job graph has many vertices. The
REST API call to fetch currentInputWatermark enumerates every subtask in the
query string, which inevitably fails beyond a certain limit. With the new UI
this can be noticed through a flickering error message in the upper right
corner.





Re: [VOTE] FLIP-67: Cluster partition life-cycle

2019-10-20 Thread Zhu Zhu
Thanks Chesnay for proposing this improvement!
It's meaningful for interactive query and the design looks good.

+1 (non-binding)

Thanks,
Zhu Zhu

Till Rohrmann wrote on Thu, Oct 17, 2019 at 8:48 PM:

> Thanks for creating this FLIP and starting the VOTE for it.
>
> +1 (binding)
>
> Cheers,
> Till
>
> > On Thu, Oct 17, 2019 at 5:08 AM Yun Gao wrote:
>
> > Hi Chesnay,
> >
> >  +1 (non-binding).
> >
> >  Many thanks for driving this.
> >
> >   Best,
> >  Yun
> >
> >
> > --
> > From: Chesnay Schepler
> > Send Time: Wed, Oct 16, 2019 21:01
> > To: dev@flink.apache.org
> > Subject: [VOTE] FLIP-67: Cluster partition life-cycle
> >
> > Hello,
> >
> > please vote for FLIP-67
> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-67%3A+Cluster+partitions+lifecycle>.
> >
> > The discussion thread can be found here
> > <https://mail-archives.apache.org/mod_mbox/flink-dev/201909.mbox/%3C58db5a13-2b27-e853-bbf1-ccbf404abc7e%40apache.org%3E>.
> >
> > This vote will be open for at least 72 hours and requires consensus to
> > be accepted.
> >
> > Regards,
> > Chesnay
> >
> >
>


Re: [ANNOUNCE] Apache Flink 1.9.1 released

2019-10-20 Thread Thomas Weise
Filed as https://issues.apache.org/jira/browse/FLINK-14470



On Sun, Oct 20, 2019 at 7:09 PM Yadong Xie  wrote:

> [...]


Re: Watermarks not propagated to WebUI?

2019-10-20 Thread Thomas Weise
The issue persists with 1.9.1:

https://issues.apache.org/jira/browse/FLINK-14470


On Mon, Aug 26, 2019 at 1:47 AM Jan Lukavský  wrote:

> Hi Robert,
>
> I'd very much love to, but because I run my pipeline with Beam, I'm
> afraid I will have to wait a little longer until Beam has a runner for
> 1.9 [1]. I'm pretty sure that the watermarks disappeared with an overall
> parallelism (over all operators) somewhere above 2000. There were quite
> a lot of operators (shuffling), so the individual parallelism of each
> operator was about 200. The pipeline was spread over 50 taskmanagers
> (each having 4 slots).
>
> [1] https://github.com/apache/beam/pull/9296/
>
> On 8/26/19 10:23 AM, Robert Metzger wrote:
> > Jan, will you be able to test this issue on the now-released Flink 1.9
> > with the new UI?
> >
> > What parallelism is needed to reproduce the issue?
> >
> >
> > On Thu, Aug 15, 2019 at 1:59 PM Chesnay Schepler wrote:
> >
> > I remember an issue regarding the watermark fetch request from the
> > WebUI
> > exceeding some HTTP size limit, since it tries to fetch all
> > watermarks
> > at once, and the format of this request isn't exactly efficient.
> >
> > Querying metrics for individual operators still works since the
> > request
> > is small enough.
> >
> > Not sure whether we ever fixed that.
> >
> > On 15/08/2019 12:01, Jan Lukavský wrote:
> > > Hi,
> > >
> > > Thomas, thanks for confirming this. I have noticed that in 1.9 the
> > > WebUI has been reworked a lot; does anyone know if this is still an
> > > issue? I currently cannot easily try 1.9, so I cannot confirm or
> > > disprove it.
> > >
> > > Jan
> > >
> > > On 8/14/19 6:25 PM, Thomas Weise wrote:
> > >> I have also noticed this issue (Flink 1.5, Flink 1.8), and it
> > appears
> > >> with
> > >> higher parallelism.
> > >>
> > >> This can be confusing to the user when watermarks actually work
> > and
> > >> can be
> > >> observed using the metrics.
> > >>
> > >> On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> is it possible that watermarks are sometimes not propagated to the
> > >>> WebUI although they are internally advancing as normal? In the
> > >>> WebUI I see every operator showing "No Watermark", but outputs seem
> > >>> to be propagated to the sink (and there are watermark-sensitive
> > >>> operations involved - e.g. reductions on fixed windows without
> > >>> early emitting). More strangely, this happens when I increase the
> > >>> parallelism above some threshold. If I use a parallelism of N,
> > >>> watermarks are shown; when I increase it above some number (it
> > >>> seems not to be exactly deterministic), the watermarks seem to
> > >>> disappear.
> > >>>
> > >>> I'm using Flink 1.8.1.
> > >>>
> > >>> Did anyone experience something like this before?
> > >>>
> > >>> Jan
> > >>>
> > >>>
> > >
> >
>


[jira] [Created] (FLINK-14471) Hide error message when metric api failed

2019-10-20 Thread Yadong Xie (Jira)
Yadong Xie created FLINK-14471:
--

 Summary: Hide error message when metric api failed
 Key: FLINK-14471
 URL: https://issues.apache.org/jira/browse/FLINK-14471
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Web Frontend
Affects Versions: 1.9.1
Reporter: Yadong Xie
 Fix For: 1.10.0


The error message should be hidden when the metrics API fails.





Re: [DISCUSS] FLIP-72: Introduce Pulsar Connector

2019-10-20 Thread Yijie Shen
Hi everyone,

Glad to receive your valuable feedback.

I'll first separate the Pulsar catalog into another doc and show more
design and implementation details there.

For the current FLIP-72, I'll keep the sink part as the current work and
move the source part to future work until FLIP-27 is finalized.

I've also replied to some of the comments in the design doc, and I will
rewrite the catalog part according to Bowen's advice in both the email and
the comments.

Thanks for the help again.

Best,
Yijie

On Fri, Oct 18, 2019 at 12:40 AM Rong Rong  wrote:

> Hi Yijie,
>
> I also agree with Jark on separating the Catalog part into another FLIP.
>
> With FLIP-27 [1] also in the air, it is probably also great to split out
> and unblock the sink implementation contribution.
> I would suggest either putting a detailed implementation plan section in
> the doc, or (maybe too much separation?) splitting them into different
> FLIPs. What do you guys think?
>
> --
> Rong
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
>
> On Wed, Oct 16, 2019 at 9:00 PM Jark Wu  wrote:
>
> > Hi Yijie,
> >
> > Thanks for the design document. I agree with Bowen that the catalog
> > part needs more details.
> > And I would suggest separating the Pulsar catalog into another FLIP.
> > IMO, it has little to do with the source/sink.
> > Having a separate FLIP can unblock the contribution for the sink (or
> > source) and keep the discussion more focused.
> > I also left some comments in the documentation.
> >
> > Thanks,
> > Jark
> >
> > On Thu, 17 Oct 2019 at 11:24, Yijie Shen 
> > wrote:
> >
> > > Hi Bowen,
> > >
> > > Thanks for your comments. I'll add catalog details as you suggested.
> > >
> > > One more question: since we decided not to implement the source part
> > > of the connector for the moment, what can users do with a Pulsar
> > > catalog?
> > > Create a table backed by Pulsar and check existing Pulsar tables to
> > > see their schemas? Drop tables, maybe?
> > >
> > > Best,
> > > Yijie
> > >
> > > On Thu, Oct 17, 2019 at 1:04 AM Bowen Li  wrote:
> > >
> > > > Hi Yijie,
> > > >
> > > > Per the discussion, maybe you can move pulsar source to 'future work'
> > > > section in the FLIP for now?
> > > >
> > > > Besides, the FLIP seems to be quite rough at the moment, and I'd
> > > > recommend adding more details.
> > > >
> > > > A few questions mainly regarding the proposed pulsar catalog.
> > > >
> > > >- Can you provide some background on the pulsar schema registry
> > > >and how it works?
> > > >- The proposed design of pulsar catalog is very vague now, can you
> > > >share some details of how a pulsar catalog would work internally?
> > E.g.
> > > >   - which APIs does it support exactly? E.g. I see from your
> > > >   prototype that table creation is supported but not alteration.
> > > >   - is it going to connect to a pulsar schema registry via a http
> > > >   client or a pulsar client, etc
> > > >   - will it be able to handle multiple versions of pulsar, or just
> > > >   one? How is compatibility handled between different Flink-Pulsar
> > > >   versions?
> > > >   - will it support only reading from pulsar schema registry , or
> > > >   both read/write? Will it work end-to-end in Flink SQL for users
> > to
> > > create
> > > >   and manipulate a pulsar table such as "CREATE TABLE t WITH
> > > >   PROPERTIES(type=pulsar)" and "DROP TABLE t"?
> > > >   - Is a pulsar topic always gonna be a non-partitioned table?
> How
> > is
> > > >   a partitioned topic mapped to a Flink table?
> > > >- How to map Flink's catalog/database namespace to pulsar's
> > > >multi-tenant namespaces? I'm not very familiar with how multi
> > tenancy
> > > works
> > > >in pulsar, and some background context/use cases may help here
> too.
> > > E.g.
> > > >   - can a pulsar client/consumer/producer be multiple-tenant at
> the
> > > >   same time?
> > > >   - how does authentication work in pulsar's multi-tenancy and
> the
> > > >   catalog? asking since I didn't see the proposed pulsar catalog
> > has
> > > >   username/password configs
> > > >   - the FLIP seems to propose mapping a pulsar cluster and
> > > >   'tenant/namespace' respectively to Flink's 'catalog' and
> > > >   'database'. I wonder whether that totally makes sense, or should
> > > >   we actually map "tenant" to "catalog" and "namespace" to
> > > >   "database"?
> > > >
> > > > Cheers,
> > > > Bowen
> > > >
> > > > On Fri, Sep 20, 2019 at 1:16 AM Yijie Shen <henry.yijies...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> Per discussion in the previous thread
> > > >> <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Contribute-Pulsar-Flink-connector-back-to-Flink-tc32538.html>,
> > > >> I have created FLIP-72 to kick off a more detailed discussion on the
> > > Fli

Re: [DISCUSS] Improve Flink logging with contextual information

2019-10-20 Thread Yang Wang
+1 to Rong’s approach.

Using Java options and log4j, we could save the user logs to different
files.

Best
Yang
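
To make the discussed approach concrete, a minimal sketch using slf4j's
MDC plus env variables -- the variable names and the log pattern mentioned
in the comments are made up for illustration:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public final class ContextualLogging {

        private static final Logger LOG = LoggerFactory.getLogger(ContextualLogging.class);

        public static void main(String[] args) {
            // Resolve contextual info from env variables set at container
            // launch (variable names here are hypothetical).
            MDC.put("containerId", System.getenv().getOrDefault("CONTAINER_ID", "unknown"));
            MDC.put("jobName", System.getProperty("job.name", "unknown"));

            // With a layout pattern such as "%X{containerId} %X{jobName} %m%n"
            // in log4j.properties/logback.xml, every line carries the context.
            LOG.info("TaskManager started");
        }
    }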

Gyula Fóra wrote on Fri, Oct 18, 2019 at 4:41 PM:

> Hi all!
>
> Thanks for the answers, this has been very helpful and we could set up a
> similar scheme using the Env variables.
>
> Cheers,
> Gyula
>
> On Tue, Oct 15, 2019 at 9:55 AM Paul Lam  wrote:
>
> > +1 to Rong’s approach. We use a similar solution to the log context
> > problem
> > on YARN setups. FYI.
> >
> > WRT container contextual information: we collect logs via ELK, so the
> > log file paths (which contain the application id and container id) and
> > the host are attached to the logs. But if you don't want a new log
> > collector, you can also use the system env variables in your log
> > pattern. Flink sets the container information into system env
> > variables, which can be found in the container launch script.
> >
> > WRT job contextual information: we've tried MDC on task threads, but
> > it ended up with poor readability because Flink system threads are not
> > set with the MDC variables (in my case, user info), so now we use the
> > user name from the system env as the logger pattern variable instead.
> > However, for the job id/name, I'm afraid they cannot be found in the
> > default system env variables. You may need to find a way to set them
> > into the system env or system properties.
> >
> > Best,
> > Paul Lam
> >
> > > On Oct 15, 2019, at 12:50, Rong Rong wrote:
> > >
> > > Hi Gyula,
> > >
> > > Sorry for the late reply. I think it is definitely a challenge in terms
> > of
> > > log visibility.
> > > However, for your requirement I think you can customize your Flink job
> by
> > > utilizing a customized log formatter/encoder (e.g. log4j.properties or
> > > logback.xml) and a suitable logger implementation.
> > >
> > > One example you can follow is to provide customFields in your log
> > > encoding [1,2] and utilize a supported appender to write your log to
> > > a file. You can also utilize a more customized appender to log the
> > > data into an external database (for example, ElasticSearch, accessed
> > > via Kibana).
> > >
> > > One challenge you might face is how to configure this contextual
> > > information dynamically. In our setup, it is configured via system
> > > env params when the job launches, so loggers can dynamically resolve
> > > it at start time.
> > >
> > > Please let me know if any of the suggestions above helps.
> > >
> > > Cheers,
> > > Rong
> > >
> > > [1]
> > >
> >
> https://github.com/logstash/logstash-logback-encoder/blob/master/src/test/resources/logback-test.xml#L13
> > > [2] https://github.com/logstash/logstash-logback-encoder
> > >
> > > On Thu, Oct 3, 2019 at 1:56 AM Gyula Fóra wrote:
> > >
> > >> Hi all!
> > >>
> > >> We have been thinking that it would be a great improvement to add
> > >> contextual information to the Flink logs:
> > >>
> > >> - Container / yarn / host info to JM/TM logs
> > >> - Job info (job id/ jobname) to task logs
> > >>
> > >> I think this should be similar to how the metric scopes are set up
> > >> and should be able to provide the same information for logs. Ideally
> > >> it would be user configurable.
> > >>
> > >> We are wondering what would be the best way to do this, and would like
> > to
> > >> ask for opinions or past experiences.
> > >>
> > >> Our natural first thought was setting NDC / MDC in the different
> threads
> > >> but it seems to be a somewhat fragile mechanism as it can be easily
> > >> "cleared" or deleted by the user.
> > >>
> > >> What do you think?
> > >>
> > >> Gyula
> > >>
> >
> >
>