Re: [ANNOUNCE] Yu Li became a Flink committer

2020-01-28 Thread Yun Tang

Congratulations Yu, well deserved, and happy Chinese New Year ~


Best
Yun Tang

From: Xintong Song 
Sent: Tuesday, January 28, 2020 11:24
To: dev 
Subject: Re: [ANNOUNCE] Yu Li became a Flink committer

Congratulations Yu, well deserved.

Thank you~

Xintong Song



On Mon, Jan 27, 2020 at 10:15 PM Jark Wu  wrote:

> Congratulations Yu!
>
> Best,
> Jark
>
> On Sat, 25 Jan 2020 at 17:27, Leonard Xu  wrote:
>
> > Congratulations!!
> >
> > Best,
> > Leonard Xu
> >
> > > On 24 Jan 2020, at 21:17, 张光辉  wrote:
> > >
> > > Congratulations!!!
> > >
> > > On Fri, 24 Jan 2020 at 19:30, Guowei Ma  wrote:
> > >
> > >> Congratulations
> > >> Best,
> > >> Guowei
> > >>
> > >>
> > >> On Fri, 24 Jan 2020 at 17:56, 刘建刚  wrote:
> > >>
> > >>> Congratulations!
> > >>>
> >  On 23 Jan 2020, at 16:59, Stephan Ewen  wrote:
> > 
> >  Hi all!
> > 
> >  We are announcing that Yu Li has joined the ranks of Flink
> committers.
> > 
> >  Yu already joined in late December, but the announcement got lost
> > >> because
> >  of the Christmas and New Year's season, so here is a belated proper
> >  announcement.
> > 
> >  Yu is one of the main contributors to the state backend components
> in
> > >> the
> >  past year, working on various improvements, for example the
> RocksDB
> >  memory management for 1.10.
> >  He has also been one of the release managers for the big 1.10
> release.
> > 
> >  Congrats on joining us, Yu!
> > 
> >  Best,
> >  Stephan
> > >>>
> > >>>
> > >>
> >
> >
>


[jira] [Created] (FLINK-15778) SQL Client end-to-end test for Kafka 0.10 nightly run hung on travis

2020-01-28 Thread Yu Li (Jira)
Yu Li created FLINK-15778:
-

 Summary: SQL Client end-to-end test for Kafka 0.10 nightly run 
hung on travis
 Key: FLINK-15778
 URL: https://issues.apache.org/jira/browse/FLINK-15778
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Table SQL / Client
Affects Versions: 1.11.0
Reporter: Yu Li
 Fix For: 1.11.0


The "SQL Client end-to-end test for Kafka 0.10" end-to-end test hung on travis:
{noformat}
Waiting for broker...
Waiting for broker...


The job exceeded the maximum time limit for jobs, and has been terminated.
{noformat}

https://api.travis-ci.org/v3/job/642477196/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15779) Manual Smoke Test of Python API

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15779:


 Summary: Manual Smoke Test of Python API
 Key: FLINK-15779
 URL: https://issues.apache.org/jira/browse/FLINK-15779
 Project: Flink
  Issue Type: Task
  Components: API / Python
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Try out the [Python 
API|https://ci.apache.org/projects/flink/flink-docs-release-1.10/tutorials/python_table_api.html],
 and report any usability issues.
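
For reference, a quick smoke test following the linked tutorial could look 
roughly like this (the file name is whatever the tutorial program was saved as; 
this assumes pyflink from the 1.10 release candidate is installed):

{noformat}
./bin/start-cluster.sh
python WordCount.py   # the Table API word count written in the tutorial
{noformat}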




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15780) Manual Smoke Test of Native Kubernetes Integration

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15780:


 Summary: Manual Smoke Test of Native Kubernetes Integration
 Key: FLINK-15780
 URL: https://issues.apache.org/jira/browse/FLINK-15780
 Project: Flink
  Issue Type: Task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Try out the [native Kubernetes 
integration|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html],
 and report any usability issues.
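
For reference, starting a session cluster as described in the linked 
documentation looks roughly like this (the cluster id and resource values are 
examples taken from the docs):

{noformat}
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  -Dtaskmanager.memory.process.size=4096m \
  -Dkubernetes.taskmanager.cpu=2 \
  -Dtaskmanager.numberOfTaskSlots=4
{noformat}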



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15781) Manual Smoke Test with Batch Job

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15781:


 Summary: Manual Smoke Test with Batch Job
 Key: FLINK-15781
 URL: https://issues.apache.org/jira/browse/FLINK-15781
 Project: Flink
  Issue Type: Task
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


Try out a larger-scale batch job with {{"jobmanager.scheduler: ng"}} set
in {{flink-conf.yaml}}.
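
For reference, the relevant entry in flink-conf.yaml (the value "ng" selects 
the new generation scheduler; running one of the bundled batch examples at high 
parallelism should be enough to exercise it):

{noformat}
jobmanager.scheduler: ng
{noformat}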



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: connection timeout during shuffle initialization

2020-01-28 Thread Till Rohrmann
Hi David,

I'm unfortunately not familiar with these parts of Flink but I'm pulling
Piotr in who might be able to tell you more.

Cheers,
Till

On Mon, Jan 27, 2020 at 5:58 PM David Morávek  wrote:

> Hello community,
>
> I'm currently struggling with an Apache Beam batch pipeline on top of
> Flink. The pipeline runs smoothly in smaller environments, but in
> production it always ends up with `connection timeout` in one of the last
> shuffle phases.
>
> org.apache.flink.runtime.io
> .network.partition.consumer.PartitionConnectionException:
> Connection for partition
> 260ddc26547df92babd1c6d430903b9d@da5da68d6d4600bb272d68172a09760f not
> reachable.
> at org.apache.flink.runtime.io
> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
> at org.apache.flink.runtime.io
> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
> at org.apache.flink.runtime.io
> .network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
> at
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
> at
> org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> at java.lang.Thread.run(Thread.java:748)
> ...
> Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> Connection timed out: ##/10.249.28.39:25709
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at ...
>
> Basically the pipeline looks as follows:
>
> read "skewed" sources -> reshuffle -> flatMap (performs a heavy computation
> - few secs per element) -> write to multiple outputs (~4)
>
> - cluster size: 100 tms
> - slots per tm: 4
> - data size per single job run ranging from 100G to 1TB
> - job parallelism: 400
>
> I've tried to increase Netty's
> `taskmanager.network.netty.client.connectTimeoutSec` with no luck. Also,
> increasing the number of Netty threads did not help. The JVM performs OK (no
> OOMs, GC pauses, ...). The connect backlog defaults to 128.
>
> This is probably caused by Netty threads being blocked on the server side.
> All these threads share the same lock, so increasing the number of threads
> won't help.
>
> "Flink Netty Server (0) Thread 2" #131 daemon prio=5 os_prio=0
> tid=0x7f339818e800 nid=0x7e9 runnable [0x7f334acf5000]
>java.lang.Thread.State: RUNNABLE
> at java.lang.Number.(Number.java:55)
> at java.lang.Long.(Long.java:947)
> at
>
> sun.reflect.UnsafeLongFieldAccessorImpl.get(UnsafeLongFieldAccessorImpl.java:36)
> at java.lang.reflect.Field.get(Field.java:393)
> at
>
> org.apache.flink.core.memory.HybridMemorySegment.getAddress(HybridMemorySegment.java:420)
> at
>
> org.apache.flink.core.memory.HybridMemorySegment.checkBufferAndGetAddress(HybridMemorySegment.java:434)
> at
>
> org.apache.flink.core.memory.HybridMemorySegment.(HybridMemorySegment.java:81)
> at
>
> org.apache.flink.core.memory.HybridMemorySegment.(HybridMemorySegment.java:66)
> at
>
> org.apache.flink.core.memory.MemorySegmentFactory.wrapOffHeapMemory(MemorySegmentFactory.java:130)
> at
> org.apache.flink.runtime.io
> .network.partition.BufferReaderWriterUtil.sliceNextBuffer(BufferReaderWriterUtil.java:87)
> at
> org.apache.flink.runtime.io
> .network.partition.MemoryMappedBoundedData$BufferSlicer.nextBuffer(MemoryMappedBoundedData.java:240)
> at
> org.apache.flink.runtime.io
> .network.partition.BoundedBlockingSubpartitionReader.(BoundedBlockingSubpartitionReader.java:71)
> at
> org.apache.flink.runtime.io
> .network.partition.BoundedBlockingSubpartition.createReadView(BoundedBlockingSubpartition.java:201)
> - locked <0x0006d822e180> (a java.lang.Object)
> at
> org.apache.flink.runtime.io
> .network.partition.ResultPartition.createSubpartitionView(ResultPartition.java:279)
> at
> org.apache.flink.runtime.io
> .network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:72)
> - locked <0x0006cad32578> (a java.util.HashMap)
> at
> org.apache.flink.runtime.io
> .network.netty.CreditBasedSequenceNumberingViewReader.requestSubpartitionView(CreditBasedSequenceNumberingViewReader.java:86)
> - locked <0x00079767ff38> (a java.lang.Object)
> at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:102)
> at
> org.apache.flink.runtime.io
> .network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:42)
> at
>
> org.apache.flink.shaded.netty4.io.netty.channel

Re: [VOTE] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Till Rohrmann
+1 (binding)

Cheers,
Till

On Mon, Jan 27, 2020 at 7:17 PM Konstantin Knauf 
wrote:

> +1 (non-binding)
>
> On Mon, Jan 27, 2020 at 5:50 PM Ufuk Celebi  wrote:
>
> > Hey all,
> >
> > there is a proposal to contribute the Dockerfiles and scripts of
> > https://github.com/docker-flink/docker-flink to the Flink project. The
> > discussion corresponding to this vote outlines the reasoning for the
> > proposal and can be found here: [1].
> >
> > The proposal is as follows:
> > * Request a new repository apache/flink-docker
> > * Migrate all files from docker-flink/docker-flink to apache/flink-docker
> > * Update the release documentation to describe how to update
> > apache/flink-docker for new releases
> >
> > Please review and vote on this proposal as follows:
> > [ ] +1, Approve the proposal
> > [ ] -1, Do not approve the proposal (please provide specific comments)
> >
> > The vote will be open for at least 3 days, ending the earliest on:
> January
> > 30th 2020, 17:00 UTC.
> >
> > Cheers,
> >
> > Ufuk
> >
> > PS: I'm treating this proposal similar to a "Release Plan" as mentioned
> in
> > the project bylaws [2]. Please let me know if you consider this a
> different
> > category.
> >
> > [1]
> >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
> > [2]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> >
>
>
> --
>
> Konstantin Knauf | Solutions Architect
>
> +49 160 91394525
>
>
> Follow us @VervericaData Ververica 
>
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Tony) Cheng
>


[jira] [Created] (FLINK-15782) Rework JDBC sinks

2020-01-28 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-15782:
-

 Summary: Rework JDBC sinks
 Key: FLINK-15782
 URL: https://issues.apache.org/jira/browse/FLINK-15782
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: Roman Khachatryan


# Refactor the current code for re-use in a new exactly-once JDBC sink.
# Expose the existing sinks at the DataStream level (for users who don't need
exactly-once, or for whom upsert is a better option).

The existing API (mostly the Table API) shouldn't change.
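
For context, a minimal (untested) sketch of how the existing sink can already 
be driven from the DataStream API via {{JDBCOutputFormat}} — the driver, URL, 
and query below are placeholders:

{code}
import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;

public class JdbcSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // At-least-once style sink via the existing JDBCOutputFormat;
        // an upsert query can make replays idempotent.
        JDBCOutputFormat format = JDBCOutputFormat.buildJDBCOutputFormat()
            .setDrivername("org.postgresql.Driver")          // placeholder
            .setDBUrl("jdbc:postgresql://localhost/db")      // placeholder
            .setQuery("INSERT INTO t (id, v) VALUES (?, ?)") // placeholder
            .setBatchInterval(100)
            .finish();

        env.fromElements(Row.of(1, "a"), Row.of(2, "b"))
           .writeUsingOutputFormat(format);

        env.execute("jdbc sink sketch");
    }
}
{code}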



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: connection timeout during shuffle initialization

2020-01-28 Thread Piotr Nowojski
Hi David,

The usual cause for a connection timeout is a long deployment, for example if 
your job's jar is large and takes a long time to distribute across the cluster. 
I’m not sure whether large state could affect this as well. Are you sure that’s 
not the case?

The thing you are suggesting I haven’t heard about previously, but indeed it 
could theoretically happen. Reading from mmap’ed sub-partitions could block the 
Netty threads if the kernel decides to drop an mmap’ed page and it has to be 
read back from disk. Could you check your CPU and disk IO usage? This should 
show up as high IOWait CPU usage. Could you, for example, post a couple of 
sample results of the

iostat -xm 2

command from some representative TaskManager? If the disks are indeed 
overloaded, changing Flink’s config option

taskmanager.network.bounded-blocking-subpartition-type

from the default to `file` could solve the problem. FYI, this option is renamed 
in 1.10 to

taskmanager.network.blocking-shuffle.type

and its default value will be `file`.
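
For reference, the corresponding flink-conf.yaml entry would be (using the 1.9 
option name from above):

    taskmanager.network.bounded-blocking-subpartition-type: file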

We would appreciate it if you could get back to us with the results!

Piotrek

> On 28 Jan 2020, at 11:03, Till Rohrmann  wrote:
> 
> Hi David,
> 
> I'm unfortunately not familiar with these parts of Flink but I'm pulling
> Piotr in who might be able to tell you more.
> 
> Cheers,
> Till
> 
> On Mon, Jan 27, 2020 at 5:58 PM David Morávek  wrote:
> 
>> Hello community,
>> 
>> I'm currently struggling with an Apache Beam batch pipeline on top of
>> Flink. The pipeline runs smoothly in smaller environments, but in
>> production it always ends up with `connection timeout` in one of the last
>> shuffle phases.
>> 
>> org.apache.flink.runtime.io
>> .network.partition.consumer.PartitionConnectionException:
>> Connection for partition
>> 260ddc26547df92babd1c6d430903b9d@da5da68d6d4600bb272d68172a09760f not
>> reachable.
>>at org.apache.flink.runtime.io
>> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>>at org.apache.flink.runtime.io
>> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>>at org.apache.flink.runtime.io
>> .network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>>at
>> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>>at
>> org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>>at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>>at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>>at java.lang.Thread.run(Thread.java:748)
>> ...
>> Caused by:
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>> Connection timed out: ##/10.249.28.39:25709
>>at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>at ...
>> 
>> Basically the pipeline looks as follows:
>> 
>> read "skewed" sources -> reshuffle -> flatMap (performs a heavy computation
>> - few secs per element) -> write to multiple outputs (~4)
>> 
>> - cluster size: 100 tms
>> - slots per tm: 4
>> - data size per single job run ranging from 100G to 1TB
>> - job parallelism: 400
>> 
>> I've tried to increase netty
>> `taskmanager.network.netty.client.connectTimeoutSec` with no luck. Also
>> increasing # of netty threads did not help. JVM performs ok (no ooms, gc
>> pauses, ...). Connect backlog defaults to 128.
>> 
>> This is probably caused by netty threads being blocked on the server side.
>> All these threads share the same lock, so increasing number of threads
>> won't help.
>> 
>> "Flink Netty Server (0) Thread 2" #131 daemon prio=5 os_prio=0
>> tid=0x7f339818e800 nid=0x7e9 runnable [0x7f334acf5000]
>>   java.lang.Thread.State: RUNNABLE
>>at java.lang.Number.(Number.java:55)
>>at java.lang.Long.(Long.java:947)
>>at
>> 
>> sun.reflect.UnsafeLongFieldAccessorImpl.get(UnsafeLongFieldAccessorImpl.java:36)
>>at java.lang.reflect.Field.get(Field.java:393)
>>at
>> 
>> org.apache.flink.core.memory.HybridMemorySegment.getAddress(HybridMemorySegment.java:420)
>>at
>> 
>> org.apache.flink.core.memory.HybridMemorySegment.checkBufferAndGetAddress(HybridMemorySegment.java:434)
>>at
>> 
>> org.apache.flink.core.memory.HybridMemorySegment.(HybridMemorySegment.java:81)
>>at
>> 
>> org.apache.flink.core.memory.HybridMemorySegment.(HybridMemorySegment.java:66)
>>at
>> 
>> org.apache.flink.core.memory.MemorySegmentFactory.wrapOffHeapMemory(MemorySegmentFactory.java:130)
>>at
>> org.apache.flink.runtime.io
>> .network.partition.BufferReaderWriterUtil.sliceNextBuffer(BufferReaderWriterUtil.java:87)
>>at
>> org.apache.flink.runtime.io
>> .network.partition.MemoryMappedBoundedData$BufferSlicer.nextBuffer(MemoryMappedBoundedData.ja

[jira] [Created] (FLINK-15783) Test release candidate of Flink 1.10 on EMR

2020-01-28 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-15783:
--

 Summary: Test release candidate of Flink 1.10 on EMR
 Key: FLINK-15783
 URL: https://issues.apache.org/jira/browse/FLINK-15783
 Project: Flink
  Issue Type: Task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.10.0


Run a simple job with checkpointing and a StreamingFileSink, making sure, for 
example, that https://issues.apache.org/jira/browse/FLINK-15777 is fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15784) Show the Watermark metrics of source node in the web UI.

2020-01-28 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-15784:
-

 Summary: Show the Watermark metrics of source node in the web UI.
 Key: FLINK-15784
 URL: https://issues.apache.org/jira/browse/FLINK-15784
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Web Frontend
Reporter: Guowei Ma


Currently, the web UI only shows the watermarks of non-source nodes. This might 
mislead some users into thinking that the source cannot “receive” watermarks. [1]

It would be more consistent to show the watermark metrics of all nodes, 
including the source node, in the web UI.

 

[1][https://lists.apache.org/thread.html/r0f565a425253afea67eb2cca8dc1b233b57f87bb71650681aa9b6731%40%3Cuser.flink.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15785) Rework E2E test activations

2020-01-28 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-15785:


 Summary: Rework E2E test activations
 Key: FLINK-15785
 URL: https://issues.apache.org/jira/browse/FLINK-15785
 Project: Flink
  Issue Type: Improvement
  Components: Build System, Test Infrastructure, Travis
Affects Versions: 1.11.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.11.0


Problems in the current setup:
- Travis activations are spread out over multiple files (.travis.yml, 
travis_watchdog.sh)
- Hadoop exclusions aren't set up (causing FLINK-15685)
- Hadoop tests are not excluded by default
- the automatic Java 11 exclusion is prone to not working as intended
- setting multiple categories in .travis.yml is incredibly verbose



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15786) Load connector code with separate classloader

2020-01-28 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-15786:
-

 Summary: Load connector code with separate classloader
 Key: FLINK-15786
 URL: https://issues.apache.org/jira/browse/FLINK-15786
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: Guowei Ma


Currently, connector code can be seen as part of user code. Usually, users only 
need to add the corresponding connector as a dependency and package it in the 
user jar. This is convenient enough.

However, connectors usually need to interact with external systems and often 
introduce heavy dependencies, so there is a high possibility of class conflicts 
between different connectors or with the user code of the same job. For 
example, every one or two weeks we receive an issue report about a connector 
class conflict from our users. The problem gets worse when users want to 
analyze data from different sources and write output to different sinks.

Using a separate classloader to load each connector's code could resolve this 
problem.
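
A minimal sketch of the idea in plain Java (the jar path and class names are 
hypothetical): each connector jar gets its own classloader, so its transitive 
dependencies cannot clash with those of other connectors or of the user code.

{code}
import java.net.URL;
import java.net.URLClassLoader;

public class ConnectorLoaderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical location of an isolated connector jar.
        URL connectorJar = new URL("file:///tmp/flink-connector-example.jar");

        // One classloader per connector; the parent only provides Flink's
        // API classes, so heavy connector dependencies stay isolated.
        ClassLoader parent = ConnectorLoaderSketch.class.getClassLoader();
        try (URLClassLoader connectorLoader =
                 new URLClassLoader(new URL[] {connectorJar}, parent)) {
            Class<?> sinkClass =
                Class.forName("com.example.ExampleSink", true, connectorLoader);
            Object sink = sinkClass.getDeclaredConstructor().newInstance();
            System.out.println("Loaded " + sink.getClass());
        }
    }
}
{code}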



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15787) Upgrade REST API response to remove '-' from key names.

2020-01-28 Thread Daryl Roberts (Jira)
Daryl Roberts created FLINK-15787:
-

 Summary: Upgrade REST API response to remove '-' from key names.
 Key: FLINK-15787
 URL: https://issues.apache.org/jira/browse/FLINK-15787
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / REST
Affects Versions: 1.10.0
Reporter: Daryl Roberts


There are some REST API responses that include keys with hyphens in them. This 
forces the frontend to use string lookups to access those values, and we lose 
the type information when doing that.

Example from {{/jobs/vertices//backpressure}}

export interface JobBackpressureInterface {
  status: string;
  'backpressure-level': string;
  'end-timestamp': number;
  subtasks: JobBackpressureSubtaskInterface[];
}

I would like to update all of these to use {{_}} instead so we can maintain the 
type information we have in the web-runtime.

My suggestion for doing this without a version bump to the API is to simply add 
the {{_}} versions to all the endpoints as well. Then, after the web-runtime 
has completely switched over to using the new keys, the old hyphenated keys can 
be deprecated and removed at your own pace.
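
A minimal sketch of the transition idea on the server side, assuming plain 
Jackson annotations (the field and class names below are made up) — both key 
variants are emitted while the web-runtime migrates:

{code}
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BackpressureInfoSketch {
    @JsonProperty("backpressure-level") // legacy hyphenated key, kept for now
    public String backpressureLevelLegacy = "ok";

    @JsonProperty("backpressure_level") // new underscore key
    public String backpressureLevel = "ok";

    public static void main(String[] args) throws Exception {
        // prints {"backpressure-level":"ok","backpressure_level":"ok"}
        System.out.println(
            new ObjectMapper().writeValueAsString(new BackpressureInfoSketch()));
    }
}
{code}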

[~chesnay][~trohrmann]





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Patrick Lucas
Thanks everyone for your input on this!

@Fabian: I concur with utilizing the ASF infra and ASF Docker Hub
organization to build and host any "less-critical" images like you propose.
I would also add RC builds to that list, as alluded to in my original email.

--
Patrick

On Sun, Jan 26, 2020 at 4:28 PM Ufuk Celebi  wrote:

> Thanks all for chiming in. I'll continue tomorrow with a VOTE as suggested
> by Till.
>
> Regarding my initially proposed timeline: I don't think we will have
> everything ready before the first 1.10 RC, but I also think it's not that
> big of a deal. ;-)
>
> – Ufuk
>
>
> On Fri, Jan 24, 2020 at 11:59 AM Till Rohrmann 
> wrote:
>
> > +1 for Ufuk's proposal how to proceed. I guess the immediate next step
> > would be a VOTE for accepting the dockerfiles and where to store them.
> >
> > Cheers,
> > Till
> >
> > On Wed, Jan 22, 2020 at 4:05 PM Fabian Hueske  wrote:
> >
> > > Hi everyone,
> > >
> > > First of all, thank you very much Patrick for maintaining and
> publishing
> > > the Flink Docker images so far and for starting this discussion!
> > >
> > > I'm in favor of adding the Dockerfiles in a separate repository and not
> > in
> > > the main Flink repository.
> > > I also think that it makes sense to first focus on the contribution of
> > the
> > > Dockerfiles and consolidation of existing Dockerfiles before discussing
> > > special cases for development and testing.
> > >
> > > In addition to the Dockerfiles in the Flink main repo, there is also
> one
> > in
> > > the flink-playgrounds repo [1] to build a customized Docker image for
> the
> > > playground.
> > >
> > > Besides building and publishing "official" Flink images via DockerHub,
> > > there is also the option to let ASF Infra build Docker images and
> publish
> > > them under https://hub.docker.com/u/apache.
> > > These images would not be "official" DockerHub images anymore, but
> > > available under the Apache DockerHub user.
> > > However, I think it would be a good idea to keep the current setup for
> > the
> > > main Flink images (those that depend on Flink releases) for better
> > > visibility and to not confuse our users.
> > > We might want to publish less critical images (playground images, dev
> > > images, nightly builds, etc) via Infra under the Apache DockerHub user.
> > >
> > > Best,
> > > Fabian
> > >
> > > > On Mon, 13 Jan 2020 at 11:38, Ufuk Celebi  > wrote:
> > >
> > > > Hey all,
> > > >
> > > > first of all a big thank you for driving many of the Docker image
> > > releases
> > > > in the last two years.
> > > >
> > > > *(1) Moving docker-flink/docker-flink to apache/docker-flink*
> > > >
> > > > +1 to do this as you outlined. I would propose to aim for a first
> > > > integration with the 1.10 release without major changes to the
> existing
> > > > Dockerfiles. The work items would be to move the Dockerfiles and
> update
> > > the
> > > > release process documentation so everyone is on the same page.
> > > >
> > > > *(2) Consolidate Dockerfiles in apache/flink*
> > > >
> > > > +1 to start the process for this. I think this requires a bit of
> > thinking
> > > > about what the requirements are and which problems we want to solve.
> > From
> > > > skimming the existing Dockerfiles, it seems to me that the Docker
> image
> > > > builds fulfil quite a few different tasks. We have a script that can
> > > bundle
> > > > Hadoop, can copy an existing Flink distribution, can include user
> jars,
> > > > etc. The scope of this is quite broad and would warrant a design
> > > document/a
> > > > FLIP.
> > > >
> > > > I would move the questions about nightly builds, using a different
> base
> > > > image or having image variants with debug tooling to after (1) and
> (2)
> > or
> > > > make it part of (2).
> > > >
> > > > *(3) Next steps*
> > > >
> > > > If there are no objections, I would propose to tackle (1) and (2)
> > > separate
> > > > and to continue as follows:
> > > >
> > > > (i) Create tickets for (1) and aim to align with 1.10 release
> timeline
> > > > (ideally before the first RC). Since this does not touch any code in
> > the
> > > > release branches, I think this would not be affected by the feature
> > > freeze.
> > > > The major work item would be to update the docs and potential
> > > refactorings
> > > > of the existing process and Dockerfiles. I can help with the process
> to
> > > > create a new repo.
> > > >
> > > > (ii) Create first draft for consolidation of existing Dockerfiles.
> > After
> > > > this proposal is done, I would propose to bring it up for a separate
> > > > discussion on the ML.
> > > >
> > > >
> > > > What do you think? @Patrick: would you be interested in working on
> both
> > > (1)
> > > > + (2) or did you mainly have (1) in mind?
> > > >
> > > > Best,
> > > >
> > > > Ufuk
> > > >
> > > > On Sun, Jan 12, 2020 at 8:30 PM Konstantin Knauf <
> > > konstan...@ververica.com
> > > > >
> > > > wrote:
> > > >
> > > > > Big +1 for
> > > > >
> > > > > * official images in a separa

Re: [VOTE] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Patrick Lucas
Thanks for kicking this off, Ufuk.

+1 (non-binding)

--
Patrick

On Mon, Jan 27, 2020 at 5:50 PM Ufuk Celebi  wrote:

> Hey all,
>
> there is a proposal to contribute the Dockerfiles and scripts of
> https://github.com/docker-flink/docker-flink to the Flink project. The
> discussion corresponding to this vote outlines the reasoning for the
> proposal and can be found here: [1].
>
> The proposal is as follows:
> * Request a new repository apache/flink-docker
> * Migrate all files from docker-flink/docker-flink to apache/flink-docker
> * Update the release documentation to describe how to update
> apache/flink-docker for new releases
>
> Please review and vote on this proposal as follows:
> [ ] +1, Approve the proposal
> [ ] -1, Do not approve the proposal (please provide specific comments)
>
> The vote will be open for at least 3 days, ending the earliest on: January
> 30th 2020, 17:00 UTC.
>
> Cheers,
>
> Ufuk
>
> PS: I'm treating this proposal similar to a "Release Plan" as mentioned in
> the project bylaws [2]. Please let me know if you consider this a different
> category.
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
>


Re: [VOTE] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Stephan Ewen
+1

On Tue, Jan 28, 2020 at 2:20 PM Patrick Lucas  wrote:

> Thanks for kicking this off, Ufuk.
>
> +1 (non-binding)
>
> --
> Patrick
>
> On Mon, Jan 27, 2020 at 5:50 PM Ufuk Celebi  wrote:
>
> > Hey all,
> >
> > there is a proposal to contribute the Dockerfiles and scripts of
> > https://github.com/docker-flink/docker-flink to the Flink project. The
> > discussion corresponding to this vote outlines the reasoning for the
> > proposal and can be found here: [1].
> >
> > The proposal is as follows:
> > * Request a new repository apache/flink-docker
> > * Migrate all files from docker-flink/docker-flink to apache/flink-docker
> > * Update the release documentation to describe how to update
> > apache/flink-docker for new releases
> >
> > Please review and vote on this proposal as follows:
> > [ ] +1, Approve the proposal
> > [ ] -1, Do not approve the proposal (please provide specific comments)
> >
> > The vote will be open for at least 3 days, ending the earliest on:
> January
> > 30th 2020, 17:00 UTC.
> >
> > Cheers,
> >
> > Ufuk
> >
> > PS: I'm treating this proposal similar to a "Release Plan" as mentioned
> in
> > the project bylaws [2]. Please let me know if you consider this a
> different
> > category.
> >
> > [1]
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
> > [2]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> >
>


Re: connection timeout during shuffle initialization

2020-01-28 Thread David Morávek
Hi Piotr,

thanks for suggestions!

In case of a large jar, wouldn't this happen in the previous stages as well (if
so, this should not be the case)? Also, there shouldn't be any state involved
(unless the Beam IOs use it internally).

I'll get back to you with the results after checking the TMs' IO stats.

D.

On Tue, Jan 28, 2020 at 11:45 AM Piotr Nowojski  wrote:

> Hi David,
>
> The usual cause for connection time out is long deployment. For example if
> your Job's jar is large and takes long time to distribute across the
> cluster. I’m not sure if large state could affect this as well or not. Are
> you sure that’s not the case?
>
> The thing you are suggesting, I haven’t heard about previously, but indeed
> theoretically it could happen. Reading from mmap’ed sub partitions could
> block the Netty threads if kernel decides to drop mmap’ed page and it has
> to be read from the disks. Could you check your CPU and disks IO usage?
> This should be visible by high IOWait CPU usage. Could you for example post
> couple of sample results of
>
> iostat -xm 2
>
> command from some representative Task Manager? If indeed disks are
> overloaded, changing Flink’s config option
>
> taskmanager.network.bounded-blocking-subpartition-type
>
> from the default to `file` could solve the problem. FYI, this option is
> renamed in 1.10 to
>
> taskmanager.network.blocking-shuffle.type
>
> And its default value will be `file`.
>
> We would appreciate if you could get back to us with the results!
>
> Piotrek
>
> > On 28 Jan 2020, at 11:03, Till Rohrmann  wrote:
> >
> > Hi David,
> >
> > I'm unfortunately not familiar with these parts of Flink but I'm pulling
> > Piotr in who might be able to tell you more.
> >
> > Cheers,
> > Till
> >
> > On Mon, Jan 27, 2020 at 5:58 PM David Morávek  wrote:
> >
> >> Hello community,
> >>
> >> I'm currently struggling with an Apache Beam batch pipeline on top of
> >> Flink. The pipeline runs smoothly in smaller environments, but in
> >> production it always ends up with `connection timeout` in one of the
> last
> >> shuffle phases.
> >>
> >> org.apache.flink.runtime.io
> >> .network.partition.consumer.PartitionConnectionException:
> >> Connection for partition
> >> 260ddc26547df92babd1c6d430903b9d@da5da68d6d4600bb272d68172a09760f not
> >> reachable.
> >>at org.apache.flink.runtime.io
> >>
> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
> >>at org.apache.flink.runtime.io
> >>
> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
> >>at org.apache.flink.runtime.io
> >>
> .network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
> >>at
> >>
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
> >>at
> >>
> org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
> >>at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
> >>at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> >>at java.lang.Thread.run(Thread.java:748)
> >> ...
> >> Caused by:
> >>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> >> Connection timed out: ##/10.249.28.39:25709
> >>at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >>at
> >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> >>at ...
> >>
> >> Basically the pipeline looks as follows:
> >>
> >> read "skewed" sources -> reshuffle -> flatMap (performs a heavy
> computation
> >> - few secs per element) -> write to multiple outputs (~4)
> >>
> >> - cluster size: 100 tms
> >> - slots per tm: 4
> >> - data size per single job run ranging from 100G to 1TB
> >> - job parallelism: 400
> >>
> >> I've tried to increase netty
> >> `taskmanager.network.netty.client.connectTimeoutSec` with no luck. Also
> >> increasing # of netty threads did not help. JVM performs ok (no ooms, gc
> >> pauses, ...). Connect backlog defaults to 128.
> >>
> >> This is probably caused by netty threads being blocked on the server
> side.
> >> All these threads share the same lock, so increasing number of threads
> >> won't help.
> >>
> >> "Flink Netty Server (0) Thread 2" #131 daemon prio=5 os_prio=0
> >> tid=0x7f339818e800 nid=0x7e9 runnable [0x7f334acf5000]
> >>   java.lang.Thread.State: RUNNABLE
> >>at java.lang.Number.(Number.java:55)
> >>at java.lang.Long.(Long.java:947)
> >>at
> >>
> >>
> sun.reflect.UnsafeLongFieldAccessorImpl.get(UnsafeLongFieldAccessorImpl.java:36)
> >>at java.lang.reflect.Field.get(Field.java:393)
> >>at
> >>
> >>
> org.apache.flink.core.memory.HybridMemorySegment.getAddress(HybridMemorySegment.java:420)
> >>at
> >>
> >>
> org.apache.flink.core.memory.HybridMemorySegment.checkBufferAndGetAddress(HybridMemorySegment.java:434)
> >>at
> >>
> >>

Re: [VOTE] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Yun Tang
+1 (non-binding)

From: Stephan Ewen 
Sent: Tuesday, January 28, 2020 21:36
To: dev ; patr...@ververica.com 
Subject: Re: [VOTE] Integrate Flink Docker image publication into Flink release 
process

+1

On Tue, Jan 28, 2020 at 2:20 PM Patrick Lucas  wrote:

> Thanks for kicking this off, Ufuk.
>
> +1 (non-binding)
>
> --
> Patrick
>
> On Mon, Jan 27, 2020 at 5:50 PM Ufuk Celebi  wrote:
>
> > Hey all,
> >
> > there is a proposal to contribute the Dockerfiles and scripts of
> > https://github.com/docker-flink/docker-flink to the Flink project. The
> > discussion corresponding to this vote outlines the reasoning for the
> > proposal and can be found here: [1].
> >
> > The proposal is as follows:
> > * Request a new repository apache/flink-docker
> > * Migrate all files from docker-flink/docker-flink to apache/flink-docker
> > * Update the release documentation to describe how to update
> > apache/flink-docker for new releases
> >
> > Please review and vote on this proposal as follows:
> > [ ] +1, Approve the proposal
> > [ ] -1, Do not approve the proposal (please provide specific comments)
> >
> > The vote will be open for at least 3 days, ending the earliest on:
> January
> > 30th 2020, 17:00 UTC.
> >
> > Cheers,
> >
> > Ufuk
> >
> > PS: I'm treating this proposal similar to a "Release Plan" as mentioned
> in
> > the project bylaws [2]. Please let me know if you consider this a
> different
> > category.
> >
> > [1]
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
> > [2]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> >
>


Re: [VOTE] Integrate Flink Docker image publication into Flink release process

2020-01-28 Thread Fabian Hueske
+1

On Tue, 28 Jan 2020 at 15:23, Yun Tang  wrote:

> +1 (non-binding)
> 
> From: Stephan Ewen 
> Sent: Tuesday, January 28, 2020 21:36
> To: dev ; patr...@ververica.com <
> patr...@ververica.com>
> Subject: Re: [VOTE] Integrate Flink Docker image publication into Flink
> release process
>
> +1
>
> On Tue, Jan 28, 2020 at 2:20 PM Patrick Lucas 
> wrote:
>
> > Thanks for kicking this off, Ufuk.
> >
> > +1 (non-binding)
> >
> > --
> > Patrick
> >
> > On Mon, Jan 27, 2020 at 5:50 PM Ufuk Celebi  wrote:
> >
> > > Hey all,
> > >
> > > there is a proposal to contribute the Dockerfiles and scripts of
> > > https://github.com/docker-flink/docker-flink to the Flink project. The
> > > discussion corresponding to this vote outlines the reasoning for the
> > > proposal and can be found here: [1].
> > >
> > > The proposal is as follows:
> > > * Request a new repository apache/flink-docker
> > > * Migrate all files from docker-flink/docker-flink to
> apache/flink-docker
> > > * Update the release documentation to describe how to update
> > > apache/flink-docker for new releases
> > >
> > > Please review and vote on this proposal as follows:
> > > [ ] +1, Approve the proposal
> > > [ ] -1, Do not approve the proposal (please provide specific comments)
> > >
> > > The vote will be open for at least 3 days, ending the earliest on:
> > January
> > > 30th 2020, 17:00 UTC.
> > >
> > > Cheers,
> > >
> > > Ufuk
> > >
> > > PS: I'm treating this proposal similar to a "Release Plan" as mentioned
> > in
> > > the project bylaws [2]. Please let me know if you consider this a
> > different
> > > category.
> > >
> > > [1]
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> > >
> >
>


Re: connection timeout during shuffle initialization

2020-01-28 Thread Piotr Nowojski
Hi,

> In case of large jar, wouldn't this happen in previous stages as well (if
> so this should not be the case)?

I’m not exactly sure how jars are distributed, but if they are being 
sent/uploaded from one node (or some other static/fixed number of nodes, e.g. 
uploading to and reading from a DFS) to all, this might not scale well. Also, 
your dev deployment might not be stressing the network/storage as much as the 
production deployment, which can also affect the time to deploy the job.

What’s your job size? (How large is the jar uploaded to Flink?)

There might also be other factors in play here: if you are using Flink's job 
mode (not standalone), the time to start up a Flink node might be too long, so 
some nodes are already up and running and time out waiting for the others to 
start up.

> Also there shouldn't be any state involved
> (unless Beam IO's use it internally).

My bad. Instead of 

> - data size per single job run ranging from 100G to 1TB

I read state size 100G to 1TB.

Piotrek

> 
> On Tue, Jan 28, 2020 at 11:45 AM Piotr Nowojski  wrote:
> 
>> Hi David,
>> 
>> The usual cause for connection time out is long deployment. For example if
>> your Job's jar is large and takes long time to distribute across the
>> cluster. I’m not sure if large state could affect this as well or not. Are
>> you sure that’s not the case?
>> 
>> The thing you are suggesting, I haven’t heard about previously, but indeed
>> theoretically it could happen. Reading from mmap’ed sub partitions could
>> block the Netty threads if kernel decides to drop mmap’ed page and it has
>> to be read from the disks. Could you check your CPU and disks IO usage?
>> This should be visible by high IOWait CPU usage. Could you for example post
>> couple of sample results of
>> 
>>iostat -xm 2
>> 
>> command from some representative Task Manager? If indeed disks are
>> overloaded, changing Flink’s config option
>> 
>>taskmanager.network.bounded-blocking-subpartition-type
>> 
>> from the default to `file` could solve the problem. FYI, this option is
>> renamed in 1.10 to
>> 
>>taskmanager.network.blocking-shuffle.type
>> 
>> And its default value will be `file`.
>> 
>> We would appreciate if you could get back to us with the results!
>> 
>> Piotrek
>> 
>>> On 28 Jan 2020, at 11:03, Till Rohrmann  wrote:
>>> 
>>> Hi David,
>>> 
>>> I'm unfortunately not familiar with these parts of Flink but I'm pulling
>>> Piotr in who might be able to tell you more.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Mon, Jan 27, 2020 at 5:58 PM David Morávek  wrote:
>>> 
 Hello community,
 
 I'm currently struggling with an Apache Beam batch pipeline on top of
 Flink. The pipeline runs smoothly in smaller environments, but in
 production it always ends up with `connection timeout` in one of the
>> last
 shuffle phases.
 
 org.apache.flink.runtime.io
 .network.partition.consumer.PartitionConnectionException:
 Connection for partition
 260ddc26547df92babd1c6d430903b9d@da5da68d6d4600bb272d68172a09760f not
 reachable.
   at org.apache.flink.runtime.io
 
>> .network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
   at org.apache.flink.runtime.io
 
>> .network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
   at org.apache.flink.runtime.io
 
>> .network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
   at
 
>> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
   at
 
>> org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
   at java.lang.Thread.run(Thread.java:748)
 ...
 Caused by:
 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection timed out: ##/10.249.28.39:25709
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
   at ...
 
 Basically the pipeline looks as follows:
 
 read "skewed" sources -> reshuffle -> flatMap (performs a heavy
>> computation
 - few secs per element) -> write to multiple outputs (~4)
 
 - cluster size: 100 tms
 - slots per tm: 4
 - data size per single job run ranging from 100G to 1TB
 - job parallelism: 400
 
 I've tried to increase netty
 `taskmanager.network.netty.client.connectTimeoutSec` with no luck. Also
 increasing # of netty threads did not help. JVM performs ok (no ooms, gc
 pauses, ...). Connect backlog defaults to 128.
 
 This is probably caused by netty threads being blocked on the server
>> side.
 All these threads share the same

[jira] [Created] (FLINK-15788) Various Kubernetes integration improvements

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15788:
-

 Summary: Various Kubernetes integration improvements
 Key: FLINK-15788
 URL: https://issues.apache.org/jira/browse/FLINK-15788
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


This JIRA ticket is an umbrella issue collecting various improvements to 
Flink's Kubernetes integration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15789) ActionWatcher.await should throw InterruptedException

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15789:
-

 Summary: ActionWatcher.await should throw InterruptedException
 Key: FLINK-15789
 URL: https://issues.apache.org/jira/browse/FLINK-15789
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


The {{ActionWatcher.await}} method should declare that it throws an 
{{InterruptedException}}. Currently it catches the exception, does not reset 
the interrupted flag, and then throws a different exception.
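
A minimal sketch of the proposed signature (the method and parameter names are 
assumptions, not the actual interface):

{code}
import java.util.concurrent.TimeUnit;

// Hypothetical watcher interface illustrating the change: propagate the
// InterruptedException instead of swallowing it and rethrowing something else.
interface ActionWatcherSketch<T> {
    T await(long timeout, TimeUnit unit) throws InterruptedException;
}
{code}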



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15790) Make FlinkKubeClient and its implementations asynchronous

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15790:
-

 Summary: Make FlinkKubeClient and its implementations asynchronous
 Key: FLINK-15790
 URL: https://issues.apache.org/jira/browse/FLINK-15790
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


The {{FlinkKubeClient}} interface offers several methods which are synchronous 
(e.g. {{FlinkKubeClient.createTaskManagerPod}}, {{FlinkKubeClient.stopPod}}, 
{{FlinkKubeClient.getPodsWithLabels}}, etc). The problem is that these methods 
are directly used by the {{KubernetesResourceManager}} which calls them from 
the main thread. Since these methods perform I/O operations (sending and 
receiving network packets), they can potentially block the execution and hence 
should not happen from the {{RpcEndpoint's}} main thread.

I propose to make all potentially blocking operations on the 
{{FlinkKubeClient}} asynchronous so that the {{KubernetesResourceManager}} does 
not risk blocking the main thread. Alternatively, we could also introduce a 
{{FlinkKubeClientAsync}} which offers asynchronous operations.
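
A minimal sketch of the asynchronous variant (all names are assumptions; the 
{{ioExecutor}} would be a dedicated I/O pool provided by the runtime):

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

class AsyncKubeClientSketch {
    private final Executor ioExecutor;       // dedicated I/O pool, not the RPC main thread
    private final BlockingKubeClient client; // stands in for the existing sync client

    AsyncKubeClientSketch(Executor ioExecutor, BlockingKubeClient client) {
        this.ioExecutor = ioExecutor;
        this.client = client;
    }

    // The blocking I/O happens on ioExecutor; the RpcEndpoint's main thread
    // only ever sees the future.
    CompletableFuture<Void> stopPodAsync(String podName) {
        return CompletableFuture.runAsync(() -> client.stopPod(podName), ioExecutor);
    }

    interface BlockingKubeClient {
        void stopPod(String podName);
    }
}
{code}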



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15791) Don't use ForkJoinPool#commonPool() for executing asynchronous operations in Fabric8FlinkKubeClient

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15791:
-

 Summary: Don't use ForkJoinPool#commonPool() for executing 
asynchronous operations in Fabric8FlinkKubeClient
 Key: FLINK-15791
 URL: https://issues.apache.org/jira/browse/FLINK-15791
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


We should not use the {{ForkJoinPool#commonPool()}} in order to run 
asynchronous operations in the {{Fabric8FlinkKubeClient}} as it is done 
[here|https://github.com/apache/flink/blob/master/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/Fabric8FlinkKubeClient.java#L315].
 Since we don't know which other components are using this pool, it can be quite 
dangerous to use it as there might be congestion.

Instead, I propose to provide an explicit I/O {{Executor}} which is used for 
running asynchronous operations.
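
A minimal sketch of the difference (illustrative only; {{getServiceBlocking}} 
is a stand-in for the client's blocking call):

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ExplicitExecutorSketch {
    static String getServiceBlocking(String name) {
        return "service:" + name; // placeholder for a blocking HTTP call
    }

    public static void main(String[] args) {
        // Problematic: runs on ForkJoinPool.commonPool(), which is shared
        // process-wide and may be congested by other components.
        CompletableFuture<String> risky =
            CompletableFuture.supplyAsync(() -> getServiceBlocking("flink"));

        // Proposed: run blocking I/O on an explicitly provided executor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        CompletableFuture<String> safe =
            CompletableFuture.supplyAsync(() -> getServiceBlocking("flink"), ioExecutor);

        System.out.println(risky.join() + " / " + safe.join());
        ioExecutor.shutdown();
    }
}
{code}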



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15792) Make Flink logs accessible via kubectl logs per default

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15792:
-

 Summary: Make Flink logs accessible via kubectl logs per default
 Key: FLINK-15792
 URL: https://issues.apache.org/jira/browse/FLINK-15792
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


I think we should make Flink's logs accessible via {{kubectl logs}} by default. 
Firstly, this is the idiomatic way to obtain the logs from a container on 
Kubernetes. Secondly, especially if something does not work and the container 
cannot start or stops abruptly, there is no way to log into the container and 
look at the log file. This makes debugging the setup quite hard.

I think the best way would be to create the Flink Docker image in such a way 
that it logs to stdout. In order to allow access to the log file from the web 
UI, it should also create a log file. One way to achieve this is to add a 
ConsoleAppender to the respective logging configuration. Another way could be 
to start the process in console mode and then tee the stdout output into the 
log file.
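
A minimal sketch of the ConsoleAppender variant, extending the stock 
log4j.properties (log4j 1.x syntax, as currently shipped):

{noformat}
# log to both the file (for the web UI) and stdout (for kubectl logs)
log4j.rootLogger=INFO, file, console

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
{noformat}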



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15793) Move kubernetes-entry.sh out of FLINK_HOME/bin

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15793:
-

 Summary: Move kubernetes-entry.sh out of FLINK_HOME/bin
 Key: FLINK-15793
 URL: https://issues.apache.org/jira/browse/FLINK-15793
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


Currently, {{FLINK_HOME/bin}} contains the file {{kubernetes-entry.sh}}. This 
file is used to customize Flink's default Docker image. I think 
{{FLINK_HOME/bin}} should not contain files which cannot be directly used. 
Either we move them to another directory or we incorporate it into Flink's 
default Docker image. If we opt for the latter option, then this task is 
related to FLINK-12546.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15794) Rethink default value of kubernetes.container.image

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15794:
-

 Summary: Rethink default value of kubernetes.container.image
 Key: FLINK-15794
 URL: https://issues.apache.org/jira/browse/FLINK-15794
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


Currently, the default value of the configuration option 
{{kubernetes.container.image}} is set to {{flink:latest}}. This has the effect 
that we will always start the latest Flink version, independently of which 
version we are using to start the Kubernetes cluster.

I am a bit unsure whether this is the desired behaviour and what the user would 
actually expect to happen. Hence, this issue is more to discuss the pros and 
cons of the current default value and whether we should set it to a fixed 
version.

One problem I could see is that we are providing some files from the local 
Flink installation which might be incompatible with {{flink:latest}}. E.g. at 
the moment we think about upgrading Flink's log4j dependency to log4j2. Log4j2 
requires a different configuration file which will most likely replace the 
existing {{log4j.properties}} file in Flink's binary distribution. If we now 
start a K8s session cluster with an older version where we still have the old 
{{log4j.properties}} file, then the logging might not work with a version where 
we are using log4j2. Hence, it might be safer to pin the image version, or at 
least to pin the major version and only allow the latest bug fix version.
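
For illustration, pinning the image to the release matching the client would 
look like this in flink-conf.yaml (the version tag is an example):

{noformat}
kubernetes.container.image: flink:1.10.0
{noformat}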

If we should decide to fix the version, then we would need to update the 
default value with every major release. If this is the case, then the [release 
guide|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release]
 needs to be updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15795) KubernetesClusterDescriptor seems to report unreachable JobManager interface URL

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15795:
-

 Summary: KubernetesClusterDescriptor seems to report unreachable 
JobManager interface URL
 Key: FLINK-15795
 URL: https://issues.apache.org/jira/browse/FLINK-15795
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


The {{KubernetesClusterDescriptor}} reports the {{JobManager}} web interface 
URL when deploying the cluster. In my case (using almost the standard 
configuration), it reported an unreachable URL which consisted of the K8s 
cluster's endpoint IP. This might be a configuration problem of the K8s cluster 
but at the same time, the descriptor deployed a LoadBalancer (default 
configuration) through which one could reach the web UI. In this case, I would 
have wished that the {{LoadBalancer's}} IP and port were reported instead.
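
For reference, the LoadBalancer's externally reachable address can be looked up 
with standard kubectl:

{noformat}
kubectl get services
# the EXTERNAL-IP and PORT(S) columns give the reachable web UI endpoint
{noformat}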



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15796) Extend Kubernetes documentation on how to use a custom image

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15796:
-

 Summary: Extend Kubernetes documentation on how to use a custom 
image
 Key: FLINK-15796
 URL: https://issues.apache.org/jira/browse/FLINK-15796
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


I think it could be helpful to extend the existing [Kubernetes 
documentation|https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html]
 with a section which describes how to use it with a custom Flink image. This 
might be a short section.
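
For illustration, such a section could center on a Dockerfile as small as this 
(the base image tag, jar name, and target path are examples):

{noformat}
FROM flink:1.10.0
# bake user code / extra connectors into the image
COPY my-job.jar /opt/flink/lib/my-job.jar
{noformat}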



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15797) Reduce logging noise of Fabric8FlinkKubeClient

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15797:
-

 Summary: Reduce logging noise of Fabric8FlinkKubeClient
 Key: FLINK-15797
 URL: https://issues.apache.org/jira/browse/FLINK-15797
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


The {{Fabric8FlinkKubeClient}} logs a lot of information on log level {{INFO}}. 
Some of this information is quite specific and not needed by most users. In 
order to reduce the log noise for our users, I would suggest logging the K8s 
resource specs only on log level {{DEBUG}}. The handling of {{Watcher}} events 
does not need to be logged on {{INFO}} level either. One could, for example, 
log on {{INFO}} that a service with a given name is being created, and only 
include the specs if the log level is {{DEBUG}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15798) Running ./bin/kubernetes-session.sh -Dkubernetes.cluster-id= -Dexecution.attached=true fails with exception

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15798:
-

 Summary: Running ./bin/kubernetes-session.sh 
-Dkubernetes.cluster-id= -Dexecution.attached=true fails with 
exception
 Key: FLINK-15798
 URL: https://issues.apache.org/jira/browse/FLINK-15798
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


Running {{./bin/kubernetes-session.sh -Dkubernetes.cluster-id= 
-Dexecution.attached=true}} fails with 

{code}
2020-01-28 15:04:28,669 ERROR 
org.apache.flink.kubernetes.cli.KubernetesSessionCli  - Error while 
running the Flink session.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET 
at: https://35.234.77.125/api/v1/namespaces/default/services/testing. Message: 
Unauthorized! Token may have expired! Please log-in again. Unauthorized.
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:510)
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:447)
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:413)
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:337)
at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:318)
at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:812)
at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:220)
at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:164)
at 
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.getService(Fabric8FlinkKubeClient.java:330)
at 
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.getInternalService(Fabric8FlinkKubeClient.java:243)
at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.run(KubernetesSessionCli.java:104)
at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.lambda$main$0(KubernetesSessionCli.java:185)
at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.main(KubernetesSessionCli.java:185)
{code}

even though {{echo "stop" | ./bin/kubernetes-session.sh 
-Dkubernetes.cluster-id= -Dexecution.attached=true}} succeeds with my setup. 
This is strange, as I would expect the former call to do exactly the same as 
the latter, except for sending the "stop" command right away. I think we 
should check whether this is a real problem or only something specific to my 
setup.
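
For reference, the failing request boils down to a plain service lookup. A 
minimal standalone sketch of the equivalent fabric8 call (namespace and 
resource name taken from the URL in the stack trace; the class below is 
illustrative, not Flink code):

{code}
import io.fabric8.kubernetes.api.model.Service;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;

class ServiceLookupSketch {
    Service lookup(String clusterId) {
        // Auto-configures from the current kubeconfig, as the session CLI does.
        try (DefaultKubernetesClient client = new DefaultKubernetesClient()) {
            // The same GET that fails with "Unauthorized" in the stack trace above.
            return client.services().inNamespace("default").withName(clusterId).get();
        }
    }
}
{code}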



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15799) Support running kubernetes.sh against a different context than the current context

2020-01-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-15799:
-

 Summary: Support running kubernetes.sh against a different context 
than the current context
 Key: FLINK-15799
 URL: https://issues.apache.org/jira/browse/FLINK-15799
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.11.0, 1.10.1


While trying out Flink's new K8s integration, I was wondering whether there is 
a way to let {{kubernetes.sh}} run against a different context than the one 
currently configured via {{kubectl}}. This could be helpful if one has multiple 
contexts configured and wants to administer different Flink clusters on 
different K8s clusters/contexts.
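
As a sketch of how this could work: fabric8's client can already load a named 
kubeconfig context, so wiring a new option through might look roughly like the 
following ({{kubernetes.context}} is a hypothetical option name, not an 
existing Flink setting):

{code}
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

class ContextSelectionSketch {
    KubernetesClient createClient(String context) {
        // A non-null context makes fabric8 load that kubeconfig context;
        // null falls back to the currently active one.
        Config config = Config.autoConfigure(context);
        return new DefaultKubernetesClient(config);
    }
}
{code}

The CLI could then be invoked e.g. as {{./bin/kubernetes-session.sh 
-Dkubernetes.context=my-other-context}} (again, a hypothetical flag).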



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: connection timeout during shuffle initialization

2020-01-28 Thread Stephan Ewen
Hi!

Concerning JAR files: I think this has nothing to do with it; it is a batch
shuffle, after all. The previous stage must have completed already.

A few things that come to my mind:
  - What Flink version are you using? 1.9?
  - Are you sure that the source TaskManager is still running? Earlier
Flink versions had an issue with releasing TMs too early, sometimes before
the result was fetched by a consumer.
  - The buffer creation on the sender / netty server side is more expensive
than necessary, but should be nowhere near expensive enough to cause a stall.

Can you elaborate on what the shared lock is that all server threads are
using?

Best,
Stephan


On Tue, Jan 28, 2020 at 4:23 PM Piotr Nowojski  wrote:

> Hi,
>
> > In case of large jar, wouldn't this happen in previous stages as well (if
> > so this should not be the case)?
>
> I’m not exactly sure how jars are distributed, but if they are being
> sent/uploaded from one node (or some other static/fixed number of nodes,
> e.g. uploading to and reading from a DFS) to all, this might not scale
> well. Also, your dev deployment might not be stressing network/storage as
> much as the production deployment, which can also affect the time to
> deploy the job.
>
> What’s your job size? (How large is the jar uploaded to Flink?)
>
> Also, there might be other factors in play here. For example, if you are
> using Flink’s job mode (not standalone), the time to start up a Flink node
> might be too long: some nodes are already up and running, and they time
> out waiting for the others to start up.
>
> > Also there shouldn't be any state involved
> > (unless Beam IO's use it internally).
>
> My bad. Instead of
>
> > - data size per single job run ranging from 100G to 1TB
>
> I read it as a state size of 100G to 1TB.
>
> Piotrek
>
> >
> > On Tue, Jan 28, 2020 at 11:45 AM Piotr Nowojski 
> wrote:
> >
> >> Hi David,
> >>
> >> The usual cause for a connection timeout is a long deployment, for
> >> example if your job’s jar is large and takes a long time to distribute
> >> across the cluster. I’m not sure if large state could affect this as
> >> well or not. Are you sure that’s not the case?
> >>
> >> The thing you are suggesting I haven’t heard about previously, but
> >> indeed theoretically it could happen. Reading from mmap’ed sub-partitions
> >> could block the Netty threads if the kernel decides to drop an mmap’ed
> >> page and it has to be read back from disk. Could you check your CPU and
> >> disk IO usage? This should be visible as high IOWait CPU usage. Could
> >> you for example post a couple of sample results of
> >>
> >>    iostat -xm 2
> >>
> >> command from some representative TaskManager? If the disks are indeed
> >> overloaded, changing Flink’s config option
> >>
> >>    taskmanager.network.bounded-blocking-subpartition-type
> >>
> >> from its default to `file` could solve the problem. FYI, this option
> >> was renamed in 1.10 to
> >>
> >>    taskmanager.network.blocking-shuffle.type
> >>
> >> and its default value will be `file`.
> >>
> >> We would appreciate it if you could get back to us with the results!
> >>
> >> Piotrek
> >>
> >>> On 28 Jan 2020, at 11:03, Till Rohrmann  wrote:
> >>>
> >>> Hi David,
> >>>
> >>> I'm unfortunately not familiar with these parts of Flink but I'm
> pulling
> >>> Piotr in who might be able to tell you more.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Mon, Jan 27, 2020 at 5:58 PM David Morávek  wrote:
> >>>
>  Hello community,
> 
>  I'm currently struggling with an Apache Beam batch pipeline on top of
>  Flink. The pipeline runs smoothly in smaller environments, but in
>  production it always ends up with `connection timeout` in one of the
>  last shuffle phases.
> 
>  org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException:
>  Connection for partition
>  260ddc26547df92babd1c6d430903b9d@da5da68d6d4600bb272d68172a09760f not
>  reachable.
>    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>    at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>    at java.lang.Thread.run(Thread.java:748)
>  ...
>  Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection timed out: ##
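
For reference, Piotr's suggested workaround can also be tried programmatically
in a local test setup. A minimal sketch using Flink's generic `Configuration`
string API with the 1.9 key named above (on a real cluster, setting the key in
flink-conf.yaml is the usual route; the class name is illustrative):

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    class BlockingShuffleToFileSketch {
        static ExecutionEnvironment createLocalEnv() {
            Configuration config = new Configuration();
            // 1.9 key; renamed to taskmanager.network.blocking-shuffle.type in 1.10.
            config.setString(
                "taskmanager.network.bounded-blocking-subpartition-type", "file");
            // Local environment for experimentation; on a cluster, set the key
            // in flink-conf.yaml instead.
            return ExecutionEnvironment.createLocalEnvironment(config);
        }
    }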

Re: [ANNOUNCE] Yu Li became a Flink committer

2020-01-28 Thread aihua li
Congratulations Yu Li, well deserved.

> 2020年1月23日 下午4:59,Stephan Ewen  写道:
> 
> Hi all!
> 
> We are announcing that Yu Li has joined the rank of Flink committers.
> 
> Yu joined already in late December, but the announcement got lost because
> of the Christmas and New Years season, so here is a belated proper
> announcement.
> 
> Yu is one of the main contributors to the state backend components in the
> recent year, working on various improvements, for example the RocksDB
> memory management for 1.10.
> He has also been one of the release managers for the big 1.10 release.
> 
> Congrats for joining us, Yu!
> 
> Best,
> Stephan



[jira] [Created] (FLINK-15800) Refactor download page to make hadoop less prominent

2020-01-28 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-15800:


 Summary: Refactor download page to make hadoop less prominent
 Key: FLINK-15800
 URL: https://issues.apache.org/jira/browse/FLINK-15800
 Project: Flink
  Issue Type: Improvement
  Components: Project Website
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler


The downloads page is listing shaded-hadoop downloads for every Flink release.

This is ultimately redundant, since we don't touch the packaging, and it 
reinforces the impression that putting shaded-hadoop into /lib is the "default" 
way of adding Hadoop, when users should first try exposing the Hadoop classpath 
instead.

As such, we should move shaded-hadoop into the additional components section.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)