Re: [DISCUSS] FLIP-36 - Support Interactive Programming in Flink Table API

2020-07-30 Thread Xuannan Su
Hi folks,

It seems that all the raised concerns so far have been resolved. I plan to 
start a voting thread for FLIP-36 early next week if there are no further comments.

Thanks,
Xuannan
On Jul 28, 2020, 7:42 PM +0800, Xuannan Su , wrote:
> Hi Kurt,
>
> Thanks for the comments.
>
> You are right that the FLIP lacks a proper discussion about the impact on the 
> optimizer. I have added a section describing how the cached table works with 
> the optimizer. I hope this resolves your concern. Please let me know if you 
> have any further comments.
>
> Best,
> Xuannan
> On Jul 22, 2020, 4:36 PM +0800, Kurt Young , wrote:
> > Thanks for the reply. I have one more comment about the effect on the
> > optimizer. Even if you are trying to make the cached table as orthogonal to
> > the optimizer as possible by introducing a special sink, it is still not
> > clear why this approach is safe. Maybe you can add a description of the
> > process from API to JobGraph; otherwise I can't be sure that everyone
> > reviewing the design doc has the same picture of it. And I'm also quite sure
> > some of the existing mechanisms will be affected by this special sink, e.g.
> > multi-sink optimization.
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Jul 22, 2020 at 2:31 PM Xuannan Su  wrote:
> >
> > > Hi Kurt,
> > >
> > > Thanks for the comments.
> > >
> > > 1. How do you identify the CachedTable?
> > > For the current design proposed in FLIP-36, we are using the first
> > > approach you mentioned, where the key of the map is the CachedTable Java
> > > object. I think it is fine that another table representing the same DAG is
> > > not identified and therefore does not use the cached intermediate result,
> > > because we want caching to be explicit. As mentioned in the FLIP, the
> > > cache API returns a Table object, and the user has to use that returned
> > > Table object to make use of the cached table. The rationale is that if the
> > > user builds the same DAG from scratch with some TableEnvironment instead
> > > of using the cached Table object, the user probably doesn't want to use
> > > the cache.
> > >
> > > 2. How does the CachedTable affect the optimizer?
> > > We try to make the logic dealing with the cached table as orthogonal to
> > > the optimizer as possible. That's why we introduce a special sink when we
> > > are going to cache a table and a special source when we are going to use a
> > > cached table. This way, we can let the optimizer do its work, and the
> > > logic of modifying the job graph can happen in the job graph generator. We
> > > can recognize the cached node by the special sink and source.
> > >
> > > 3. What's the effect of calling TableEnvironment.close()?
> > > We introduce the close method to prevent leaking the cached tables when
> > > the user is done with the table environment. Therefore, it makes more
> > > sense that the table environment, including all of its functionality,
> > > should not be used after closing. Otherwise, we should rename the close
> > > method to clearAllCache or something similar.
> > >
> > > And thanks for pointing out the use of a non-existent API in the given
> > > examples. I have updated the examples in the FLIP accordingly.
> > >
> > > Best,
> > > Xuannan
> > > On Jul 16, 2020, 4:15 PM +0800, Kurt Young , wrote:
> > > > Hi Xuanna,
> > > >
> > > > Thanks for the detailed design doc; it clearly describes how the API
> > > > looks and how it interacts with the Flink runtime.
> > > > However, the part that relates to SQL's optimizer is kind of blurry. To
> > > > be more precise, I have the following questions:
> > > >
> > > > 1. How do you identify the CachedTable? I can imagine there would be a
> > > > map representing the cache; how do you compare the keys of that map? One
> > > > approach is to compare them by Java object identity, which is simple but
> > > > has limited scope. For example, if a user creates another table through
> > > > some interface of the TableEnvironment and that table is exactly the
> > > > same as the cached one, you won't be able to identify it. Another choice
> > > > is calculating the "signature" or "digest" of the cached table, which
> > > > involves a string representation of the whole sub-tree represented by
> > > > the cached table. I don't think Flink currently provides such a
> > > > mechanism around Table, though.
> > > >
> > > > 2. How does the CachedTable affect the optimizer? Specifically, will you
> > > > have a dedicated QueryOperation for it, and will you have a dedicated
> > > > logical & physical RelNode for it? I also don't see a description of how
> > > > it works with the current optimization phases, from Operation to Calcite
> > > > rel node, and then to Flink's logical and physical node, which is finally
> > > > translated to Flink's exec node. There also exist other optimizations
> > > > such as the deadlock breaker, as well as sub-plan reuse inside
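
For reference, a minimal usage sketch of the cache API discussed in this thread. 
Table.cache() and its exact semantics follow the FLIP-36 draft and are not a 
released Flink API; everything else is standard Table API code, and the "orders" 
table is just an example name.

{code}
import org.apache.flink.table.api.*;
import static org.apache.flink.table.api.Expressions.$;

EnvironmentSettings settings =
        EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment tEnv = TableEnvironment.create(settings);

Table filtered = tEnv.from("orders").filter($("amount").isGreater(100));

// cache() returns a new Table object; only this returned object refers to the
// cached intermediate result (caching is explicit by design).
Table cached = filtered.cache();

// First job: computes the result and also writes it to the cache through the
// special sink described in the FLIP.
cached.execute().print();

// Later jobs that reuse the SAME Table object read from the cache through the
// special source instead of recomputing the DAG.
cached.groupBy($("user")).select($("user"), $("amount").sum()).execute().print();

// Rebuilding an identical DAG from scratch intentionally does NOT use the cache.
Table rebuilt = tEnv.from("orders").filter($("amount").isGreater(100));
{code}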

[jira] [Created] (FLINK-18761) Support Python DataStream API (Stateless part)

2020-07-30 Thread Hequn Cheng (Jira)
Hequn Cheng created FLINK-18761:
---

 Summary: Support Python DataStream API (Stateless part)
 Key: FLINK-18761
 URL: https://issues.apache.org/jira/browse/FLINK-18761
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream, API / Python
Reporter: Hequn Cheng


This is the umbrella Jira for FLIP-130, which intends to support Python 
DataStream API for the stateless part.

FLIP wiki page: 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298]

As we all know, Flink provides [three layered 
APIs|https://flink.apache.org/flink-applications.html#layered-apis]: the 
ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a 
different trade-off between conciseness and expressiveness and targets 
different use cases.

Currently, the SQL & Table API is already supported in PyFlink. The API offers 
relational operations as well as user-defined functions, providing convenience 
for users who are familiar with Python and relational programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs to 
implement stream processing applications. The ProcessFunctions expose time and 
state which are the fundamental building blocks for any kind of streaming 
application. To cover more use cases, we are planning to cover all these APIs 
in PyFlink.

In this FLIP, we propose to support the Python DataStream API for the stateless 
part. For more detail, please refer to the [FLIP wiki 
page|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298].
As for the stateful part, it will come in a follow-up after this FLIP.





[jira] [Created] (FLINK-18762) Make network buffers per incoming/outgoing channel can be configured separately

2020-07-30 Thread Yingjie Cao (Jira)
Yingjie Cao created FLINK-18762:
---

 Summary: Make network buffers per incoming/outgoing channel can be 
configured separately
 Key: FLINK-18762
 URL: https://issues.apache.org/jira/browse/FLINK-18762
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Reporter: Yingjie Cao
 Fix For: 1.12.0


In FLINK-16012, we want to decrease the default number of exclusive buffers at 
the receiver side from 2 to 1 to accelerate checkpointing in cases of 
backpressure. However, the numbers of buffers per outgoing and per incoming 
channel are currently configured by a single configuration key. It would be 
better and more flexible to make the network buffers per incoming/outgoing 
channel configurable separately. At the same time, we can keep the default 
behavior compatible.





[jira] [Created] (FLINK-18763) Support basic TypeInformation for Python DataStream API

2020-07-30 Thread Hequn Cheng (Jira)
Hequn Cheng created FLINK-18763:
---

 Summary: Support basic TypeInformation for Python DataStream API
 Key: FLINK-18763
 URL: https://issues.apache.org/jira/browse/FLINK-18763
 Project: Flink
  Issue Type: Sub-task
Reporter: Hequn Cheng


Supports basic TypeInformation including BasicTypeInfo, LocalTimeTypeInfo, 
PrimitiveArrayTypeInfo, RowTypeInfo. 

Types.ROW()/Types.ROW_NAMED()/Types.PRIMITIVE_ARRAY() should also be supported.





[jira] [Created] (FLINK-18764) Support from_collection for Python DataStream API

2020-07-30 Thread Hequn Cheng (Jira)
Hequn Cheng created FLINK-18764:
---

 Summary: Support from_collection for Python DataStream API
 Key: FLINK-18764
 URL: https://issues.apache.org/jira/browse/FLINK-18764
 Project: Flink
  Issue Type: Sub-task
Reporter: Hequn Cheng








[jira] [Created] (FLINK-18765) Support map() and flat_map() for Python DataStream API

2020-07-30 Thread Hequn Cheng (Jira)
Hequn Cheng created FLINK-18765:
---

 Summary: Support map() and flat_map() for Python DataStream API
 Key: FLINK-18765
 URL: https://issues.apache.org/jira/browse/FLINK-18765
 Project: Flink
  Issue Type: Sub-task
Reporter: Hequn Cheng








[jira] [Created] (FLINK-18766) Support add_sink() for Python DataStream API

2020-07-30 Thread Hequn Cheng (Jira)
Hequn Cheng created FLINK-18766:
---

 Summary: Support add_sink() for Python DataStream API
 Key: FLINK-18766
 URL: https://issues.apache.org/jira/browse/FLINK-18766
 Project: Flink
  Issue Type: Sub-task
Reporter: Hequn Cheng








Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Márton Balassi
Hi All,

Thanks for the write up and starting the discussion. I am in favor of
unifying the APIs the way described in the FLIP and deprecating the DataSet
API. I am looking forward to the detailed discussion of the changes
necessary.

Best,
Marton

On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
wrote:

> Hi Everyone,
>
> my colleagues (in cc) and I would like to propose this FLIP for
> discussion. In short, we want to reduce the number of APIs that we have
> by deprecating the DataSet API. This is a big step for Flink, that's why
> I'm also cross-posting this to the User Mailing List.
>
> FLIP-131: http://s.apache.org/FLIP-131
>
> I'm posting the introduction of the FLIP below but please refer to the
> document linked above for the full details:
>
> --
> Flink provides three main SDKs/APIs for writing Dataflow Programs: Table
> API/SQL, the DataStream API, and the DataSet API. We believe that this
> is one API too many and propose to deprecate the DataSet API in favor of
> the Table API/SQL and the DataStream API. Of course, this is easier said
> than done, so in the following, we will outline why we think that having
> too many APIs is detrimental to the project and community. We will then
> describe how we can enhance the Table API/SQL and the DataStream API to
> subsume the DataSet API's functionality.
>
> In this FLIP, we will not describe all the technical details of how the
> Table API/SQL and DataStream will be enhanced. The goal is to achieve
> consensus on the idea of deprecating the DataSet API. There will have to
> be follow-up FLIPs that describe the necessary changes for the APIs that
> we maintain.
> --
>
> Please let us know if you have any concerns or comments. Also, please
> keep discussion to this ML thread instead of commenting in the Wiki so
> that we can have a consistent view of the discussion.
>
> Best,
> Aljoscha
>


Re: [DISCUSS] Introduce SupportsParallelismReport and SupportsStatisticsReport for Hive and Filesystem

2020-07-30 Thread Kurt Young
Hi Jingsong,

Thanks for bringing up this discussion. In general, I'm +1 to enriching the
source abilities with parallelism and statistics reporting, but I'm not sure
whether introducing such "Supports" interfaces is a good idea. I will share my
thoughts separately.

1) Regarding the interface SupportsParallelismReport, first of all, my feeling
is that such a mechanism is not like other abilities such as
SupportsProjectionPushDown. The parallelism of the source operator will be
decided anyway; the only difference here is whether it's decided purely by the
framework or by the table source itself. So another angle to look at this issue
is that we can always assume a table source has the ability to determine the
parallelism: the table source can choose to set the parallelism by itself, or
delegate it to the framework.

This might sound like personal taste, but there is another bad case if we
introduce the interface. You
may already know we currently have two major table
sources, LookupTableSource and ScanTableSource.
IIUC it won't make much sense if the user provides a LookupTableSource and
also implements
SupportsParallelismReport.

An alternative solution would be to add the method you want directly to
ScanTableSource, with a default implementation returning -1, which means
letting the framework decide the parallelism.
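
As a rough illustration only (the method name and its placement directly on
ScanTableSource are hypothetical, not an agreed API), this alternative could
look like:

{code}
// Sketch of the alternative: instead of a separate SupportsParallelismReport
// interface, ScanTableSource itself could expose a hook with a default
// implementation. Method name and placement are hypothetical.
public interface ScanTableSource extends DynamicTableSource {

    // ... existing methods (getChangelogMode, getScanRuntimeProvider, ...) ...

    /**
     * Returns the desired parallelism of the source, or -1 to let the
     * framework decide (e.g. fall back to the configured default parallelism).
     */
    default int reportParallelism() {
        return -1;
    }
}
{code}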

2) Regarding the interface SupportsStatisticsReport, it seems this
interface doesn't work for unbounded
streaming table sources. What kind of implementation do you expect in such
a case? And how does this
interface work with LookupTableSource?
Another question is what the oldStats parameter is used for?

3) Internal or Public. I don't think we should mark them as internal. The fact
that they are currently only used by internal connectors doesn't mean these
interfaces should be internal. I can imagine there will be lots of
Filesystem-like connectors outside the project which need such a capability.

Best,
Kurt


On Thu, Jul 30, 2020 at 1:02 PM Benchao Li  wrote:

> Hi Jingsong,
>
> Regarding SupportsParallelismReport,
> I think the streaming connectors can also benefit from it.
> I see some requirements from the user ML that they want to control the
> source/sink parallelism instead of setting it to the global parallelism.
> Also, in our company, we did this too.
>
> On Thu, Jul 30, 2020 at 11:16 AM Jingsong Li  wrote:
>
> > Hi all,
> >
> > ## SupportsParallelismReport
> >
> > Now that FLIP-95 [1] is ready, only Hive and Filesystem are still using
> the
> > old interfaces.
> >
> > We are considering migrating to the new interface.
> >
> > However, one problem is that in the old interface implementation,
> > connectors infer the parallelism by themselves instead of using a global
> > parallelism configuration. Hive & filesystem determine the parallelism
> > according to the number and size of the files. In this way, large tables
> > may use a parallelism of thousands, while small tables only use a
> > parallelism of 10, which minimizes the task-scheduling overhead.
> >
> > This situation is very common in batch computing. For example, in a star
> > schema, a large table needs to be joined with multiple small tables.
> >
> > So we should give this ability to new table source interfaces. The
> > interface can be:
> >
> > /**
> >  * Enables a source to report its parallelism.
> >  *
> >  * <p>After filter push down and partition push down, the source can have
> >  * more information, which can help it infer a more effective parallelism.
> >  */
> > @Internal
> > public interface SupportsParallelismReport {
> >
> >     /**
> >      * Reports the parallelism of the source or sink. The parallelism of an
> >      * operator must be at least 1, or -1 (use the system default).
> >      */
> >     int reportParallelism();
> > }
> >
> >
> > Rejected Alternatives:
> > - SupportsSplitReport: What is the relationship between this split and the
> > split of FLIP-27? Do we have to match them one by one? I think they are two
> > independent things. In fact, in the design of FLIP-27, splits and
> > parallelism are not bound one to one.
> > - SupportsPartitionReport: What is a partition? Actually, in table/SQL, a
> > partition is a special concept of a table. It should not be mixed up with
> > parallelism.
> >
> > ## SupportsStatisticsReport
> >
> > As with parallelism, statistics reported by the source will be more
> > appropriate and accurate. After filter push down and partition push down,
> > the source has more information, which can help it infer more effective
> > statistics. However, if we only infer from the planner itself, there may be
> > a big gap between the statistics and the real situation.
> >
> > The interface:
> >
> > /**
> >  * Enables a {@link ScanTableSource} to report table statistics.
> >  *
> >  * <p>Statistics can be inferred from real data in real time, which is more
> >  * accurate than the statistics in the catalog.
> >  *
> >  * <p>After filter push down and partition push down, the s

Re: Checkpointing under backpressure

2020-07-30 Thread Arvid Heise
Dear all,

I just wanted to follow-up on this long discussion thread by announcing
that we implemented unaligned checkpoints in Flink 1.11. If you experience
long end-to-end checkpointing duration, you should try out unaligned
checkpoints [1] if the following applies:

   - Checkpointing is not bottlenecked by I/O to the state backend (possible
   reasons for such a bottleneck are slow connections, rate limits, or huge
   operator or user state).
   - You can attribute the long duration to slow data flow. An operator in
   the pipeline is severely lagging behind, and you can easily see it in the
   Flink Web UI.
   - You cannot alleviate the problem by adjusting the degree of parallelism
   of the slow operator, either because of temporary spikes or lags, or
   because you don't control the application in a platform-as-a-service
   architecture.

You can enable it in the flink-conf.yaml.
execution.checkpointing.unaligned: true

Or in your application:
env.getCheckpointConfig().enableUnalignedCheckpoints() (Java/Scala)
env.get_checkpoint_config().enable_unaligned_checkpoints() (Python)
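
For completeness, a minimal Java sketch that combines this with a regular
checkpoint interval (standard Flink 1.11 APIs; the 10 s interval is just an
example value):

{code}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// checkpoint every 10 seconds ...
env.enableCheckpointing(10_000L);

// ... and let checkpoint barriers overtake in-flight records under backpressure
env.getCheckpointConfig().enableUnalignedCheckpoints();
{code}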

Note that this relatively young feature still has a couple of limitations
that we resolve in future versions.

   - You cannot rescale or change the job graph when starting from an
   unaligned checkpoint; you have to take a savepoint before rescaling.
   Savepoints are always aligned, independent of the alignment setting of
   checkpoints. This feature has the highest priority and will be available in
   upcoming releases.
   - Flink currently does not support concurrent unaligned checkpoints.
   However, due to the more predictable and shorter checkpointing times,
   concurrent checkpoints might not be needed at all. Additionally, savepoints
   cannot happen concurrently to unaligned checkpoints, so they will take
   slightly longer.
   - SourceFunctions are user-defined, run a separate thread, and output
   records under lock. When they block because of backpressure, the induced
   checkpoints cannot acquire the lock and checkpointing duration increases.
   We will provide SourceFunctions a way to also avoid blocking and implement
   it for all sources in Flink core, but because the code is ultimately
   user-defined, we have no way to guarantee non-blocking behavior.
   Nevertheless, since only sources are affected, the checkpointing durations
   are still much lower and most importantly do not increase with further
   shuffles. Furthermore, Flink 1.11 also provides a new way to implement
   sources (FLIP-27). This new source interface has a better threading model,
   such that users do not create their own threads anymore and Flink can
   guarantee non-blocking behavior for these sources.
   - Unaligned checkpoints break an implicit guarantee with respect to
   watermarks during recovery. Currently, Flink generates the watermark as a
   first step of recovery instead of storing the latest watermark in the
   operators to ease rescaling. For unaligned checkpoints, this means that, on
   recovery, Flink generates watermarks after it restores in-flight data. If
   your pipeline uses an operator that applies the latest watermark on each
   record, it will produce different results than for aligned checkpoints. If
   your operator depends on the latest watermark being always available, then
   the proper solution is to store the watermark in the operator state. To
   support rescaling, watermarks should be stored per key-group in a
   union-state. This feature has a high priority.
   - Lastly, there is a conceptual weakness in unaligned checkpoints: when
   an operator produces an arbitrary amount of outputs for a single input,
   such as flatMap, all of these output records need to be stored into the
   state for the unaligned checkpoint, which may increase the state size by
   orders of magnitudes and slow down checkpointing and recovery. However,
   since flatMap only needs alignment after a shuffle and rarely produces a
   huge number of records for a single input, it’s more of a theoretic
   problem.

Lastly, we also plan to improve the configurations, such that ultimately,
unaligned checkpoints will be the default configuration.

   - Users will be able to configure a timeout, such that each operator
   first tries to perform an aligned checkpoint. If the timeout is triggered,
   it switches to an unaligned checkpoint. Since the timeout would only
   trigger in the niche use cases that unaligned checkpoints addresses, it
   would mostly perform an aligned checkpoint under no or low backpressure.
   Thus, together with the previously mentioned fixes for the limitation, this
   timeout would allow Flink to enable unaligned checkpoints by default.
   - Another idea is to let users define a maximum state size for the
   in-flight data. However, it might be hard for users to configure the size
   correctly, as it also requires knowing how many buffers are used in the
   respective application, and it might be even harder to actually use the size

[jira] [Created] (FLINK-18767) Streaming job stuck when disabling operator chaining

2020-07-30 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-18767:
---

 Summary: Streaming job stuck when disabling operator chaining
 Key: FLINK-18767
 URL: https://issues.apache.org/jira/browse/FLINK-18767
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.11.1, 1.10.1, 1.9.3, 1.8.3
Reporter: Nico Kruber


The following code is stuck sending data from the source to the map operator. 
Two settings seem to have an influence here: {{env.setBufferTimeout(-1);}} and 
{{env.disableOperatorChaining();}} - if I remove either of these, the job works 
as expected.

(I pre-populated my Kafka topic with one element to reproduce easily)

{code}
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

// comment out either of these two settings and the job works
env.setBufferTimeout(-1);
env.disableOperatorChaining();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties);
consumer.setStartFromEarliest();
DataStreamSource<String> input = env.addSource(consumer);

input
        .map((x) -> x)
        .print();

env.execute();
{code}





[jira] [Created] (FLINK-18768) Improve SQL kafka connector docs about passing kafka properties

2020-07-30 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-18768:
--

 Summary: Improve SQL kafka connector docs about passing kafka 
properties
 Key: FLINK-18768
 URL: https://issues.apache.org/jira/browse/FLINK-18768
 Project: Flink
  Issue Type: Improvement
Reporter: Leonard Xu


The SQL Kafka connector supports passing through Kafka properties such as 
'message.max.bytes', 'replica.fetch.max.bytes' and other Kafka properties, but 
the current docs and tests miss this case.

I think we can add some tests and an explanation in the docs.
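
For illustration, a hedged sketch of what such a documented example could look 
like, assuming the connector forwards options with the 'properties.' prefix to 
the Kafka client; the table schema and the two size values are made up, and the 
property names are the ones mentioned above.

{code}
// Hypothetical doc example: Kafka client properties are passed through the DDL
// via the 'properties.' key prefix. Schema and values are illustrative only.
tableEnv.executeSql(
    "CREATE TABLE kafka_table ("
        + "  user_id STRING,"
        + "  amount INT"
        + ") WITH ("
        + "  'connector' = 'kafka',"
        + "  'topic' = 'user_behavior',"
        + "  'properties.bootstrap.servers' = 'localhost:9092',"
        + "  'properties.group.id' = 'test-group',"
        + "  'properties.message.max.bytes' = '10485760',"
        + "  'properties.replica.fetch.max.bytes' = '10485760',"
        + "  'format' = 'json'"
        + ")");
{code}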





[jira] [Created] (FLINK-18769) Streaming Table job stuck when enabling minibatching

2020-07-30 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-18769:
---

 Summary: Streaming Table job stuck when enabling minibatching
 Key: FLINK-18769
 URL: https://issues.apache.org/jira/browse/FLINK-18769
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.11.1
Reporter: Nico Kruber


The following Table API streaming job gets stuck when mini-batching is enabled

{code}
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings settings =
        EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

// the job gets stuck with mini-batching enabled as below;
// set 'table.exec.mini-batch.enabled' to "false" to get a result
Configuration tableConf = tableEnv.getConfig().getConfiguration();
tableConf.setString("table.exec.mini-batch.enabled", "true");
tableConf.setString("table.exec.mini-batch.allow-latency", "5 s");
tableConf.setString("table.exec.mini-batch.size", "5000");
tableConf.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");

tableEnv.executeSql(
"CREATE TABLE input_table ("
+ "location STRING, "
+ "population INT"
+ ") WITH ("
+ "'connector' = 'kafka', "
+ "'topic' = 'kafka_batching_input', "
+ "'properties.bootstrap.servers' = 'localhost:9092', "
+ "'format' = 'csv', "
+ "'scan.startup.mode' = 'earliest-offset'"
+ ")");

tableEnv.executeSql(
    "CREATE TABLE result_table WITH ('connector' = 'print') "
        + "LIKE input_table (EXCLUDING OPTIONS)");

tableEnv
    .from("input_table")
    .groupBy($("location"))
    .select(
        $("location").cast(DataTypes.CHAR(2)).as("location"),
        $("population").sum().as("population"))
    .executeInsert("result_table");
{code}

I am using a pre-populated Kafka topic called {{kafka_batching_input}} with 
these elements:
{code}
"Berlin",1
"Berlin",2
{code}





Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Arvid Heise
+1 of getting rid of the DataSet API. Is DataStream#iterate already
superseding DataSet iterations or would that also need to be accounted for?

In general, all surviving APIs should also offer a smooth experience for
switching back and forth.

On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi 
wrote:

> Hi All,
>
> Thanks for the write up and starting the discussion. I am in favor of
> unifying the APIs the way described in the FLIP and deprecating the DataSet
> API. I am looking forward to the detailed discussion of the changes
> necessary.
>
> Best,
> Marton
>
> On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
> wrote:
>
>> Hi Everyone,
>>
>> my colleagues (in cc) and I would like to propose this FLIP for
>> discussion. In short, we want to reduce the number of APIs that we have
>> by deprecating the DataSet API. This is a big step for Flink, that's why
>> I'm also cross-posting this to the User Mailing List.
>>
>> FLIP-131: http://s.apache.org/FLIP-131
>>
>> I'm posting the introduction of the FLIP below but please refer to the
>> document linked above for the full details:
>>
>> --
>> Flink provides three main SDKs/APIs for writing Dataflow Programs: Table
>> API/SQL, the DataStream API, and the DataSet API. We believe that this
>> is one API too many and propose to deprecate the DataSet API in favor of
>> the Table API/SQL and the DataStream API. Of course, this is easier said
>> than done, so in the following, we will outline why we think that having
>> too many APIs is detrimental to the project and community. We will then
>> describe how we can enhance the Table API/SQL and the DataStream API to
>> subsume the DataSet API's functionality.
>>
>> In this FLIP, we will not describe all the technical details of how the
>> Table API/SQL and DataStream will be enhanced. The goal is to achieve
>> consensus on the idea of deprecating the DataSet API. There will have to
>> be follow-up FLIPs that describe the necessary changes for the APIs that
>> we maintain.
>> --
>>
>> Please let us know if you have any concerns or comments. Also, please
>> keep discussion to this ML thread instead of commenting in the Wiki so
>> that we can have a consistent view of the discussion.
>>
>> Best,
>> Aljoscha
>>
>

-- 

Arvid Heise | Senior Java Developer



Follow us @VervericaData

--

Join Flink Forward  - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng


[jira] [Created] (FLINK-18770) Emitting element fails in KryoSerializer

2020-07-30 Thread Leonid Ilyevsky (Jira)
Leonid Ilyevsky created FLINK-18770:
---

 Summary: Emitting element fails in KryoSerializer
 Key: FLINK-18770
 URL: https://issues.apache.org/jira/browse/FLINK-18770
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.11.1
 Environment: Flink 1.11.1, Linux
Reporter: Leonid Ilyevsky
 Attachments: KryoException.txt, SolaceSource.java

I wrote a simple Flink connector for Solace, see the attached Java file. It 
works fine in the local execution environment. However, when I deployed it on a 
real Flink cluster, it failed with the attached Kryo exception.

After a few hours of searching and debugging, I can now see what is going on.

The data I want to emit from this source is a simple byte array. In the 
exception stack you can see that when I call 'collect' on the context, it goes 
into OperatorChain.java:715, and then to KryoSerializer, where it ultimately 
fails. I didn't have a chance to learn what KryoSerializer is and why it would 
not know what to do with byte[], but that is not the point now.

Then I used a debugger in my local test to figure out how it manages to work. I 
saw that after OperatorChain.java:715 it goes into BytePrimitiveArraySerializer, 
and then everything works as expected. Obviously BytePrimitiveArraySerializer 
makes sense for byte[] data.

The question is, how can I configure the execution environment on the cluster so 
that it serializes the same way as the local one? I looked at 
[https://ci.apache.org/projects/flink/flink-docs-stable/dev/execution_configuration.html],
and I was thinking of setting disableForceKryo, but it says it is disabled by 
default anyway.

 

Another question is, why does the cluster execution environment have different 
default settings compared to the local one? This makes it difficult to rely on 
local tests.
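
Not part of the report, but a possible workaround sketch using standard 
DataStream API calls: declare the produced type explicitly so that Flink picks 
the primitive byte-array serializer instead of falling back to Kryo.

{code}
// Sketch of a possible workaround (not from the report): declare the produced
// type explicitly so Flink uses BytePrimitiveArraySerializer instead of
// treating byte[] as a generic (Kryo) type. Alternatively, the source could
// implement ResultTypeQueryable<byte[]>.
DataStream<byte[]> stream = env
        .addSource(new SolaceSource())  // the custom source from the report
        .returns(PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
{code}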





[jira] [Created] (FLINK-18771) "Kerberized YARN per-job on Docker test" failed with "Client cannot authenticate via:[TOKEN, KERBEROS]"

2020-07-30 Thread Dian Fu (Jira)
Dian Fu created FLINK-18771:
---

 Summary: "Kerberized YARN per-job on Docker test" failed with 
"Client cannot authenticate via:[TOKEN, KERBEROS]"
 Key: FLINK-18771
 URL: https://issues.apache.org/jira/browse/FLINK-18771
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.12.0
Reporter: Dian Fu


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5047&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529

{code}
2020-07-30T11:18:39.9217453Z java.io.IOException: Failed on local exception: 
java.io.IOException: org.apache.hadoop.security.AccessControlException: Client 
cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: 
"worker1.docker-hadoop-cluster-network/172.19.0.5"; destination host is: 
"master.docker-hadoop-cluster-network":9000; 
{code}





[jira] [Created] (FLINK-18772) Hide submit job web ui elements when running in per-job/application mode

2020-07-30 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-18772:
-

 Summary: Hide submit job web ui elements when running in 
per-job/application mode
 Key: FLINK-18772
 URL: https://issues.apache.org/jira/browse/FLINK-18772
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Web Frontend
Affects Versions: 1.11.1, 1.10.1, 1.12.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.12.0


When running Flink in per-job/application mode, we already instantiate a 
{{MiniDispatcherRestEndpoint}} which does not initialize the job submission 
handlers. However, the web UI still shows the submit-job link, which can be 
confusing for our users. Hence, I propose to disable the web-UI-based job 
submission via {{web.submit.enable}} when running in per-job/application mode.





Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Kurt Young
+1, looking forward to the follow up FLIPs.

Best,
Kurt


On Thu, Jul 30, 2020 at 6:40 PM Arvid Heise  wrote:

> +1 of getting rid of the DataSet API. Is DataStream#iterate already
> superseding DataSet iterations or would that also need to be accounted for?
>
> In general, all surviving APIs should also offer a smooth experience for
> switching back and forth.
>
> On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi 
> wrote:
>
> > Hi All,
> >
> > Thanks for the write up and starting the discussion. I am in favor of
> > unifying the APIs the way described in the FLIP and deprecating the
> DataSet
> > API. I am looking forward to the detailed discussion of the changes
> > necessary.
> >
> > Best,
> > Marton
> >
> > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
> > wrote:
> >
> >> Hi Everyone,
> >>
> >> my colleagues (in cc) and I would like to propose this FLIP for
> >> discussion. In short, we want to reduce the number of APIs that we have
> >> by deprecating the DataSet API. This is a big step for Flink, that's why
> >> I'm also cross-posting this to the User Mailing List.
> >>
> >> FLIP-131: http://s.apache.org/FLIP-131
> >>
> >> I'm posting the introduction of the FLIP below but please refer to the
> >> document linked above for the full details:
> >>
> >> --
> >> Flink provides three main SDKs/APIs for writing Dataflow Programs: Table
> >> API/SQL, the DataStream API, and the DataSet API. We believe that this
> >> is one API too many and propose to deprecate the DataSet API in favor of
> >> the Table API/SQL and the DataStream API. Of course, this is easier said
> >> than done, so in the following, we will outline why we think that having
> >> too many APIs is detrimental to the project and community. We will then
> >> describe how we can enhance the Table API/SQL and the DataStream API to
> >> subsume the DataSet API's functionality.
> >>
> >> In this FLIP, we will not describe all the technical details of how the
> >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
> >> consensus on the idea of deprecating the DataSet API. There will have to
> >> be follow-up FLIPs that describe the necessary changes for the APIs that
> >> we maintain.
> >> --
> >>
> >> Please let us know if you have any concerns or comments. Also, please
> >> keep discussion to this ML thread instead of commenting in the Wiki so
> >> that we can have a consistent view of the discussion.
> >>
> >> Best,
> >> Aljoscha
> >>
> >
>
> --
>
> Arvid Heise | Senior Java Developer
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
>


Re: [DISCUSS] Introduce SupportsParallelismReport and SupportsStatisticsReport for Hive and Filesystem

2020-07-30 Thread godfrey he
Thanks Jingsong for bringing up this discussion,
and thanks Kurt for the detailed thoughts.

First of all, I also think it's a very useful feature to expose more
abilities to the table source.

1) If we want to support [1], it seems that SupportsParallelismReport
does not meet the requirement if there are multiple Transformations in the
source op and they require different parallelisms.

2) Regarding "SupportsXXX" for ScanTableSource or LookupTableSource:
currently, we do not distinguish them for the existing "SupportsXXX" either.
For example, a LookupTableSource should not extend SupportsWatermarkPushDown
or SupportsComputedColumnPushDown.
A DynamicTableSource sub-class will extend a "SupportsXXX" interface only if
it has the capability.
So an unbounded table source should not extend SupportsStatisticsReport,
or it should just return unknown statistics for the unbounded case if the
table source can work for both bounded and unbounded.

I think SupportsStatisticsReport is a supplement to the catalog statistics,
which means SupportsStatisticsReport only takes effect when the catalog
statistics are unknown.

3) +1 to making them public.

[1] https://issues.apache.org/jira/browse/FLINK-18674

Best,
Godfrey
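
A hypothetical illustration of the precedence described in point 2 above; this 
is not an actual Flink code path, and the reportStatistics(oldStats) signature 
is assumed from this thread rather than a finalized API:

{code}
// Hypothetical sketch: source-reported statistics are only consulted when the
// catalog statistics are unknown. SupportsStatisticsReport and its method
// signature are assumed from this discussion, not an existing Flink interface.
static TableStats resolveStats(TableStats catalogStats, DynamicTableSource source) {
    if (TableStats.UNKNOWN.equals(catalogStats)
            && source instanceof SupportsStatisticsReport) {
        return ((SupportsStatisticsReport) source).reportStatistics(catalogStats);
    }
    return catalogStats;
}
{code}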



On Thu, Jul 30, 2020 at 4:01 PM Kurt Young  wrote:

> Hi Jingsong,
>
> Thanks for bringing up this discussion. In general, I'm +1 to enrich the
> source ability by
> the parallelism and stats reporting, but I'm not sure whether introducing
> such "Supports"
> interface is a good idea. I will share my thoughts separately.
>
> 1) Regarding the interface SupportsParallelismReport, first of all, my
> feeling is that such a mechanism
> is not like other abilities like SupportsProjectionPushDown. Parallelism of
> source operator would be
> decided anyway, the only difference here is whether it's decided purely by
> framework or by table source
> itself. So another angle to understand this issue is, we can always assume
> a table source has the
> ability to determine the parallelism. The table source can choose to set
> the parallelism by itself, or delegate
> it to the framework.
>
> This might sound like personal taste, but there is another bad case if we
> introduce the interface. You
> may already know we currently have two major table
> sources, LookupTableSource and ScanTableSource.
> IIUC it won't make much sense if the user provides a LookupTableSource and
> also implements
> SupportsParallelismReport.
>
> An alternative solution would be to add the method you want directly to
> ScanTableSource, with a default implementation returning -1, which means
> letting the framework decide the parallelism.
>
> 2) Regarding the interface SupportsStatisticsReport, it seems this
> interface doesn't work for unbounded
> streaming table sources. What kind of implementation do you expect in such
> a case? And how does this
> interface work with LookupTableSource?
> Another question is what the oldStats parameter is used for?
>
> 3) Internal or Public. I don't think we should mark them as internal. The
> fact that they are currently only used by internal connectors doesn't mean
> these interfaces should be internal. I can imagine there will be lots of
> Filesystem-like connectors outside the project which need such a capability.
>
> Best,
> Kurt
>
>
> On Thu, Jul 30, 2020 at 1:02 PM Benchao Li  wrote:
>
> > Hi Jingsong,
> >
> > Regarding SupportsParallelismReport,
> > I think the streaming connectors can also benefit from it.
> > I see some requirements from the user ML that they want to control the
> > source/sink parallelism instead of setting it to the global parallelism.
> > Also, in our company, we did this too.
> >
> > On Thu, Jul 30, 2020 at 11:16 AM Jingsong Li  wrote:
> >
> > > Hi all,
> > >
> > > ## SupportsParallelismReport
> > >
> > > Now that FLIP-95 [1] is ready, only Hive and Filesystem are still using
> > the
> > > old interfaces.
> > >
> > > We are considering migrating to the new interface.
> > >
> > > However, one problem is that in the old interface implementation,
> > > connectors infer the parallelism by themselves instead of using a global
> > > parallelism configuration. Hive & filesystem determine the parallelism
> > > according to the number and size of the files. In this way, large tables
> > > may use a parallelism of thousands, while small tables only use a
> > > parallelism of 10, which minimizes the task-scheduling overhead.
> > >
> > > This situation is very common in batch computing. For example, in the
> > star
> > > model, a large table needs to be joined with multiple small tables.
> > >
> > > So we should give this ability to new table source interfaces. The
> > > interface can be:
> > >
> > > /**
> > >  * Enables to give source the ability to report parallelism.
> > >  *
> > >  * After filtering push down and partition push down, the source
> > > can have more information,
> > >  * which can help it infer more effective parallelism.
> > >  */
> > > @Internal
> > > public interface SupportsParallelismReport {
> > >
> > >/**
> > > * Report parallelism from 

Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Till Rohrmann
+1 for this effort. Great to see that we are making progress towards our
goal of a truly unified batch and stream processing engine.

Cheers,
Till

On Thu, Jul 30, 2020 at 2:28 PM Kurt Young  wrote:

> +1, looking forward to the follow up FLIPs.
>
> Best,
> Kurt
>
>
> On Thu, Jul 30, 2020 at 6:40 PM Arvid Heise  wrote:
>
>> +1 of getting rid of the DataSet API. Is DataStream#iterate already
>> superseding DataSet iterations or would that also need to be accounted
>> for?
>>
>> In general, all surviving APIs should also offer a smooth experience for
>> switching back and forth.
>>
>> On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi 
>> wrote:
>>
>> > Hi All,
>> >
>> > Thanks for the write up and starting the discussion. I am in favor of
>> > unifying the APIs the way described in the FLIP and deprecating the
>> DataSet
>> > API. I am looking forward to the detailed discussion of the changes
>> > necessary.
>> >
>> > Best,
>> > Marton
>> >
>> > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
>> > wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> my colleagues (in cc) and I would like to propose this FLIP for
>> >> discussion. In short, we want to reduce the number of APIs that we have
>> >> by deprecating the DataSet API. This is a big step for Flink, that's
>> why
>> >> I'm also cross-posting this to the User Mailing List.
>> >>
>> >> FLIP-131: http://s.apache.org/FLIP-131
>> >>
>> >> I'm posting the introduction of the FLIP below but please refer to the
>> >> document linked above for the full details:
>> >>
>> >> --
>> >> Flink provides three main SDKs/APIs for writing Dataflow Programs:
>> Table
>> >> API/SQL, the DataStream API, and the DataSet API. We believe that this
>> >> is one API too many and propose to deprecate the DataSet API in favor
>> of
>> >> the Table API/SQL and the DataStream API. Of course, this is easier
>> said
>> >> than done, so in the following, we will outline why we think that
>> having
>> >> too many APIs is detrimental to the project and community. We will then
>> >> describe how we can enhance the Table API/SQL and the DataStream API to
>> >> subsume the DataSet API's functionality.
>> >>
>> >> In this FLIP, we will not describe all the technical details of how the
>> >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
>> >> consensus on the idea of deprecating the DataSet API. There will have
>> to
>> >> be follow-up FLIPs that describe the necessary changes for the APIs
>> that
>> >> we maintain.
>> >> --
>> >>
>> >> Please let us know if you have any concerns or comments. Also, please
>> >> keep discussion to this ML thread instead of commenting in the Wiki so
>> >> that we can have a consistent view of the discussion.
>> >>
>> >> Best,
>> >> Aljoscha
>> >>
>> >
>>
>> --
>>
>> Arvid Heise | Senior Java Developer
>>
>> 
>>
>> Follow us @VervericaData
>>
>> --
>>
>> Join Flink Forward  - The Apache Flink
>> Conference
>>
>> Stream Processing | Event Driven | Real Time
>>
>> --
>>
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>>
>> --
>> Ververica GmbH
>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
>> (Toni) Cheng
>>
>


[jira] [Created] (FLINK-18773) Enable parallel classloading

2020-07-30 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-18773:
---

 Summary: Enable parallel classloading
 Key: FLINK-18773
 URL: https://issues.apache.org/jira/browse/FLINK-18773
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Affects Versions: 1.12.0
Reporter: Arvid Heise
Assignee: Arvid Heise


Currently, the user and plugin classloaders do not support parallel 
classloading. We should support that to accelerate job startup.





[jira] [Created] (FLINK-18774) Support debezium-avro format

2020-07-30 Thread Jark Wu (Jira)
Jark Wu created FLINK-18774:
---

 Summary: Support debezium-avro format 
 Key: FLINK-18774
 URL: https://issues.apache.org/jira/browse/FLINK-18774
 Project: Flink
  Issue Type: New Feature
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Jark Wu
 Fix For: 1.12.0


Debezium+Avro+Confluent Schema Registry is a popular pattern in the industry. 
It would be great if we could support this. This depends on the implementation 
of the {{avro-confluent}} format (FLINK-16048).

The format name is up for discussion. I would propose to use 
{{debezium-avro-confluent}} to make it explicit, as we may support the Apicurio 
Registry in the future.





Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
Just to contribute to the discussion, when we tried to do the migration we
faced some problems that could make migration quite difficult.
1 - It's difficult to test because of
https://issues.apache.org/jira/browse/FLINK-18647
2 - missing mapPartition
3 - missing <X> DataSet<X> runOperation(CustomUnaryOperation<T, X> operation)

On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise  wrote:

> +1 of getting rid of the DataSet API. Is DataStream#iterate already
> superseding DataSet iterations or would that also need to be accounted for?
>
> In general, all surviving APIs should also offer a smooth experience for
> switching back and forth.
>
> On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi 
> wrote:
>
> > Hi All,
> >
> > Thanks for the write up and starting the discussion. I am in favor of
> > unifying the APIs the way described in the FLIP and deprecating the
> DataSet
> > API. I am looking forward to the detailed discussion of the changes
> > necessary.
> >
> > Best,
> > Marton
> >
> > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek 
> > wrote:
> >
> >> Hi Everyone,
> >>
> >> my colleagues (in cc) and I would like to propose this FLIP for
> >> discussion. In short, we want to reduce the number of APIs that we have
> >> by deprecating the DataSet API. This is a big step for Flink, that's why
> >> I'm also cross-posting this to the User Mailing List.
> >>
> >> FLIP-131: http://s.apache.org/FLIP-131
> >>
> >> I'm posting the introduction of the FLIP below but please refer to the
> >> document linked above for the full details:
> >>
> >> --
> >> Flink provides three main SDKs/APIs for writing Dataflow Programs: Table
> >> API/SQL, the DataStream API, and the DataSet API. We believe that this
> >> is one API too many and propose to deprecate the DataSet API in favor of
> >> the Table API/SQL and the DataStream API. Of course, this is easier said
> >> than done, so in the following, we will outline why we think that having
> >> too many APIs is detrimental to the project and community. We will then
> >> describe how we can enhance the Table API/SQL and the DataStream API to
> >> subsume the DataSet API's functionality.
> >>
> >> In this FLIP, we will not describe all the technical details of how the
> >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
> >> consensus on the idea of deprecating the DataSet API. There will have to
> >> be follow-up FLIPs that describe the necessary changes for the APIs that
> >> we maintain.
> >> --
> >>
> >> Please let us know if you have any concerns or comments. Also, please
> >> keep discussion to this ML thread instead of commenting in the Wiki so
> >> that we can have a consistent view of the discussion.
> >>
> >> Best,
> >> Aljoscha
> >>
> >
>
> --
>
> Arvid Heise | Senior Java Developer
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Seth Wiesman
+1 Its time to drop DataSet

Flavio, those issues are expected. This FLIP isn't just to drop DataSet but
to also add the necessary enhancements to DataStream such that it works
well on bounded input.

On Thu, Jul 30, 2020 at 8:49 AM Flavio Pompermaier 
wrote:

> Just to contribute to the discussion, when we tried to do the migration we
> faced some problems that could make migration quite difficult.
> 1 - It's difficult to test because of
> https://issues.apache.org/jira/browse/FLINK-18647
> 2 - missing mapPartition
> 3 - missing <X> DataSet<X> runOperation(CustomUnaryOperation<T, X> operation)
>
> On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise  wrote:
>
> > +1 of getting rid of the DataSet API. Is DataStream#iterate already
> > superseding DataSet iterations or would that also need to be accounted
> for?
> >
> > In general, all surviving APIs should also offer a smooth experience for
> > switching back and forth.
> >
> > On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi  >
> > wrote:
> >
> > > Hi All,
> > >
> > > Thanks for the write up and starting the discussion. I am in favor of
> > > unifying the APIs the way described in the FLIP and deprecating the
> > DataSet
> > > API. I am looking forward to the detailed discussion of the changes
> > > necessary.
> > >
> > > Best,
> > > Marton
> > >
> > > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek  >
> > > wrote:
> > >
> > >> Hi Everyone,
> > >>
> > >> my colleagues (in cc) and I would like to propose this FLIP for
> > >> discussion. In short, we want to reduce the number of APIs that we
> have
> > >> by deprecating the DataSet API. This is a big step for Flink, that's
> why
> > >> I'm also cross-posting this to the User Mailing List.
> > >>
> > >> FLIP-131: http://s.apache.org/FLIP-131
> > >>
> > >> I'm posting the introduction of the FLIP below but please refer to the
> > >> document linked above for the full details:
> > >>
> > >> --
> > >> Flink provides three main SDKs/APIs for writing Dataflow Programs:
> Table
> > >> API/SQL, the DataStream API, and the DataSet API. We believe that this
> > >> is one API too many and propose to deprecate the DataSet API in favor
> of
> > >> the Table API/SQL and the DataStream API. Of course, this is easier
> said
> > >> than done, so in the following, we will outline why we think that
> having
> > >> too many APIs is detrimental to the project and community. We will
> then
> > >> describe how we can enhance the Table API/SQL and the DataStream API
> to
> > >> subsume the DataSet API's functionality.
> > >>
> > >> In this FLIP, we will not describe all the technical details of how
> the
> > >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
> > >> consensus on the idea of deprecating the DataSet API. There will have
> to
> > >> be follow-up FLIPs that describe the necessary changes for the APIs
> that
> > >> we maintain.
> > >> --
> > >>
> > >> Please let us know if you have any concerns or comments. Also, please
> > >> keep discussion to this ML thread instead of commenting in the Wiki so
> > >> that we can have a consistent view of the discussion.
> > >>
> > >> Best,
> > >> Aljoscha
> > >>
> > >
> >
> > --
> >
> > Arvid Heise | Senior Java Developer
> >
> > 
> >
> > Follow us @VervericaData
> >
> > --
> >
> > Join Flink Forward  - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >
> > --
> >
> > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >
> > --
> > Ververica GmbH
> > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> > (Toni) Cheng
>


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
I just wanted to be propositive about missing api.. :D

On Thu, Jul 30, 2020 at 4:29 PM Seth Wiesman  wrote:

> +1 Its time to drop DataSet
>
> Flavio, those issues are expected. This FLIP isn't just to drop DataSet
> but to also add the necessary enhancements to DataStream such that it works
> well on bounded input.
>
> On Thu, Jul 30, 2020 at 8:49 AM Flavio Pompermaier 
> wrote:
>
>> Just to contribute to the discussion, when we tried to do the migration we
>> faced some problems that could make migration quite difficult.
>> 1 - It's difficult to test because of
>> https://issues.apache.org/jira/browse/FLINK-18647
>> 2 - missing mapPartition
>> 3 - missing <X> DataSet<X> runOperation(CustomUnaryOperation<T, X> operation)
>>
>> On Thu, Jul 30, 2020 at 12:40 PM Arvid Heise  wrote:
>>
>> > +1 of getting rid of the DataSet API. Is DataStream#iterate already
>> > superseding DataSet iterations or would that also need to be accounted
>> for?
>> >
>> > In general, all surviving APIs should also offer a smooth experience for
>> > switching back and forth.
>> >
>> > On Thu, Jul 30, 2020 at 9:39 AM Márton Balassi <
>> balassi.mar...@gmail.com>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > Thanks for the write up and starting the discussion. I am in favor of
>> > > unifying the APIs the way described in the FLIP and deprecating the
>> > DataSet
>> > > API. I am looking forward to the detailed discussion of the changes
>> > > necessary.
>> > >
>> > > Best,
>> > > Marton
>> > >
>> > > On Wed, Jul 29, 2020 at 12:46 PM Aljoscha Krettek <
>> aljos...@apache.org>
>> > > wrote:
>> > >
>> > >> Hi Everyone,
>> > >>
>> > >> my colleagues (in cc) and I would like to propose this FLIP for
>> > >> discussion. In short, we want to reduce the number of APIs that we
>> have
>> > >> by deprecating the DataSet API. This is a big step for Flink, that's
>> why
>> > >> I'm also cross-posting this to the User Mailing List.
>> > >>
>> > >> FLIP-131: http://s.apache.org/FLIP-131
>> > >>
>> > >> I'm posting the introduction of the FLIP below but please refer to
>> the
>> > >> document linked above for the full details:
>> > >>
>> > >> --
>> > >> Flink provides three main SDKs/APIs for writing Dataflow Programs:
>> Table
>> > >> API/SQL, the DataStream API, and the DataSet API. We believe that
>> this
>> > >> is one API too many and propose to deprecate the DataSet API in
>> favor of
>> > >> the Table API/SQL and the DataStream API. Of course, this is easier
>> said
>> > >> than done, so in the following, we will outline why we think that
>> having
>> > >> too many APIs is detrimental to the project and community. We will
>> then
>> > >> describe how we can enhance the Table API/SQL and the DataStream API
>> to
>> > >> subsume the DataSet API's functionality.
>> > >>
>> > >> In this FLIP, we will not describe all the technical details of how
>> the
>> > >> Table API/SQL and DataStream will be enhanced. The goal is to achieve
>> > >> consensus on the idea of deprecating the DataSet API. There will
>> have to
>> > >> be follow-up FLIPs that describe the necessary changes for the APIs
>> that
>> > >> we maintain.
>> > >> --
>> > >>
>> > >> Please let us know if you have any concerns or comments. Also, please
>> > >> keep discussion to this ML thread instead of commenting in the Wiki
>> so
>> > >> that we can have a consistent view of the discussion.
>> > >>
>> > >> Best,
>> > >> Aljoscha
>> > >>
>> > >
>> >


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Aljoscha Krettek
That is good input! I was not aware that anyone was actually using 
`runCustomOperation()`. Out of curiosity, what are you using that for?


We have definitely thought about the first two points you mentioned, 
though. Especially processing-time will make it tricky to define unified 
execution semantics.


Best,
Aljoscha

On 30.07.20 17:10, Flavio Pompermaier wrote:
> I just wanted to be constructive about the missing APIs.. :D
>
> On Thu, Jul 30, 2020 at 4:29 PM Seth Wiesman  wrote:
>
>> +1 It's time to drop DataSet
>>
>> Flavio, those issues are expected. This FLIP isn't just to drop DataSet
>> but to also add the necessary enhancements to DataStream such that it works
>> well on bounded input.



Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Flavio Pompermaier
We use runCustomOperation to group a set of operators into a single
functional unit, just to make the code more modular..
It's very convenient indeed.
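
To illustrate, here is a minimal, hypothetical sketch of such a bundle built on
the DataSet API's CustomUnaryOperation; the ParseAndFilter class and its logic
are invented for illustration, only the setInput/createResult shape comes from
the DataSet API:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.CustomUnaryOperation;

// Hypothetical bundle: parses strings to longs and drops non-positive values,
// packaged as one reusable unit that callers plug in via runOperation(...).
public class ParseAndFilter implements CustomUnaryOperation<String, Long> {

    private DataSet<String> input;

    @Override
    public void setInput(DataSet<String> inputData) {
        this.input = inputData;
    }

    @Override
    public DataSet<Long> createResult() {
        // The bundle is free to build an arbitrary sub-graph of operators here.
        return input
                .map(new MapFunction<String, Long>() {
                    @Override
                    public Long map(String value) {
                        return Long.parseLong(value.trim());
                    }
                })
                .filter(new FilterFunction<Long>() {
                    @Override
                    public boolean filter(Long value) {
                        return value > 0L;
                    }
                });
    }
}

Callers would then write something like
DataSet<Long> parsed = lines.runOperation(new ParseAndFilter());
which is presumably the kind of modularity meant here.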

On Thu, Jul 30, 2020 at 5:20 PM Aljoscha Krettek 
wrote:

> That is good input! I was not aware that anyone was actually using
> `runCustomOperation()`. Out of curiosity, what are you using that for?
>
> We have definitely thought about the first two points you mentioned,
> though. Especially processing-time will make it tricky to define unified
> execution semantics.
>
> Best,
> Aljoscha
>


Re: [DISCUSS] FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)

2020-07-30 Thread Aljoscha Krettek
I see, we actually have some thoughts along that line as well. We have 
ideas about adding such functionality for `Transformation`, which is the 
graph structure that underlies both the DataStream API and the newer 
Table API Runner/Planner.


There is a very rough PoC for that available at [1]. It's a very contrived 
example, but it shows off what would be possible. The `Sink` interface 
here is just a subclass of the general `TransformationApply` [2], and we 
could envision a `DataStream.apply()` that lets you apply these general 
transformation "bundles".
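
To make the idea concrete, here is a purely hypothetical sketch of what such a
"bundle" could look like at the DataStream level; the StreamBundle interface and
ParsePositive class are invented for illustration and are not the
TransformationApply interfaces from the PoC (see links [1] and [2] further
below):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;

// Invented interface standing in for the idea of a reusable transformation "bundle".
public interface StreamBundle<IN, OUT> {
    DataStream<OUT> apply(DataStream<IN> input);
}

// A bundle that parses strings to longs and keeps only positive values.
class ParsePositive implements StreamBundle<String, Long> {
    @Override
    public DataStream<Long> apply(DataStream<String> input) {
        return input
                .map(Long::valueOf)
                .returns(Types.LONG)          // explicit type hint for the lambda
                .filter(value -> value > 0L);
    }
}

// Envisioned usage, if DataStream gained such an apply(...) entry point:
//   DataStream<Long> parsed = lines.apply(new ParsePositive());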


Keep in mind that these are just rough early ideas and the naming/location 
of things is still in flux. We might not do it like this in the end.


Best,
Aljoscha

[1] 
https://github.com/aljoscha/flink/blob/poc-transform-apply-sink/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/windowing/SinkExample.java


[2] 
https://github.com/aljoscha/flink/blob/poc-transform-apply-sink/flink-core/src/main/java/org/apache/flink/api/dag/TransformationApply.java


On 30.07.20 17:26, Flavio Pompermaier wrote:
> We use runCustomOperation to group a set of operators into a single
> functional unit, just to make the code more modular..
> It's very convenient indeed.


Re: [DISCUSS] FLIP-132: Temporal Table DDL

2020-07-30 Thread Seth Wiesman
Hi Leonard,

Thank you for pushing this. I think the updated syntax looks really good
and the semantics make sense to me.

+1

Seth

On Wed, Jul 29, 2020 at 11:36 AM Leonard Xu  wrote:

> Hi, Konstantin
>
> >
> > 1) A  "Versioned Temporal Table DDL on source" can only be joined on the
> > PRIMARY KEY attribute, correct?
> Yes, the PRIMARY KEY would be the join key.
>
> >
> > 2) Isn't it the time attribute in the ORDER BY clause of the VIEW
> > definition that defines
> > whether an event-time or processing-time temporal table join is used?
>
> I think whether an event-time or a processing-time temporal table join is
> used depends on the fact table's time attribute in the temporal join, rather
> than on the temporal table side; the event-time or processing-time attribute
> in the temporal table is only used to split the validity periods of the
> versioned snapshots of the temporal table. The processing-time attribute is
> not necessary for a temporal table without a version; only the primary key is
> required, and the following VIEW is also valid for a temporal table without a
> version:
>
> CREATE VIEW latest_rates AS
> SELECT currency, LAST_VALUE(rate)   -- only keep the latest version
> FROM rates
> GROUP BY currency;                  -- inferred primary key
>
>
> >
> > 3) A "Versioned Temporal Table DDL on source" is always versioned on
> > operation_time regardless of the lookup table attribute (event-time or
> > processing time attribute), correct?
>
>
> Yes, the semantics of `FOR SYSTEM_TIME AS OF o.time` is to use the o.time
> value to look up the corresponding version of the temporal table.
> If the fact table has a processing-time attribute, it means we only look up
> the latest version of the temporal table, and we can do some optimization in
> the implementation, such as only keeping the latest version.
>
>
> Best
> Leonard
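
For illustration, a minimal sketch of the kind of event-time temporal join
described in the quoted explanation above; it uses the FOR SYSTEM_TIME AS OF
syntax proposed in the FLIP, and the table/column names (orders,
versioned_rates, order_time) as well as the omitted DDL are made up:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TemporalJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assumes 'orders' (fact table with event-time attribute order_time) and
        // 'versioned_rates' (versioned temporal table keyed on currency) exist.
        tEnv.executeSql(
                "SELECT o.order_id, o.price * r.rate AS converted_price "
                        + "FROM orders AS o "
                        + "JOIN versioned_rates FOR SYSTEM_TIME AS OF o.order_time AS r "
                        + "ON o.currency = r.currency");
    }
}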


[jira] [Created] (FLINK-18775) Rework PyFlink Documentation

2020-07-30 Thread sunjincheng (Jira)
sunjincheng created FLINK-18775:
---

 Summary: Rework PyFlink Documentation
 Key: FLINK-18775
 URL: https://issues.apache.org/jira/browse/FLINK-18775
 Project: Flink
  Issue Type: Improvement
  Components: API / Python, Documentation
Affects Versions: 1.11.1, 1.11.0
Reporter: sunjincheng
 Fix For: 1.12.0, 1.11.2, 1.11.1, 1.11.0


Since the release of Flink 1.11, the number of PyFlink users has continued to 
grow. According to the feedback we received, the current Flink documentation is 
not very friendly to PyFlink users. There are two shortcomings:
 # Python-related content is mixed into the Java/Scala documentation, which makes 
it difficult for users who only focus on PyFlink to read.
 # There is already a "Python Table API" section in the Table API documentation 
that holds the PyFlink documents, but the number of articles is small and the 
content is fragmented. It is difficult for beginners to learn from it.

In addition, 
[FLIP-130|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298]
 introduced the Python DataStream API. Many documents will be added for those 
new APIs. In order to increase the readability and maintainability of the 
PyFlink document, we would like to rework it via this umbrella JIRA.





[jira] [Created] (FLINK-18776) "compile_cron_scala212" failed to compile

2020-07-30 Thread Dian Fu (Jira)
Dian Fu created FLINK-18776:
---

 Summary: "compile_cron_scala212" failed to compile
 Key: FLINK-18776
 URL: https://issues.apache.org/jira/browse/FLINK-18776
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.12.0
Reporter: Dian Fu
 Fix For: 1.12.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5060&view=logs&j=ed6509f5-1153-558c-557a-5ee0afbcdf24&t=241b1e5e-1a8e-5e6a-469a-a9b8cad87065

{code}
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce-versions) @ flink-avro-confluent-registry ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies failed with message:
Found Banned Dependency: com.typesafe:ssl-config-core_2.11:jar:0.3.7
Found Banned Dependency: com.typesafe.akka:akka-slf4j_2.11:jar:2.5.21
Found Banned Dependency: com.typesafe.akka:akka-actor_2.11:jar:2.5.21
Found Banned Dependency: org.scala-lang.modules:scala-java8-compat_2.11:jar:0.7.0
Found Banned Dependency: com.typesafe.akka:akka-protobuf_2.11:jar:2.5.21
Found Banned Dependency: org.apache.flink:flink-table-api-java-bridge_2.11:jar:1.12-SNAPSHOT
Found Banned Dependency: org.apache.flink:flink-table-runtime-blink_2.11:jar:1.12-SNAPSHOT
Found Banned Dependency: com.typesafe.akka:akka-stream_2.11:jar:2.5.21
Found Banned Dependency: com.github.scopt:scopt_2.11:jar:3.5.0
Found Banned Dependency: org.apache.flink:flink-runtime_2.11:jar:1.12-SNAPSHOT
Found Banned Dependency: org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.1.1
Found Banned Dependency: com.twitter:chill_2.11:jar:0.7.6
Found Banned Dependency: org.clapper:grizzled-slf4j_2.11:jar:1.3.2
Found Banned Dependency: org.apache.flink:flink-streaming-java_2.11:jar:1.12-SNAPSHOT
Use 'mvn dependency:tree' to locate the source of the banned dependencies.
{code}





[DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread jincheng sun
Hi folks,

Since the release of Flink 1.11, the number of PyFlink users has continued to
grow. As far as I know, many companies have used PyFlink for data analysis, and
operations and maintenance monitoring workloads have been put into production
(such as 聚美优品 [1] (Jumei) and 浙江墨芷 [2] (Mozhi)). According to the feedback
we received, the current documentation is not very friendly to PyFlink users.
There are two shortcomings:

- Python-related content is mixed into the Java/Scala documentation, which
makes it difficult for users who only focus on PyFlink to read.
- There is already a "Python Table API" section in the Table API documentation
that holds the PyFlink documents, but the number of articles is small and the
content is fragmented. It is difficult for beginners to learn from it.

In addition, FLIP-130 introduced the Python DataStream API. Many documents
will be added for those new APIs. In order to increase the readability and
maintainability of the PyFlink documentation, Wei Zhong and I have discussed
this offline and would like to rework it via this FLIP.

We will rework the documentation around the following three objectives:

- Add a separate section for the Python API under the "Application Development"
section.
- Restructure the current Python documentation into a brand new structure that
has complete content and is friendly to beginners.
- Improve the documents shared by Python/Java/Scala to make them more friendly
to Python users without affecting Java/Scala users.

More detail can be found in the FLIP-133:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation

Best,
Jincheng

[1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg
[2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g


Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread Dian Fu
Hi Jincheng,

Thanks a lot for bringing up this discussion and the proposal. +1 to improve 
the Python API doc.

I have received a lot of feedback from PyFlink beginners about the PyFlink docs, 
e.g. the materials are too few, the Python docs are mixed with the Java docs, and 
it's not easy to find the docs they want.

I think it would greatly improve the user experience if we could have one place 
that includes most of the knowledge PyFlink users need.

Regards,
Dian

> On Jul 31, 2020, at 10:14 AM, jincheng sun  wrote:
> 
> Hi folks,
> 
> Since the release of Flink 1.11, users of PyFlink have continued to grow. As 
> far as I know there are many companies have used PyFlink for data analysis, 
> operation and maintenance monitoring business has been put into 
> production(Such as 聚美优品[1](Jumei),  浙江墨芷[2] (Mozhi) etc.).  According to the 
> feedback we received, current documentation is not very friendly to PyFlink 
> users. There are two shortcomings:
> 
> - Python related content is mixed in the Java/Scala documentation, which 
> makes it difficult for users who only focus on PyFlink to read.
> - There is already a "Python Table API" section in the Table API document to 
> store PyFlink documents, but the number of articles is small and the content 
> is fragmented. It is difficult for beginners to learn from it.
> 
> In addition, FLIP-130 introduced the Python DataStream API. Many documents 
> will be added for those new APIs. In order to increase the readability and 
> maintainability of the PyFlink document, Wei Zhong and me have discussed 
> offline and would like to rework it via this FLIP.
> 
> We will rework the document around the following three objectives:
> 
> - Add a separate section for Python API under the "Application Development" 
> section.
> - Restructure current Python documentation to a brand new structure to ensure 
> complete content and friendly to beginners.
> - Improve the documents shared by Python/Java/Scala to make it more friendly 
> to Python users and without affecting Java/Scala users.
> 
> More detail can be found in the FLIP-133: 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation
>  
> 
> 
> Best,
> Jincheng
> 
> [1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg 
> 
> [2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g 
> 
> 



Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

2020-07-30 Thread Thomas Weise
I ran git bisect, and the first commit that shows the regression is:

https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90


On Thu, Jul 23, 2020 at 6:46 PM Kurt Young  wrote:

> From my experience, Java profilers are sometimes not accurate enough to
> find the root cause of a performance regression. In this case, I would
> suggest you try out Intel VTune Amplifier to look at more detailed metrics.
>
> Best,
> Kurt
>
>
> On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise  wrote:
>
> > The cause of the issue is anything but clear.
> >
> > Previously I had mentioned that there is no suspect change to the Kinesis
> > connector and that I had reverted the AWS SDK change to no effect.
> >
> > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> > regression in the previous release and is present before and after.
> >
> > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > connector to 1.10.1: Nothing changes, i.e. the regression is still
> present.
> > Therefore we will need to look elsewhere for the root cause.
> >
> > Regarding the time spent in snapshotState, repeated runs reveal a wide
> > range for both versions, 1.10 and 1.11. So again, this is nothing that
> > points to a root cause.
> >
> > At this point, I have no ideas remaining other than doing a bisect to
> find
> > the culprit. Any other suggestions?
> >
> > Thomas
> >
> >
> > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang  > .invalid>
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > Thanks for your further profiling information, and glad to see we have
> > > already narrowed down the location that causes the regression.
> > > Actually, I was also suspicious of #snapshotState in previous
> > > discussions, since it can indeed block normal operator processing for a
> > > long time.
> > >
> > > Based on your feedback below, the sleep time during #snapshotState might
> > > be the main concern, and I also dug into the implementation of
> > > FlinkKinesisProducer#snapshotState:
> > > while (producer.getOutstandingRecordsCount() > 0) {
> > >producer.flush();
> > >try {
> > >   Thread.sleep(500);
> > >} catch (InterruptedException e) {
> > >   LOG.warn("Flushing was interrupted.");
> > >   break;
> > >}
> > > }
> > > It seems that the sleep time is mainly affected by the internal
> > > operations inside the KinesisProducer implementation provided by
> > > amazonaws, which I am not quite familiar with.
> > > But I noticed there were two upgrades related to it in release-1.11.0:
> > > one upgraded amazon-kinesis-producer to 0.14.0 [1] and another upgraded
> > > the aws-sdk-version to 1.11.754 [2].
> > > You mentioned that you already reverted the SDK upgrade and saw no
> > > change. Did you also revert [1] to verify?
> > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > >
> > > Best,
> > > Zhijiang
> > > --
> > > From:Thomas Weise 
> > > Send Time:2020年7月17日(星期五) 05:29
> > > To:dev 
> > > Cc:Zhijiang ; Stephan Ewen <
> se...@apache.org
> > >;
> > > Arvid Heise ; Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> release
> > > candidate #4)
> > >
> > > Sorry for the delay.
> > >
> > > I confirmed that the regression is due to the sink (unsurprising, since
> > > another job with the same consumer, but not the producer, runs as
> > > expected).
> > >
> > > As promised, I did CPU profiling on the problematic application, which
> > > gives more insight into the regression [1].
> > >
> > > The screenshots show that the average time for snapshotState increases
> > from
> > > ~9s to ~28s. The data also shows the increase in sleep time during
> > > snapshotState.
> > >
> > > Does anyone, based on changes made in 1.11, have a theory why?
> > >
> > > I had previously looked at the changes to the Kinesis connector and
> also
> > > reverted the SDK upgrade, which did not change the situation.
> > >
> > > It will likely be necessary to drill into the sink / checkpointing
> > details
> > > to understand the cause of the problem.
> > >
> > > Let me know if anyone has specific questions that I can answer from the
> > > profiling results.
> > >
> > > Thomas
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > >
> > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise  wrote:
> > >
> > > > + dev@ for visibility
> > > >
> > > > I will investigate further today.
> > > >
> > > >
> > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek  >
> > > > wrote:
> > > >
> > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > >> >- Did sink checkpoint notifications change in a relevant way,
> for
> > > >> example
> > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > >>
> > > >

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread Xingbo Huang
Hi Jincheng,

Thanks a lot for bringing up this discussion and the proposal.

Big +1 for improving the structure of the PyFlink docs.

It will be very helpful to give PyFlink users a unified entry point for
learning the PyFlink documentation.

Best,
Xingbo

Dian Fu  wrote on Fri, Jul 31, 2020 at 11:00 AM:

> Hi Jincheng,
>
> Thanks a lot for bringing up this discussion and the proposal. +1 to
> improve the Python API doc.
>
> I have received many feedbacks from PyFlink beginners about
> the PyFlink doc, e.g. the materials are too few, the Python doc is mixed
> with the Java doc and it's not easy to find the docs he wants to know.
>
> I think it would greatly improve the user experience if we can have one
> place which includes most knowledges PyFlink users should know.
>
> Regards,
> Dian
>
> On Jul 31, 2020, at 10:14 AM, jincheng sun  wrote:
>
> Hi folks,
>
> Since the release of Flink 1.11, users of PyFlink have continued to grow.
> As far as I know there are many companies have used PyFlink for data
> analysis, operation and maintenance monitoring business has been put into
> production(Such as 聚美优品[1](Jumei),  浙江墨芷[2] (Mozhi) etc.).  According to
> the feedback we received, current documentation is not very friendly to
> PyFlink users. There are two shortcomings:
>
> - Python related content is mixed in the Java/Scala documentation, which
> makes it difficult for users who only focus on PyFlink to read.
> - There is already a "Python Table API" section in the Table API document
> to store PyFlink documents, but the number of articles is small and the
> content is fragmented. It is difficult for beginners to learn from it.
>
> In addition, FLIP-130 introduced the Python DataStream API. Many documents
> will be added for those new APIs. In order to increase the readability and
> maintainability of the PyFlink document, Wei Zhong and me have discussed
> offline and would like to rework it via this FLIP.
>
> We will rework the document around the following three objectives:
>
> - Add a separate section for Python API under the "Application
> Development" section.
> - Restructure current Python documentation to a brand new structure to
> ensure complete content and friendly to beginners.
> - Improve the documents shared by Python/Java/Scala to make it more
> friendly to Python users and without affecting Java/Scala users.
>
> More detail can be found in the FLIP-133:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation
>
> Best,
> Jincheng
>
> [1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg
> [2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g
>
>
>


[jira] [Created] (FLINK-18777) Supports schema registry catalog

2020-07-30 Thread Danny Chen (Jira)
Danny Chen created FLINK-18777:
--

 Summary: Supports schema registry catalog
 Key: FLINK-18777
 URL: https://issues.apache.org/jira/browse/FLINK-18777
 Project: Flink
  Issue Type: Bug
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.11.0
Reporter: Danny Chen
 Fix For: 1.12.0


Design doc: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-125%3A+Confluent+Schema+Registry+Catalog





[jira] [Created] (FLINK-18779) Support the SupportsFilterPushDown interface for ScanTableSource.

2020-07-30 Thread Jark Wu (Jira)
Jark Wu created FLINK-18779:
---

 Summary: Support the SupportsFilterPushDown interface for 
ScanTableSource.
 Key: FLINK-18779
 URL: https://issues.apache.org/jira/browse/FLINK-18779
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jark Wu








[jira] [Created] (FLINK-18778) Support the SupportsProjectionPushDown interface for LookupTableSource

2020-07-30 Thread Jark Wu (Jira)
Jark Wu created FLINK-18778:
---

 Summary: Support the SupportsProjectionPushDown interface for 
LookupTableSource
 Key: FLINK-18778
 URL: https://issues.apache.org/jira/browse/FLINK-18778
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jark Wu








Re: [DISCUSS] Introduce SupportsParallelismReport and SupportsStatisticsReport for Hive and Filesystem

2020-07-30 Thread Jingsong Li
Hi, thanks for your responses.

To Benchao:

Glad to see your work and requirements; they should be Public.

To Kurt:

1. Regarding "SupportsXXX" for ScanTableSource, LookupTableSource, or
DynamicTableSink: I don't think a "SupportsXXX" interface must work with all
three types. As Godfrey said, a LookupTableSource, for example, should not
extend SupportsWatermarkPushDown or SupportsComputedColumnPushDown. We just
try our best to make all combinations work; "SupportsParallelismReport", for
instance, can work with both ScanTableSource and DynamicTableSink.

About adding the "reportParallelism" method directly to ScanTableSource and
DynamicTableSink: I think most sources/sinks do not want to see this method,
so providing a "SupportsXXX" interface gives connector developers the option
to opt in. See the sketch below for what this could look like.
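
A minimal, hypothetical sketch of what such an interface could look like; the
method name and the -1 convention are assumptions based on this thread, not the
final FLIP design:

/**
 * Illustrative only: a mix-in that a ScanTableSource or DynamicTableSink could
 * implement when it wants to pick the parallelism of its runtime operators.
 */
public interface SupportsParallelismReport {

    /**
     * Returns the parallelism the connector wants, or -1 to let the
     * planner/framework decide (assumed convention).
     */
    int reportParallelism();
}

// A source could then opt in (other ScanTableSource methods omitted):
//   public class MyFileSource implements ScanTableSource, SupportsParallelismReport {
//       @Override
//       public int reportParallelism() {
//           return numberOfFileSplits; // e.g. one subtask per split
//       }
//   }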

2. Regarding SupportsStatisticsReport not working for unbounded streaming
table sources: yes, that is the case. The statistics (including catalog
statistics) are not relevant to streaming tables, but I think in the future we
can produce more useful statistics for streaming tables.

3. Regarding "oldStats" in SupportsStatisticsReport: "oldStats" should be
renamed to "catalogStats". The source just tries its best to provide more
useful and accurate statistics, but, as Godfrey said, it is a supplement to the
catalog statistics; it can only fill in missing or inaccurate information from
the catalog. A sketch of this shape follows below.
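
Again purely illustrative; the signature and the use of TableStats are
assumptions based on the discussion, not the actual proposal:

import org.apache.flink.table.plan.stats.TableStats;

/**
 * Illustrative only: a mix-in for (bounded) sources that can report statistics
 * that are more accurate than what the catalog already has.
 */
public interface SupportsStatisticsReport {

    /**
     * Given the statistics currently known from the catalog (possibly unknown),
     * returns improved statistics, e.g. derived from file sizes or footers.
     * Implementations may return the input unchanged if they cannot do better.
     */
    TableStats reportStatistics(TableStats catalogStats);
}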

4. Internal or Public: I am glad to see your requirements, and I am OK with
Public.

To Godfrey:

Regarding the case where there are multiple Transformations in the source op
and they require different parallelisms: in that case, it should be left to the
source to set the parallelism. So these should be two orthogonal things; users
who do not use multiple Transformations still need a way to set the
parallelism.

Best,
Jingsong

On Thu, Jul 30, 2020 at 8:31 PM godfrey he  wrote:

> Thanks Jingsong for bringing up this discussion,
>  and thanks Kurt for the detailed thoughts.
>
> First of all, I also think it's a very useful feature to expose more
> ability for table source.
>
> 1) If we want to support [1], it seems that SupportsParallelismReport
> does not meet the requirement if there are multiple Transformations in the
> source op and they require different parallelisms.
>
> 2) Regarding "SupportsXXX" for ScanTableSource or LookupTableSource:
> currently we also do not distinguish them for the existing "SupportsXXX"
> interfaces. For example, a LookupTableSource should not extend
> SupportsWatermarkPushDown or SupportsComputedColumnPushDown.
> A DynamicTableSource sub-class will extend a "SupportsXXX" interface only if
> it has the capability, so an unbounded table source should not extend
> SupportsStatisticsReport, or it should just return unknown for the unbounded
> case if the table source can work for both bounded and unbounded input.
>
> I think SupportsStatisticsReport is a supplement to the catalog statistics;
> that means SupportsStatisticsReport only takes effect when the catalog
> statistics are unknown.
>
> 3) +1 to make them public.
>
> [1] https://issues.apache.org/jira/browse/FLINK-18674
>
> Best,
> Godfrey
>
>
>
> Kurt Young  于2020年7月30日周四 下午4:01写道:
>
> > Hi Jingsong,
> >
> > Thanks for bringing up this discussion. In general, I'm +1 to enrich the
> > source ability by
> > the parallelism and stats reporting, but I'm not sure whether introducing
> > such "Supports"
> > interface is a good idea. I will share my thoughts separately.
> >
> > 1) Regarding the interface SupportsParallelismReport, first of all, my
> > feeling is that such a mechanism
> > is not like other abilities like SupportsProjectionPushDown. Parallelism
> of
> > source operator would be
> > decided anyway, the only difference here is whether it's decided purely
> by
> > framework or by table source
> > itself. So another angle to understand this issue is, we can always
> assume
> > a table source has the
> > ability to determine the parallelism. The table source can choose to set
> > the parallelism by itself, or delegate
> > it to the framework.
> >
> > This might sound like personal taste, but there is another bad case if we
> > introduce the interface. You
> > may already know we currently have two major table
> > sources, LookupTableSource and ScanTableSource.
> > IIUC it won't make much sense if the user provides a LookupTableSource
> and
> > also implements
> > SupportsParallelismReport.
> >
> > An alternative solution would be to add the method you want directly
> > to ScanTableSource, and also have
> > a default implementation returning -1, which means letting the framework
> > decide the parallelism.
> >
> > 2) Regarding the interface SupportsStatisticsReport, it seems this
> > interface doesn't work for unbounded
> > streaming table sources. What kind of implementation do you expect in
> such
> > a case? And how does this
> > interface work with LookupTableSource?
> > Another question is what the oldStats parameter is used for?
> >
> > 3) Internal or Public. I don't think we should mark them as internal.
> They
> > a