Hi Xiaogang, I think this proposal doesn't conflict with your use case, you can still chain a ProcessFunction after a source which emits raw data. But I'm not in favor of chaining ProcessFunction after source, and we should avoid that, because:
1) For correctness, it is necessary to perform the watermark generation as early as possible in order to be close to the actual data generation within a source's data partition. This is also the purpose of per-partition watermark and event-time alignment. Many on going FLIPs (e.g. FLIP-27, FLIP-95) works a lot on this effort. Deseriazing records and generating watermark in chained ProcessFunction makes it difficult to do per-partition watermark in the future. 2) In Flink SQL, a source should emit the deserialized row instead of raw data. Otherwise, users have to define raw byte[] as the single column of the defined table, and parse them in queries, which is very inconvenient. Best, Jark On Fri, 10 Apr 2020 at 09:18, SHI Xiaogang <shixiaoga...@gmail.com> wrote: > Hi, > > I don't think the proposal is a good solution to the problems. I am in > favour of using a ProcessFunction chained to the source/sink function to > serialize/deserialize the records, instead of embedding (de)serialization > schema in source/sink function. > > Message packing is heavily used in our production environment to allow > compression and improve throughput. As buffered messages have to be > delivered when the time exceeds the limit, timers are also required in our > cases. I think it's also a common need for other users. > > In the this proposal, with more components added into the context, in the > end we will find the serialization/deserialization schema is just another > wrapper of ProcessFunction. > > Regards, > Xiaogang > > Aljoscha Krettek <aljos...@apache.org> 于2020年4月7日周二 下午6:34写道: > > > On 07.04.20 08:45, Dawid Wysakowicz wrote: > > > > > @Jark I was aware of the implementation of SinkFunction, but it was a > > > conscious choice to not do it that way. > > > > > > Personally I am against giving a default implementation to both the new > > > and old methods. This results in an interface that by default does > > > nothing or notifies the user only in the runtime, that he/she has not > > > implemented a method of the interface, which does not sound like a good > > > practice to me. Moreover I believe the method without a Collector will > > > still be the preferred method by many users. Plus it communicates > > > explicitly what is the minimal functionality required by the interface. > > > Nevertheless I am happy to hear other opinions. > > > > Dawid and I discussed this before. I did the extension of the > > SinkFunction but by now I think it's better to do it this way, because > > otherwise you can implement the interface without implementing any > methods. > > > > > @all I also prefer the buffering approach. Let's wait a day or two more > > > to see if others think differently. > > > > I'm also in favour of buffering outside the lock. > > > > Also, +1 to this FLIP. > > > > Best, > > Aljoscha > > >