Hi Xiaogang, I very much agree with Jark's and Aljoscha's responses.
On 10/04/2020 17:35, Jark Wu wrote: > Hi Xiaogang, > > I think this proposal doesn't conflict with your use case, you can still > chain a ProcessFunction after a source which emits raw data. > But I'm not in favor of chaining ProcessFunction after source, and we > should avoid that, because: > > 1) For correctness, it is necessary to perform the watermark generation as > early as possible in order to be close to the actual data > generation within a source's data partition. This is also the purpose of > per-partition watermark and event-time alignment. > Many on going FLIPs (e.g. FLIP-27, FLIP-95) works a lot on this effort. > Deseriazing records and generating watermark in chained > ProcessFunction makes it difficult to do per-partition watermark in the > future. > 2) In Flink SQL, a source should emit the deserialized row instead of raw > data. Otherwise, users have to define raw byte[] as the > single column of the defined table, and parse them in queries, which is > very inconvenient. > > Best, > Jark > > On Fri, 10 Apr 2020 at 09:18, SHI Xiaogang <shixiaoga...@gmail.com> wrote: > >> Hi, >> >> I don't think the proposal is a good solution to the problems. I am in >> favour of using a ProcessFunction chained to the source/sink function to >> serialize/deserialize the records, instead of embedding (de)serialization >> schema in source/sink function. >> >> Message packing is heavily used in our production environment to allow >> compression and improve throughput. As buffered messages have to be >> delivered when the time exceeds the limit, timers are also required in our >> cases. I think it's also a common need for other users. >> >> In the this proposal, with more components added into the context, in the >> end we will find the serialization/deserialization schema is just another >> wrapper of ProcessFunction. >> >> Regards, >> Xiaogang >> >> Aljoscha Krettek <aljos...@apache.org> 于2020年4月7日周二 下午6:34写道: >> >>> On 07.04.20 08:45, Dawid Wysakowicz wrote: >>> >>>> @Jark I was aware of the implementation of SinkFunction, but it was a >>>> conscious choice to not do it that way. >>>> >>>> Personally I am against giving a default implementation to both the new >>>> and old methods. This results in an interface that by default does >>>> nothing or notifies the user only in the runtime, that he/she has not >>>> implemented a method of the interface, which does not sound like a good >>>> practice to me. Moreover I believe the method without a Collector will >>>> still be the preferred method by many users. Plus it communicates >>>> explicitly what is the minimal functionality required by the interface. >>>> Nevertheless I am happy to hear other opinions. >>> Dawid and I discussed this before. I did the extension of the >>> SinkFunction but by now I think it's better to do it this way, because >>> otherwise you can implement the interface without implementing any >> methods. >>>> @all I also prefer the buffering approach. Let's wait a day or two more >>>> to see if others think differently. >>> I'm also in favour of buffering outside the lock. >>> >>> Also, +1 to this FLIP. >>> >>> Best, >>> Aljoscha >>>
signature.asc
Description: OpenPGP digital signature