Hi Xiaogang,

I very much agree with Jark's and Aljoscha's responses.


On 10/04/2020 17:35, Jark Wu wrote:
> Hi Xiaogang,
>
> I think this proposal doesn't conflict with your use case, you can still
> chain a ProcessFunction after a source which emits raw data.
> But I'm not in favor of chaining ProcessFunction after source, and we
> should avoid that, because:
>
> 1) For correctness, it is necessary to perform the watermark generation as
> early as possible in order to be close to the actual data
>  generation within a source's data partition. This is also the purpose of
> per-partition watermark and event-time alignment.
>  Many on going FLIPs (e.g. FLIP-27, FLIP-95) works a lot on this effort.
> Deseriazing records and generating watermark in chained
>  ProcessFunction makes it difficult to do per-partition watermark in the
> future.
> 2) In Flink SQL, a source should emit the deserialized row instead of raw
> data. Otherwise, users have to define raw byte[] as the
>  single column of the defined table, and parse them in queries, which is
> very inconvenient.
>
> Best,
> Jark
>
> On Fri, 10 Apr 2020 at 09:18, SHI Xiaogang <shixiaoga...@gmail.com> wrote:
>
>> Hi,
>>
>> I don't think the proposal is a good solution to the problems. I am in
>> favour of using a ProcessFunction chained to the source/sink function to
>> serialize/deserialize the records, instead of embedding (de)serialization
>> schema in source/sink function.
>>
>> Message packing is heavily used in our production environment to allow
>> compression and improve throughput. As buffered messages have to be
>> delivered when the time exceeds the limit, timers are also required in our
>> cases. I think it's also a common need for other users.
>>
>> In the this proposal, with more components added into the context, in the
>> end we will find the serialization/deserialization schema is just another
>> wrapper of ProcessFunction.
>>
>> Regards,
>> Xiaogang
>>
>> Aljoscha Krettek <aljos...@apache.org> 于2020年4月7日周二 下午6:34写道:
>>
>>> On 07.04.20 08:45, Dawid Wysakowicz wrote:
>>>
>>>> @Jark I was aware of the implementation of SinkFunction, but it was a
>>>> conscious choice to not do it that way.
>>>>
>>>> Personally I am against giving a default implementation to both the new
>>>> and old methods. This results in an interface that by default does
>>>> nothing or notifies the user only in the runtime, that he/she has not
>>>> implemented a method of the interface, which does not sound like a good
>>>> practice to me. Moreover I believe the method without a Collector will
>>>> still be the preferred method by many users. Plus it communicates
>>>> explicitly what is the minimal functionality required by the interface.
>>>> Nevertheless I am happy to hear other opinions.
>>> Dawid and I discussed this before. I did the extension of the
>>> SinkFunction but by now I think it's better to do it this way, because
>>> otherwise you can implement the interface without implementing any
>> methods.
>>>> @all I also prefer the buffering approach. Let's wait a day or two more
>>>> to see if others think differently.
>>> I'm also in favour of buffering outside the lock.
>>>
>>> Also, +1 to this FLIP.
>>>
>>> Best,
>>> Aljoscha
>>>

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to