Re: [Proposal] Change to Default PubsubMessage Coder

Evan Galpin Tue, 20 Dec 2022 13:35:56 -0800

Thanks folks for pointing out RowCoder as a potential alternative, it does
sound like an overall better solution.  I wonder if exploration or RowCoder
should be done as a follow up, given that the issue is marked as P1 and
presumably migration to RowCoder will be more involved and more risky?


Otherwise, it sounds as though there isn't a high level of concern with
respect to changing the default coder, so I'll plan to move forward with
drafting a PR with that change (pending thoughts on RowCoder now Vs. later).

On Mon, Dec 19, 2022 at 4:51 PM Robert Burke <rob...@frantil.com> wrote:

> Technically, since the PubSubMessage is just a protocol buffer, I'm a bit
> surprised we aren't using it's native encoding under the hood.
>
> The Go SDK accepts either byte slices (which it then populates a
> messages's data field) or accept the proto representations which it publish
> straight out.
>
> Similarly for reads, either the data field is extracted as bytes, or the
> whole PubSubMessage.
>
> Granted part of this is just because the coder set up in the Go SDK isn't
> as maliable as Java or Python, and the types aren't recodable between
> transforms.
>
>
>
>
>
> Beam Go Busybody
> Robert Burke
>
> On Mon, Dec 19, 2022, 4:18 PM Ahmed Abualsaud via dev <dev@beam.apache.org>
> wrote:
>
>> +1 to looking into using RowCoder, this may help avoid creating more
>> specialized coders in the future (which is mentioned as a pain point in the
>> issue you linked [1]).
>>
>> [1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275
>>
>> On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I think the Dataflow update concern is the biggest concern and as you
>>> point out that can be easily overcome. I generally believe that users who
>>> aren't setting the coder don't actually care as long as it works, so
>>> changing the default across Beam versions seems relatively low
>>> risk. Another mitigating factor is that both concerns require users to
>>> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
>>> consuming the output in a fused ParDo (which I think is what we would
>>> recommend).
>>>
>>> As a counter proposal: is there something that stops us from using
>>> RowCoder by default here? I assume all forms of "PubsubMessage" can be
>>> encoded with RowCoder, it provides flexibility for future changes, and
>>> PubSub will be able to take advantage of future work to make RowCoder more
>>> efficient. (If we can't switch to RowCoder, that seems like a useful
>>> feature request for RowCoder.)
>>>
>>> Andrew
>>>
>>> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin <egal...@apache.org> wrote:
>>>
>>>> Bump 🙃
>>>>
>>>> Any other risks or drawbacks associated with altering the default coder
>>>> for PubsubMessage to be the most inclusive coder with respect to possible
>>>> fields?
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>>
>>>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin <egal...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I'd like to solicit feedback on the notion of using
>>>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>>>> default coder for Pubsub messages instead of the current default of
>>>>> PubsubMessageWithAttributesCoder.
>>>>>
>>>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>>>> including an OrderingKey was added[2].  Part of this change involved 
>>>>> adding
>>>>> a new Coder for PubsubMessage in order to capture and propagate the
>>>>> orderingKey[1].  This change illuminated that in cases where the coder 
>>>>> type
>>>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>>>> obvious to users[3].
>>>>>
>>>>> So far two potential drawbacks of this proposal have been identified:
>>>>> 1. Update compatibility for pipelines using PubsubIO might require
>>>>> users to explicitly specify the current default coder (
>>>>> PubsubMessageWithAttributesCoder)
>>>>> 2. Messages would require a larger number of bytes to store as
>>>>> compared to the current default (which could again be overcome by users
>>>>> specifying the current default coder)
>>>>>
>>>>> What other potential drawbacks might there be? I look forward to
>>>>> hearing others' input!
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>>>> [2] https://github.com/apache/beam/pull/22216
>>>>> [3] https://github.com/apache/beam/issues/23525
>>>>>
>>>>

Re: [Proposal] Change to Default PubsubMessage Coder

Reply via email to