Re: [Proposal] Change to Default PubsubMessage Coder

Robert Burke Mon, 19 Dec 2022 16:51:07 -0800

Technically, since the PubSubMessage is just a protocol buffer, I'm a bit
surprised we aren't using it's native encoding under the hood.


The Go SDK accepts either byte slices (which it then populates a messages's
data field) or accept the proto representations which it publish straight
out.

Similarly for reads, either the data field is extracted as bytes, or the
whole PubSubMessage.

Granted part of this is just because the coder set up in the Go SDK isn't
as maliable as Java or Python, and the types aren't recodable between
transforms.





Beam Go Busybody
Robert Burke

On Mon, Dec 19, 2022, 4:18 PM Ahmed Abualsaud via dev <dev@beam.apache.org>
wrote:

> +1 to looking into using RowCoder, this may help avoid creating more
> specialized coders in the future (which is mentioned as a pain point in the
> issue you linked [1]).
>
> [1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275
>
> On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev <
> dev@beam.apache.org> wrote:
>
>> I think the Dataflow update concern is the biggest concern and as you
>> point out that can be easily overcome. I generally believe that users who
>> aren't setting the coder don't actually care as long as it works, so
>> changing the default across Beam versions seems relatively low
>> risk. Another mitigating factor is that both concerns require users to
>> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
>> consuming the output in a fused ParDo (which I think is what we would
>> recommend).
>>
>> As a counter proposal: is there something that stops us from using
>> RowCoder by default here? I assume all forms of "PubsubMessage" can be
>> encoded with RowCoder, it provides flexibility for future changes, and
>> PubSub will be able to take advantage of future work to make RowCoder more
>> efficient. (If we can't switch to RowCoder, that seems like a useful
>> feature request for RowCoder.)
>>
>> Andrew
>>
>> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin <egal...@apache.org> wrote:
>>
>>> Bump 🙃
>>>
>>> Any other risks or drawbacks associated with altering the default coder
>>> for PubsubMessage to be the most inclusive coder with respect to possible
>>> fields?
>>>
>>> Thanks,
>>> Evan
>>>
>>>
>>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin <egal...@apache.org> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I'd like to solicit feedback on the notion of using
>>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>>> default coder for Pubsub messages instead of the current default of
>>>> PubsubMessageWithAttributesCoder.
>>>>
>>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>>> including an OrderingKey was added[2].  Part of this change involved adding
>>>> a new Coder for PubsubMessage in order to capture and propagate the
>>>> orderingKey[1].  This change illuminated that in cases where the coder type
>>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>>> obvious to users[3].
>>>>
>>>> So far two potential drawbacks of this proposal have been identified:
>>>> 1. Update compatibility for pipelines using PubsubIO might require
>>>> users to explicitly specify the current default coder (
>>>> PubsubMessageWithAttributesCoder)
>>>> 2. Messages would require a larger number of bytes to store as compared
>>>> to the current default (which could again be overcome by users specifying
>>>> the current default coder)
>>>>
>>>> What other potential drawbacks might there be? I look forward to
>>>> hearing others' input!
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>>> [2] https://github.com/apache/beam/pull/22216
>>>> [3] https://github.com/apache/beam/issues/23525
>>>>
>>>

Re: [Proposal] Change to Default PubsubMessage Coder

Reply via email to