Thanks folks for pointing out RowCoder as a potential alternative, it does sound like an overall better solution. I wonder if exploration or RowCoder should be done as a follow up, given that the issue is marked as P1 and presumably migration to RowCoder will be more involved and more risky?
Otherwise, it sounds as though there isn't a high level of concern with respect to changing the default coder, so I'll plan to move forward with drafting a PR with that change (pending thoughts on RowCoder now Vs. later). On Mon, Dec 19, 2022 at 4:51 PM Robert Burke <rob...@frantil.com> wrote: > Technically, since the PubSubMessage is just a protocol buffer, I'm a bit > surprised we aren't using it's native encoding under the hood. > > The Go SDK accepts either byte slices (which it then populates a > messages's data field) or accept the proto representations which it publish > straight out. > > Similarly for reads, either the data field is extracted as bytes, or the > whole PubSubMessage. > > Granted part of this is just because the coder set up in the Go SDK isn't > as maliable as Java or Python, and the types aren't recodable between > transforms. > > > > > > Beam Go Busybody > Robert Burke > > On Mon, Dec 19, 2022, 4:18 PM Ahmed Abualsaud via dev <dev@beam.apache.org> > wrote: > >> +1 to looking into using RowCoder, this may help avoid creating more >> specialized coders in the future (which is mentioned as a pain point in the >> issue you linked [1]). >> >> [1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275 >> >> On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev < >> dev@beam.apache.org> wrote: >> >>> I think the Dataflow update concern is the biggest concern and as you >>> point out that can be easily overcome. I generally believe that users who >>> aren't setting the coder don't actually care as long as it works, so >>> changing the default across Beam versions seems relatively low >>> risk. Another mitigating factor is that both concerns require users to >>> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than >>> consuming the output in a fused ParDo (which I think is what we would >>> recommend). >>> >>> As a counter proposal: is there something that stops us from using >>> RowCoder by default here? I assume all forms of "PubsubMessage" can be >>> encoded with RowCoder, it provides flexibility for future changes, and >>> PubSub will be able to take advantage of future work to make RowCoder more >>> efficient. (If we can't switch to RowCoder, that seems like a useful >>> feature request for RowCoder.) >>> >>> Andrew >>> >>> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin <egal...@apache.org> wrote: >>> >>>> Bump 🙃 >>>> >>>> Any other risks or drawbacks associated with altering the default coder >>>> for PubsubMessage to be the most inclusive coder with respect to possible >>>> fields? >>>> >>>> Thanks, >>>> Evan >>>> >>>> >>>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin <egal...@apache.org> >>>> wrote: >>>> >>>>> Hi folks, >>>>> >>>>> I'd like to solicit feedback on the notion of using >>>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the >>>>> default coder for Pubsub messages instead of the current default of >>>>> PubsubMessageWithAttributesCoder. >>>>> >>>>> Not long ago, support for reading and writing Pubsub messages in Beam >>>>> including an OrderingKey was added[2]. Part of this change involved >>>>> adding >>>>> a new Coder for PubsubMessage in order to capture and propagate the >>>>> orderingKey[1]. This change illuminated that in cases where the coder >>>>> type >>>>> for PubsubMessage is inferred, it is possible to accidentally and silently >>>>> nullify fields like MessageId and OrderingKey in a way that is not at all >>>>> obvious to users[3]. >>>>> >>>>> So far two potential drawbacks of this proposal have been identified: >>>>> 1. Update compatibility for pipelines using PubsubIO might require >>>>> users to explicitly specify the current default coder ( >>>>> PubsubMessageWithAttributesCoder) >>>>> 2. Messages would require a larger number of bytes to store as >>>>> compared to the current default (which could again be overcome by users >>>>> specifying the current default coder) >>>>> >>>>> What other potential drawbacks might there be? I look forward to >>>>> hearing others' input! >>>>> >>>>> Thanks, >>>>> Evan >>>>> >>>>> [1] >>>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd >>>>> [2] https://github.com/apache/beam/pull/22216 >>>>> [3] https://github.com/apache/beam/issues/23525 >>>>> >>>>