> Meaning BSON I presume? What do you mean by "tuple representation"?
> (One downside of JSON is that the field names are redundantly stored
> in each record, so even if you save on CPU it may hurt on the network
> due to the greater data sizes).
Yes, I meant BSON. A tuple (or array) representation serializes a record as an array of all its field values, so the field names are not stored in the serialized result.

> Sounds like there's a lot of room for improvement! One downside of
> Rows is that they can't (IIRC) store (and encode/decode) unboxed
> representations of their primitive field types. This alone would be
> good to solve, but as mentioned you could probably also skip a Row
> intermediate altogether for encoding/decoding.

If Row were an interface, you could generate a POJO at runtime from a Schema and have it implement that interface. I'm not sure that improves anything for serialization, since you'd still use some function with a field index parameter to retrieve values from the Row instance, but the deserialized instance might take up less space in memory.

Mapping out the RowCoderGenerator result into a specialized Coder for the POJO I was benchmarking improved serialization throughput by ~2.2x and deserialization throughput by ~1.6x.
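To make the tuple representation mentioned above concrete, here is a minimal sketch in plain Java, with no Beam or BSON dependency. Everything in it is an assumption made for illustration: the class name, the three-field schema (id, name, score) and the method names are hypothetical, and `DataOutputStream` stands in for whatever wire format is actually used. It only shows the size difference between repeating field names per record and writing values in a fixed schema order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Beam code: compares a "names in every record"
// encoding with a tuple (array) encoding for one fixed, made-up schema.
final class TupleEncodingSketch {
    // Assumed schema; writer and reader agree on this field order out of band.
    static final String[] FIELD_NAMES = {"id", "name", "score"};

    /** Document style: each record carries its own field names. */
    static byte[] encodeWithNames(long id, String name, double score) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(FIELD_NAMES[0]);
            out.writeLong(id);
            out.writeUTF(FIELD_NAMES[1]);
            out.writeUTF(name);
            out.writeUTF(FIELD_NAMES[2]);
            out.writeDouble(score);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Tuple style: only the values, written in schema order. */
    static byte[] encodeAsTuple(long id, String name, double score) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(id);
            out.writeUTF(name);
            out.writeDouble(score);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the values back by position; the schema supplies the meaning. */
    static Object[] decodeTuple(byte[] encoded) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            return new Object[] {in.readLong(), in.readUTF(), in.readDouble()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] named = encodeWithNames(42L, "ada", 9.5);
        byte[] tuple = encodeAsTuple(42L, "ada", 9.5);
        System.out.println("with names: " + named.length + " bytes");
        System.out.println("as tuple:   " + tuple.length + " bytes");
    }
}
```

For this toy record the named form is 38 bytes and the tuple form is 21, so the field names account for 17 bytes per record. The trade-off is that the tuple form can only be decoded by a reader that already knows the schema and its field order.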