On Fri, Dec 1, 2023 at 9:13 AM Steven van Rossum via dev
<dev@beam.apache.org> wrote:
>
> Hi all,
>
> I was benchmarking the fastjson2 serialization library a few weeks back for a 
> Java pipeline I was working on and was asked by a colleague to benchmark 
> binary JSON serialization against Rows for fun. We didn't do any extensive 
> analysis across different shapes and sizes, but the finding on this workload 
> was that serialization to binary JSON (tuple representation)

Meaning BSON I presume? What do you mean by "tuple representation"?
(One downside of JSON is that the field names are stored redundantly
in each record, so even if you save on CPU you may pay for it on the
network with larger encoded sizes.)
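
If by "tuple representation" you mean values written positionally in
schema order rather than keyed by name, a toy comparison of the two
shapes (the record and field names below are made up for illustration):

  // Illustrative only: keyed JSON repeats field names in every record,
  // while a positional ("tuple") encoding relies on schema order.
  public class JsonShapeExample {
    public static void main(String[] args) {
      String keyed = "{\"id\":42,\"name\":\"alice\",\"active\":true}";
      String tuple = "[42,\"alice\",true]";
      System.out.println(keyed.length() + " vs " + tuple.length() + " chars");
    }
  }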

> outperformed the SchemaCoder on throughput by ~11x on serialization and ~5x 
> on deserialization. Additionally, RowCoder outperformed SchemaCoder on 
> throughput by ~1.3x on serialization and ~1.7x on deserialization. Note that 
> all benchmarks measured throughput in the millions of ops/sec for this quick 
> test, so this is already excellent performance obviously.

Sounds like there's a lot of room for improvement! One downside of
Rows is that they can't (IIRC) store (and encode/decode) unboxed
representations of their primitive field types. This alone would be
good to solve, but as mentioned you could probably also skip a Row
intermediate altogether for encoding/decoding.
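
To make the boxing/intermediate point concrete, here's a rough sketch
of the two-hop path as I understand it today (the Event POJO and its
fields are hypothetical, just to show where the allocations happen):

  import java.io.ByteArrayOutputStream;
  import org.apache.beam.sdk.coders.RowCoder;
  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.values.Row;

  public class TwoHopSketch {
    // Hypothetical user type.
    static class Event {
      long id;
      String name;
    }

    public static void main(String[] args) throws Exception {
      Schema schema =
          Schema.builder().addInt64Field("id").addStringField("name").build();

      Event e = new Event();
      e.id = 42L;
      e.name = "alice";

      // Hop 1: POJO -> Row, which boxes the primitive long into a Long
      // and allocates an intermediate Row just to feed the coder.
      Row row = Row.withSchema(schema).addValues(e.id, e.name).build();

      // Hop 2: Row -> bytes via RowCoder.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      RowCoder.of(schema).encode(row, out);
      System.out.println(out.size() + " bytes");
    }
  }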

> I'm sure there's stuff to learn from other serialization libraries, but I'd 
> table that for now. The low-hanging-fruit improvement would be to skip that 
> intermediate hop to/from Row and instead generate custom SchemaCoders to 
> serialize directly into or deserialize from the Row format.
> I'd be happy to pick this up at some point in the new year, but would just 
> like to get some thoughts from this group.

+1, this'd be a great addition. I think there was some investigation
into using ByteBuddy to auto-generate this kind of thing, but I don't
know how extensive it is.
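
For a sense of what the generated output might boil down to for a
single type, here's a hand-written sketch. It deliberately does not
reproduce the real Row wire format (which a generated coder would have
to match byte-for-byte to stay interchangeable with RowCoder), and the
Event type is the same hypothetical POJO as above:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.beam.sdk.coders.CustomCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarLongCoder;

  // Sketch of a per-type coder: component coders are resolved once and
  // values go straight to the stream, with no intermediate Row allocation.
  public class DirectEventCoder extends CustomCoder<DirectEventCoder.Event> {
    public static class Event {
      long id;
      String name;
    }

    private static final VarLongCoder ID = VarLongCoder.of();
    private static final StringUtf8Coder NAME = StringUtf8Coder.of();

    @Override
    public void encode(Event value, OutputStream out) throws IOException {
      ID.encode(value.id, out); // still boxes the long; writing the varint
                                // directly would avoid even that
      NAME.encode(value.name, out);
    }

    @Override
    public Event decode(InputStream in) throws IOException {
      Event e = new Event();
      e.id = ID.decode(in);
      e.name = NAME.decode(in);
      return e;
    }
  }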
