Out of curiosity, did you add a warmup period before benchmarking? The schema and row coders do codegen, so the first use is very slow, but subsequent uses should be much faster. I recommend running any test for a warmup period before starting to measure.
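
To illustrate what I mean by a warmup period, here is a rough sketch in plain Java. It is not taken from your benchmark: the class name, the `measureNanosPerOp` helper, and the string-encoding workload are all hypothetical stand-ins, and you would swap in your actual coder encode/decode call. A harness such as JMH handles warmup (and many other pitfalls) more rigorously, so treat this only as a picture of the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

public class WarmupBenchmark {
    // Results are written here so the JIT cannot discard the work as dead code.
    static volatile Object sink;

    /**
     * Runs op for warmupIters unmeasured iterations (so one-time costs such as
     * code generation and JIT compilation are paid up front), then returns the
     * average nanoseconds per call over measuredIters timed iterations.
     */
    static double measureNanosPerOp(Supplier<?> op, int warmupIters, int measuredIters) {
        for (int i = 0; i < warmupIters; i++) {
            sink = op.get();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredIters; i++) {
            sink = op.get();
        }
        return (System.nanoTime() - start) / (double) measuredIters;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload; replace with the coder call under test.
        Supplier<byte[]> encode = () -> "example-row".getBytes(StandardCharsets.UTF_8);
        double nanosPerOp = measureNanosPerOp(encode, 100_000, 1_000_000);
        System.out.printf("%.1f ns/op after warmup%n", nanosPerOp);
    }
}
```

If the numbers change noticeably when you raise the warmup count, the original measurement was probably still including one-time setup cost.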

On Fri, Dec 1, 2023, 9:13 AM Steven van Rossum via dev <dev@beam.apache.org> wrote:

> Hi all,
>
> I was benchmarking the fastjson2 serialization library a few weeks back
> for a Java pipeline I was working on and was asked by a colleague to
> benchmark binary JSON serialization against Rows for fun. We didn't do any
> extensive analysis across different shapes and sizes, but the finding on
> this workload was that serialization to binary JSON (tuple representation)
> outperformed the SchemaCoder on throughput by ~11x on serialization and ~5x
> on deserialization. Additionally, RowCoder outperformed SchemaCoder on
> throughput by ~1.3x on serialization and ~1.7x on deserialization. Note
> that all benchmarks measured in the millions of ops/sec for this quick
> test, so this is already excellent performance obviously.
>
> I'm sure there's stuff to learn from other serialization libraries, but
> I'd table that for now. The low hanging fruit improvement would be to skip
> that intermediate hop to/from Row and instead generate custom SchemaCoders
> to serialize directly into or deserialize from the Row format.
> I'd be happy to pick this up at some point in the new year, but would just
> like to get some thoughts from this group.
>
> Regards,
>
> Steve