Hi Arrow and Avro devs,

I've been investigating some performance issues with the BigQuery Storage API (https://github.com/googleapis/google-cloud-python/issues/7805) and have identified that the vast majority of time is spent decoding Avro into pandas DataFrames. I've done some initial experiments with hand-written parsers (inspired by https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a dramatic improvement in time spent parsing.
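To make the code-generation idea concrete, here's a minimal sketch of the technique in plain Python: it compiles a record schema into a specialized reader so the per-row hot loop does no schema interpretation. It handles only "long" and "double" fields and skips the Numba step, and all names in it are hypothetical illustrations, not the actual experimental code.

import struct


def _read_long(buf, pos):
    # Avro longs are zigzag-encoded varints: the low bit carries the
    # sign, the rest the magnitude, 7 payload bits per byte.
    shift = 0
    accum = 0
    while True:
        b = buf[pos]
        pos += 1
        accum |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1), pos


def make_reader(schema):
    # Generate Python source specialized to this record schema, so the
    # generated function decodes fields in a fixed, unrolled order.
    lines = ["def read_record(buf, pos):", "    row = {}"]
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if ftype == "long":
            lines.append(f"    row[{name!r}], pos = _read_long(buf, pos)")
        elif ftype == "double":
            # Avro doubles are 8 bytes, little-endian IEEE 754.
            lines.append(
                f"    row[{name!r}] = struct.unpack_from('<d', buf, pos)[0]; pos += 8"
            )
        else:
            raise NotImplementedError(ftype)
    lines.append("    return row, pos")
    namespace = {"_read_long": _read_long, "struct": struct}
    exec("\n".join(lines), namespace)
    return namespace["read_record"]


schema = {"fields": [{"name": "id", "type": "long"},
                     {"name": "score", "type": "double"}]}
read_record = make_reader(schema)
# b"\x02" is the zigzag varint for 1; struct.pack supplies 2.5 as a double.
print(read_record(b"\x02" + struct.pack("<d", 2.5), 0))
# -> ({'id': 1, 'score': 2.5}, 9)

In the actual experiment the generated source could additionally be JIT-compiled with Numba; that step is omitted here to keep the sketch dependency-free.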
I'm considering releasing this as a separate package for the following reasons:

- Code generation + Numba is still a fairly unproven technique for parsers, so I'd like to treat this as an experiment rather than "the" package to use for parsing Avro from Python.
- I don't need to handle the full Avro spec for this experiment. Importantly, the BQ Storage API only needs a schemaless reader (since the schema is sent only once and omitted from subsequent protobuf messages) and doesn't use any compression.

That said, I'm open to contributing this to either pyarrow or avro if there's interest. If the answer is "no" (as I suspect it is) and I don't contribute it now, the package will be clearly identified as a fork of the Apache Avro project and licensed Apache 2.0, so it should be easy to pull in once the techniques are proven.

Tim Swast
Software Friendliness Engineer
Google Cloud Developer Relations
Seattle, WA, USA