Hi Arrow and Avro devs,

I've been investigating some performance issues with the BigQuery Storage
API (https://github.com/googleapis/google-cloud-python/issues/7805), and
have identified that the vast majority of time is spent decoding Avro into
pandas dataframes.
I've done some initial experiments with hand-written parsers (inspired by
https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a
dramatic reduction in time spent parsing.
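
To give a rough idea of the technique, here is a minimal sketch of the
schema-driven code generation, assuming a simplified record schema with
only long and string fields. All of the names (compile_parser,
read_long, read_string) are illustrative rather than from any existing
library, and the actual experiment layers Numba on top of the generated
source, which the sketch omits for clarity:

    def read_long(buf, pos):
        # Avro longs are zig-zag encoded varints.
        result, shift = 0, 0
        while True:
            b = buf[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                break
            shift += 7
        return (result >> 1) ^ -(result & 1), pos

    def read_string(buf, pos):
        # Avro strings are a long length prefix followed by UTF-8 bytes.
        n, pos = read_long(buf, pos)
        return buf[pos:pos + n].decode("utf-8"), pos

    READERS = {"long": "read_long", "int": "read_long",
               "string": "read_string"}

    def compile_parser(schema):
        # Generate Python source specialized to this schema, so the
        # hot loop has no per-field type dispatch.
        lines = ["def parse_record(buf, pos):", "    rec = {}"]
        for field in schema["fields"]:
            reader = READERS[field["type"]]
            lines.append("    rec[%r], pos = %s(buf, pos)"
                         % (field["name"], reader))
        lines.append("    return rec, pos")
        env = {"read_long": read_long, "read_string": read_string}
        exec("\n".join(lines), env)
        return env["parse_record"]

For a schema like {"fields": [{"name": "id", "type": "long"},
{"name": "word", "type": "string"}]}, this emits a flat function with
one read call per field, which is where the fast-avro post suggests
most of the win comes from.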

I'm considering releasing this as a separate package for the following
reasons:

   - Code generation + Numba is a largely unproven technique for parsers,
   so I'd like to treat this as an experiment rather than "the" package
   for parsing Avro from Python.
   - I don't need to handle the full Avro spec for this experiment.
   Importantly, the BQ Storage API only uses a schemaless reader (since
   the schema is output only once and omitted from subsequent protobuf
   messages) and doesn't use any compression; see the sketch after this
   list.
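
Concretely, the decode loop for one block of rows then just walks the
buffer until it is exhausted. Again a sketch, assuming parse_record is
the generated function from above and buf holds nothing but
back-to-back record bodies:

    def parse_block(buf, parse_record):
        # Schemaless read: no container header, sync markers, or
        # codec to handle, just concatenated record bodies.
        rows = []
        pos = 0
        end = len(buf)
        while pos < end:
            rec, pos = parse_record(buf, pos)
            rows.append(rec)
        return rows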

That said, I'm open to contributing this to either pyarrow or avro if
there's interest.

If the answer is "no" (as I suspect it is) and I don't contribute it now,
the package will be clearly identified as a fork of the Apache Avro project
and licensed Apache 2.0, so it should be easy to pull in once the
techniques are proven.

  •  Tim Swast
  •  Software Friendliness Engineer
  •  Google Cloud Developer Relations
  •  Seattle, WA, USA
