Hi Tim,

I'd ideally like to see the work done in the Arrow C++ library so that it can be utilized by all the C++ "binders" (Python, R, C, Ruby, MATLAB). This also means a larger labor pool of individuals to help improve and maintain the software.

There was a stalled PR around this a while back (check out the Arrow closed PR queue) that got stuck on some limitations in avro-c. It might be more expedient to fork parts of Apache Avro and do all the development inside a single codebase.
There's a lot of folks that can provide feedback should you choose to go down this route.

Thanks
Wes

On Tue, Jun 11, 2019, 4:53 PM Tim Swast <sw...@google.com.invalid> wrote:

> Hi Arrow and Avro devs,
>
> I've been investigating some performance issues with the BigQuery Storage
> API (https://github.com/googleapis/google-cloud-python/issues/7805), and
> have identified that the vast majority of time is spent decoding Avro into
> pandas dataframes.
>
> I've done some initial experiments by hand written parsers (inspired by
> https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a
> dramatic improvement in time spent parsing.
>
> I'm considering releasing this as a separate package for the following
> reasons:
>
> - Code generation + Numba is a bit of an unproven technique for parsers,
>   so I'd like to treat this as an experiment rather than "the" package to
>   use to parse Avro from Python.
> - I don't need to handle the full Avro spec for this experiment.
>   Importantly, BQ Storage API only uses a schemaless reader (since the
>   schema is output only once, and omitted for subsequent protobuf messages)
>   and doesn't use any compression.
>
> That said, I'm open to contributing this to either pyarrow or avro if
> there's interest.
>
> If the answer is "no" (as I suspect it is) and I don't contribute it now,
> the package will be clearly identified as a fork of the Apache Avro project
> and licensed Apache 2.0, so it should be easy to pull in once the
> techniques are proven.
>
> Tim Swast
> Software Friendliness Engineer
> Google Cloud Developer Relations
> Seattle, WA, USA