Hi Tim,

I'd ideally like to see the work done in the Arrow C++ library so that it
can be utilized by all the C++ "binders" (Python, R, C, Ruby, MATLAB). This
also means a larger labor pool of individuals to help improve and maintain
the software. There was a stalled PR around this a while back (check out the
Arrow Closed PR queue) that got stuck on some limitations in avro-c. It
might be more expedient to fork parts of Apache Avro and do all the
development inside a single codebase.

There are a lot of folks who can provide feedback should you choose to go
down this route.

Thanks
Wes

On Tue, Jun 11, 2019, 4:53 PM Tim Swast <sw...@google.com.invalid> wrote:

> Hi Arrow and Avro devs,
>
> I've been investigating some performance issues with the BigQuery Storage
> API (https://github.com/googleapis/google-cloud-python/issues/7805), and
> have identified that the vast majority of time is spent decoding Avro into
> pandas dataframes.
> I've done some initial experiments with hand-written parsers (inspired by
> https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a
> dramatic improvement in time spent parsing.
>
> I'm considering releasing this as a separate package for the following
> reasons:
>
>    - Code generation + Numba is a bit of an unproven technique for parsers,
>    so I'd like to treat this as an experiment rather than "the" package to
>    use to parse Avro from Python.
>    - I don't need to handle the full Avro spec for this experiment.
>    Importantly, BQ Storage API only uses a schemaless reader (since the
>    schema is output only once, and omitted for subsequent protobuf messages)
>    and doesn't use any compression.
>
> That said, I'm open to contributing this to either pyarrow or avro if
> there's interest.
>
> If the answer is "no" (as I suspect it is) and I don't contribute it now,
> the package will be clearly identified as a fork of the Apache Avro project
> and licensed Apache 2.0, so it should be easy to pull in once the
> techniques are proven.
>
> Tim Swast
> Software Friendliness Engineer
> Google Cloud Developer Relations
> Seattle, WA, USA
>
