hi folks,

(cross-posting to dev@avro and dev@arrow -- please subscribe to both
mailing lists to participate in the thread)

In the Apache Arrow community, we are striving to develop optimized,
batch-oriented C++ interfaces to read and write various open standard
file formats, such as

- Parquet
- ORC
- CSV
- Line-delimited JSON
- Avro

The Arrow C++ codebase has been co-developed with the Parquet C++
codebase by many of the same individuals, so that's our most mature
implementation, but we also have ORC, CSV, and JSON now in various
states of maturity and performance.

Since Arrow is a columnar format, the intention is to work with a
batch of records at a time, such as 64K records -- efficient
deserialization into a columnar batch requires a design approach
that general purpose libraries cannot always easily accommodate.
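To make the design point concrete, here is a minimal, self-contained
sketch (not Arrow's actual API -- all names here are illustrative) of
what "deserializing into a columnar batch" means: records are decoded
straight into per-column vectors ("struct of arrays") rather than into
a vector of per-record structs, so the decoder allocates per column
instead of per record.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration (not Arrow's actual API): a columnar batch
// stores each field as its own contiguous vector.
struct ColumnBatch {
  std::vector<int64_t> ids;        // column 0
  std::vector<std::string> names;  // column 1
};

// Decode `n` records from a simulated record stream straight into
// columns. A general-purpose, record-at-a-time library would instead
// hand back one record object per row.
inline ColumnBatch DecodeBatch(std::size_t n) {
  ColumnBatch batch;
  batch.ids.reserve(n);    // one reservation per column, not per record
  batch.names.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    batch.ids.push_back(static_cast<int64_t>(i));
    batch.names.push_back("record-" + std::to_string(i));
  }
  return batch;
}
```

This layout is why a purpose-built decoder can outperform a wrapper
around a row-oriented library: the per-record object construction and
destruction simply never happens.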

There is interest in working on Avro support, and so we (primarily
Micah Kornfield, though I've been eyeing the project myself for some
time) have been investigating approaches that are pragmatic and likely
to yield good results. Some options to consider:

* A new designed-for-Arrow Avro implementation in C++
* Using avro-c as a library and contributing patches upstream
* Using avro-c++ as a library and contributing patches upstream
* Forking avro-c or avro-c++ and modifying at will for use in Apache Arrow

The intended users of this software are not only C++ developers but
also users of the languages that bind the C++ libraries, including
Python, R, Ruby, and MATLAB. So this software is of high importance to
very large
programmer communities -- currently the quality (in terms of
performance or usability) of Avro software in these languages is
relatively poor (consider, for instance, that there are no fewer than
4 Avro libraries for Python -- avro, fastavro, uavro, and cyavro).

Our current inclination is that forking avro-c++ into the Arrow
codebase is the preferred approach for a number of reasons:

* We are already using C++11, and so using C++ as a starting point is
preferable to C
* Decoupling from Apache Avro release cycles: Arrow is about to have
its 14th major release in a little over 3 years -- our release cadence
is approximately every 2 to 3 months. It also spares us having to
manage Avro as a third party build dependency
* Freedom to refactor serialization and deserialization paths to
feature Arrow-specific optimizations and batch-centric APIs
* Desire to remove Avro-specific memory management and IO interfaces
and use common Arrow ones (also used in Parquet C++ and the
Arrow-centric CSV and JSON libraries)
* Interest in developing Arrow-centric LLVM code generation for
optimized decoding of records
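To illustrate what a "batch-centric API" of the kind described above
might look like, here is a toy sketch. Every name in it is
hypothetical -- this is not the real Arrow or Avro C++ API -- but the
shape is the point: each call yields one columnar batch of up to
batch_size records rather than a single record.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical API sketch (names illustrative, not real Arrow/Avro):
// a single-column batch, standing in for a full columnar record batch.
struct Int64Batch {
  std::vector<int64_t> values;
};

class BatchReader {
 public:
  // num_records stands in for the length of an Avro input stream.
  BatchReader(int64_t num_records, std::size_t batch_size)
      : remaining_(num_records), batch_size_(batch_size) {}

  // Returns the next batch, or std::nullopt once input is exhausted.
  std::optional<Int64Batch> Next() {
    if (remaining_ == 0) return std::nullopt;
    const std::size_t n = static_cast<std::size_t>(std::min<int64_t>(
        remaining_, static_cast<int64_t>(batch_size_)));
    Int64Batch batch;
    batch.values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
      batch.values.push_back(next_value_++);  // simulated decoded value
    }
    remaining_ -= static_cast<int64_t>(n);
    return batch;
  }

 private:
  int64_t remaining_;
  std::size_t batch_size_;
  int64_t next_value_ = 0;
};
```

An API with this pull-based, batch-at-a-time shape amortizes
per-record overhead (virtual dispatch, branching on schema) over tens
of thousands of records per call, which is what makes the
Arrow-specific optimizations mentioned above worthwhile.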

We understand that forking a codebase is not a decision that should be
undertaken flippantly and so we'd like to collect feedback from the
Avro community and the C++ developers in particular about this
project, which is currently at the "codebase import stage" [1].

To head off one possible question, I do not think that developing
Arrow specializations _inside_ apache/avro is a desirable option, as
it would introduce a circular dependency between codebases: we wish to
develop bindings for Avro+Arrow in Python, R, Ruby, etc. (these are
found in apache/arrow). We did this for more than 2 years with Parquet
in apache/parquet-cpp and the development process (CI, testing,
packaging) was deeply unpleasant for Arrow and Parquet alike.

Thank you,
Wes

[1]: https://github.com/apache/arrow/pull/4585
