hi folks, (cross-posting to dev@avro and dev@arrow -- please subscribe to both mailing lists to participate in the thread)
In the Apache Arrow community, we are striving to develop optimized, batch-oriented C++ interfaces to read and write various open standard file formats, such as:

- Parquet
- ORC
- CSV
- Line-delimited JSON
- Avro

The Arrow C++ codebase has been co-developed with the Parquet C++ codebase by many of the same individuals, so that's our most mature implementation, but we also now have ORC, CSV, and JSON in various states of maturity and performance. Since Arrow is a columnar format, the intention is to work with a batch of records at a time, such as 64K records or so; efficient deserialization into a columnar batch requires a certain design approach that general-purpose libraries cannot always easily accommodate.

There is interest in working on Avro support, so we've been investigating approaches to the project that are pragmatic and likely to yield good results (primarily Micah Kornfield, though I've been eyeing the project myself for some time). Some options to consider:

* A new designed-for-Arrow Avro implementation in C++
* Using avro-c as a library and contributing patches upstream
* Using avro-c++ as a library and contributing patches upstream
* Forking avro-c or avro-c++ and modifying it at will for use in Apache Arrow

The intended users for this software are not only C++ developers but also users of languages that bind the C++ libraries, including Python, R, Ruby, and MATLAB. So this software is of high importance to very large programmer communities; currently the quality (in terms of performance or usability) of Avro software in these languages is relatively poor (consider, for instance, that there are no fewer than 4 Avro libraries for Python: avro, fastavro, uavro, and cyavro).
Our current inclination is that forking avro-c++ into the Arrow codebase is the preferred approach, for a number of reasons:

* We are already using C++11, so C++ is a preferable starting point to C
* Decoupling from Apache Avro release cycles: Arrow is about to have its 14th major release in a little over 3 years; our release cadence is approximately every 2 to 3 months. It also spares us having to manage Avro as a third-party build dependency
* Freedom to refactor the serialization and deserialization paths to feature Arrow-specific optimizations and batch-centric APIs
* Desire to remove Avro-specific memory management and IO interfaces and use the common Arrow ones (also used in Parquet C++ and the Arrow-centric CSV and JSON libraries)
* Interest in developing Arrow-centric LLVM code generation for optimized decoding of records

We understand that forking a codebase is not a decision that should be undertaken flippantly, so we'd like to collect feedback from the Avro community, and its C++ developers in particular, about this project, which is currently at the "codebase import" stage [1].

To head off one possible question: I do not think that developing Arrow specializations _inside_ apache/avro is a desirable option, as it would introduce a circular dependency between codebases; we wish to develop bindings for Avro+Arrow in Python, R, Ruby, etc., and these are found in apache/arrow. We did this for more than 2 years with Parquet in apache/parquet-cpp, and the development process (CI, testing, packaging) was deeply unpleasant for Arrow and Parquet alike.

Thank you,
Wes

[1]: https://github.com/apache/arrow/pull/4585
