I'd be +0.5 in favor of forking in this particular case. Since Avro is
not vectorized (unlike Parquet and ORC), I suspect it may be more
difficult to get the best performance using a general-purpose API
than with one that is specialized for producing Arrow record batches.
Given that there has been relatively light C++ development activity in
Apache Avro and no releases for 2 years, it does give me pause.

We might want to look at Impala's Avro scanner; they are also doing some
LLVM IR cross-compilation (though they're using the Avro C++ library):

https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner-ir.cc
https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner.cc

On Tue, Mar 5, 2019 at 1:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I'm looking at incorporating Avro in Arrow C++ [1]. It seems that the Avro
> C++ library APIs have improved from the last release. However, it is not
> clear when a new release will be available (I asked on the JIRA Item for
> the next release [2] and received no response).
>
> I was wondering if there is a policy governing using other Apache projects
> or how people felt about the following options:
> 1.  Depend on a specific git commit through the third-party library system.
> 2.  Copy the necessary source code temporarily to our project, and change
> to using the next release when it is available.
> 3.  Fork the code we need (the main benefit I see here is being able to
> refactor it to avoid having to deal with exceptions, easier integration
> with our IO system and one less 3rd party dependency to deal with).
> 4.  Wait on the 1.9 release before proceeding.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-1209
> [2] https://issues.apache.org/jira/browse/AVRO-2250
