Hi Jacques,
I added more comments/questions inline. TL;DR: these all generally sound
like good goals, but I'm concerned that as policy they might lead to a
"boil the ocean" type approach that could delay useful functionality.

Thanks,
Micah

On Sun, Jul 21, 2019 at 2:41 PM Jacques Nadeau <jacq...@apache.org> wrote:

> I've seen a couple of recent pieces of work on generating new
> readers/writers for Arrow (Avro and discussion of CSV). I'd like to propose
> a couple of guidelines to help ensure a high quality bar:
>
>    1. Design review first - Before someone starts implementing a particular
>    reader/writer, let's ask for a basic design outline in jira, google docs,
>    etc.
>
Could you expand on what level of detail you would like to see in a design
document?

>    2. High bar for implementation: Having more readers for the sake of more
>    readers should not be the goal of the project. Instead, people should
>    expect Arrow Java readers to be high quality and faster than other readers
>    (even if the consumer has to do a final conversion to move from the Arrow
>    representation to their current internal representation). As such, I
>    propose the following two bars as part of design work:
>       1. Field selection support as part of reads - Make sure that each
>       implementation supports field selection (which columns to materialize)
>       as part of the interface.
>

I think this should be optional (the same arguments below about predicates
apply, so I won't repeat them).
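
To make the "optional" point concrete, here's a rough sketch of the shape I
have in mind (the names are hypothetical, not an existing Arrow API): field
selection lives in the read options, with "all columns" as the default so a
first implementation can simply ignore it.

    import java.util.List;

    // Hypothetical read options, illustration only: a null projection means
    // "materialize every column", so readers without field selection can
    // start by ignoring it and add support later.
    public final class ScanOptions {
      private final List<String> projectedFields;  // null = all columns

      public ScanOptions(List<String> projectedFields) {
        this.projectedFields = projectedFields;
      }

      public boolean projectsAllColumns() {
        return projectedFields == null;
      }

      public List<String> projectedFields() {
        return projectedFields;
      }
    }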


>       2. Configurable target batch size - Different systems will want to
>       control the target size of batch data.
>

Agree this should be supported by all readers.  I view the Avro
implementation as a work in progress, but I did raise this on the PRs and
expect it to be addressed before we call the Avro work done.
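
For the batch size knob, the control flow I'd expect in any reader is roughly
the sketch below (plain Java, no Arrow types, purely illustrative): accumulate
up to the configured number of rows, hand off the batch, and start the next
one, with a final partial batch at the end.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    public final class BatchingSketch {
      public static void main(String[] args) {
        int targetBatchSize = 2;  // would come from the reader's configuration
        Iterator<String> rows = Arrays.asList("a", "b", "c", "d", "e").iterator();
        List<String> batch = new ArrayList<>();
        while (rows.hasNext()) {
          batch.add(rows.next());
          if (batch.size() == targetBatchSize) {
            System.out.println("emit batch: " + batch);  // hand off to consumer
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          System.out.println("emit final partial batch: " + batch);
        }
      }
    }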


>       3. Minimize use of heap memory - Most of the core existing Arrow Java
>       libraries have been very focused on minimizing on-heap memory
>       consumption. While there may be some, we continue to try to keep the
>       footprint as small as possible. When creating new readers/writers, I
>       think we should target the same standard. For example, the current Avro
>       reader PR relies heavily on the Java Avro project's reader
>       implementation, which has very poor heap characteristics.
>

Can you clarify the intent of this objective?  Is it mainly to tie in with
the existing Arrow Java memory bookkeeping?  Performance?  Something else?
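
If it is the bookkeeping angle, I'm assuming the intent is roughly the
pattern below: decode directly into vectors backed by a BufferAllocator so
the data lands in tracked, off-heap buffers rather than in intermediate Java
objects.  (Minimal sketch using the existing allocator/vector APIs; the
1024-row loop is just filler.)

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public final class OffHeapSketch {
      public static void main(String[] args) {
        // The allocator accounts for every buffer handed out; the vector's
        // data lives in direct memory, not in on-heap Java objects.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("id", allocator)) {
          vector.allocateNew(1024);
          for (int i = 0; i < 1024; i++) {
            vector.setSafe(i, i);
          }
          vector.setValueCount(1024);
          System.out.println("tracked bytes: " + allocator.getAllocatedMemory());
        }
      }
    }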

>       4. Industry leading performance - People should expect that using
>       Arrow stuff is very fast. Releasing something under this banner means
>       we should focus on achieving that kind of target. To pick on the Avro
>       reader again here, our previous analysis has shown that the Java Avro
>       project's reader (not the Arrow connected impl) is frequently an order
>       of magnitude+ slower than some other open source Avro readers (such as
>       Impala's implementation), especially when applying any predicates or
>       projections.
>

I'm afraid this might lead to a "perfect is the enemy of the good"
situation.  Starting off with a known good implementation of conversion to
Arrow can allow us both to profile hot spots and to have a comparison of
implementations for verifying correctness.

There is also the question of how widely adoptable we want Arrow libraries
to be.
It isn't surprising to me that Impala's Avro reader is an order of
magnitude faster than the stock Java one.  As far as I know, Impala's is a
C++ implementation that does JIT with LLVM.  We could try to use it as a
basis for converting to Arrow, but I think this might limit adoption in some
circumstances.  Some organizations/people might be hesitant to adopt the
technology due to:
1.  Use of JNI.
2.  Use of LLVM to do JIT.

It seems that as long as we have a reasonably general interface to data
sources, we should be able to optimize/refactor aggressively when needed.

>       5. (Ideally) Predicate application as part of reads - In 99% of the
>       workloads we've seen, a user is applying one or more predicates when
>       reading data. Whatever performance you gain from a strong read
>       implementation will be drowned out in most cases if you fail to apply
>       predicates as part of reading (and thus have to materialize far more
>       records than you'll need in a minute).
>

I agree this would probably be useful, and something that should be
considered as part of a generalized reader.  It doesn't seem like it should
necessarily block implementations.  For instance, as far as I know this
isn't implemented in the C++ CSV reader (and I'm pretty sure the other file
format readers we have in C++ don't support it yet either).  Also, as far as
I know, Apache Spark treats predicate push-downs on its datasets as optional.
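
If we do add predicate support, I'd prefer it to be an opt-in capability,
roughly the way Spark treats pushdown: something like the hypothetical
sketch below, where a reader reports which predicates it could not apply so
the caller still evaluates those after reading.

    import java.util.List;

    // Hypothetical, illustration only; "Predicate" is a placeholder for
    // whatever filter representation we would standardize on.
    interface Predicate {
      boolean matches(Object row);
    }

    // Readers that can evaluate filters while decoding implement this;
    // readers that can't simply don't, and nothing breaks for the caller.
    public interface SupportsPredicatePushdown {
      // Accepts candidate predicates and returns the ones the reader could
      // NOT apply, which the caller must still evaluate on the batches.
      List<Predicate> pushPredicates(List<Predicate> predicates);
    }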


>    3. Propose a generalized "reader" interface as opposed to making each
>    reader have a different way to package/integrate.
>

This also seems like a good idea.  Is this something you were thinking of
doing, or just a proposal that someone in the community should take up
before we get too many more implementations?
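
For the sake of discussion, one possible shape (hypothetical names, not an
existing interface) is a pull-style batch iterator, with projection, target
batch size, and predicates living in per-format options rather than in the
core contract.  This intentionally mirrors the shape of the existing IPC
ArrowReader.

    import java.io.IOException;

    import org.apache.arrow.vector.VectorSchemaRoot;

    // Hypothetical generalized reader: every format-specific implementation
    // (Avro, CSV, ...) would hand batches to consumers the same way.
    public interface RecordBatchReader extends AutoCloseable {
      // The root whose vectors are refreshed by each call to loadNextBatch().
      VectorSchemaRoot getVectorSchemaRoot();

      // Loads the next batch into the root; returns false when exhausted.
      boolean loadNextBatch() throws IOException;
    }

A consumer would then just loop with while (reader.loadNextBatch()) { ... }
regardless of the underlying format.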
