Thanks for your proposal.
Agreed that Arrow readers/writers should have high performance like the ORC
reader, and, as mentioned above, I think the current Avro implementation should
be positioned as an adapter rather than a native reader. I'm not sure whether
Arrow requires library-based adapters, but I have updated the current design in
ARROW-5845 [1] for your information anyway.


Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-5845


------------------------------------------------------------------
From: Jacques Nadeau <jacq...@apache.org>
Send Time: Monday, July 22, 2019, 09:16
To: dev <dev@arrow.apache.org>; Micah Kornfield <emkornfi...@gmail.com>
Subject: Re: [DISCUSS][JAVA] Designs & goals for readers/writers

As I read through your responses, I think it might be useful to talk about
adapters versus native Arrow readers/writers. An adapter is something that
adapts an existing API to produce and/or consume Arrow data. A native
reader/writer is something that understands the format directly and does not
have intermediate representations or APIs the data moves through, beyond those
that need to be used to complete the work.

If people want to write adapters for Arrow, I see that as useful but very
different from writing native implementations, and we should try to create a
clear delineation between the two.
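
To make that distinction concrete, here is a purely illustrative sketch of the
adapter pattern (it uses the public Avro and Arrow Java APIs; the file name and
field name are made up):

    import java.io.File;
    import java.io.IOException;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroAdapterSketch {
      public static void main(String[] args) throws IOException {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             BigIntVector ids = new BigIntVector("id", allocator);
             DataFileReader<GenericRecord> avro = new DataFileReader<>(
                 new File("data.avro"), new GenericDatumReader<GenericRecord>())) {
          ids.allocateNew();
          int row = 0;
          while (avro.hasNext()) {
            // The Avro library does all the decoding and hands back an
            // on-heap object ...
            GenericRecord record = avro.next();
            // ... and the adapter only copies values into the Arrow vector.
            ids.setSafe(row++, (Long) record.get("id"));
          }
          ids.setValueCount(row);
        }
      }
    }

A native reader, by contrast, would decode the Avro block and binary encodings
itself and write directly into Arrow buffers, with no GenericRecord (or other
intermediate representation) in between.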

Further comments inline.


> Could you expand on what level of detail you would like to see a design
> document?
>

A couple of paragraphs seems sufficient: these are the goals of the
implementation; we target existing functionality X; it is an adapter, or it is
a native implementation; these are the expected memory and processing
characteristics; etc. I've never been one for a huge amount of design, but I've
seen a number of recent patches appear where there is no upfront discussion.
Making sure that multiple people buy into a design is the best way to ensure
long-term maintenance and use.


> I think this should be optional (the same arguments below about predicates
> apply, so I won't repeat them).
>

Per my comments above, maybe the adapter versus native reader distinction
clarifies things. For example, I've been working on a native Avro read
implementation. It is little more than chicken scratch at this point, but its
goals, vision, and design are very different from the adapter that is being
produced at the moment.


> Can you clarify the intent of this objective?  Is it mainly to tie in with
> the existing Java Arrow memory bookkeeping?  Performance?  Something else?
>

Arrow is designed to be off-heap. If you have large, variable amounts of
on-heap memory in an application, it becomes very hard to make decisions about
off-heap versus on-heap memory, since those divisions are by and large static
in nature. That's fine for short-lived applications, but for long-lived
applications working with a large amount of data, you want to keep most of
your memory in one pool. In the context of Arrow, that is naturally going to
be off-heap memory.
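
As a minimal sketch of what that looks like with the Java allocator (the limit
and names below are chosen only for illustration):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class OffHeapSketch {
      public static void main(String[] args) {
        // All vector memory comes from the allocator's off-heap pool, not the
        // JVM heap; the allocator enforces the limit and reports leaks on close.
        try (BufferAllocator allocator = new RootAllocator(16 * 1024 * 1024);
             IntVector values = new IntVector("values", allocator)) {
          values.allocateNew(1024);   // off-heap buffers sized for 1024 ints
          values.setSafe(0, 42);
          values.setValueCount(1);
        }                             // closing releases the memory to the pool
      }
    }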


> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation.  Starting off with a known-good implementation of conversion to
> Arrow can allow us both to profile hot spots and to provide a comparison of
> implementations to verify correctness.
>

I'm not clear what message we're sending as a community if we produce
low-performance components. The whole point of Arrow is to increase
performance, not decrease it. I'm targeting good, not perfect. At the same
time, from my perspective, Arrow development should not be approached the same
way that general Java app development is. If we hold a high standard, we'll
have fewer total integrations initially, but I think we'll solve more
real-world problems.

> There is also the question of how widely adoptable we want Arrow libraries
> to be.
> It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one.  As far as I know, Impala's is a
> C++ implementation that does JIT with LLVM.  We could try to use it as a
> basis for converting to Arrow, but I think this might limit adoption in some
> circumstances.  Some organizations/people might be hesitant to adopt the
> technology due to:
> 1.  Use of JNI.
> 2.  Use of LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.
>

This is somewhat the crux of the problem. It goes a little bit to who our
consuming audience is and what we're trying to deliver. I'll also say that
trying to build a high-quality implementation on top of a low-quality
implementation or a library-based adapter is worse than starting from scratch.
I believe this is especially true in Java, where developers are trained to
trust HotSpot and to assume that things will be good enough. That is great in
a web app but not in systems software, where we (and I expect others) will
deploy Arrow.


> >    3. Propose a generalized "reader" interface as opposed to making each
> >    reader have a different way to package/integrate.
>
> This also seems like a good idea.  Is this something you were thinking of
> doing or just a proposal that someone in the community should take up
> before we get too many more implementations?
>

I don't have something in mind and didn't have a plan to build something; I
just want to make sure we start getting consistent early, as opposed to once
we already have a bunch of readers/adapters. One possible strawman is sketched
below.
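
Purely as a strawman to anchor the discussion (the interface and method names
below are invented here, not an existing Arrow API):

    import java.io.IOException;

    import org.apache.arrow.vector.VectorSchemaRoot;

    // A generalized batch-oriented reader: every format-specific implementation
    // (adapter or native) would expose batches of Arrow data the same way.
    public interface BatchReader extends AutoCloseable {
      // The root whose vectors are (re)populated on each call to loadNextBatch().
      VectorSchemaRoot getVectorSchemaRoot();

      // Loads the next batch into the root; returns false when the source is
      // exhausted.
      boolean loadNextBatch() throws IOException;
    }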
