I noticed that test-data-related files are beginning to be checked in, for example:

https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc

I wanted to make sure this doesn't turn into a slippery slope where we
end up with several megabytes or more of test data files.

On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> Are there currently files that need to be moved?
>
> Thanks,
> Micah
>
> On Monday, July 22, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Sort of tangentially related, but while we are on the topic:
>>
>> Please, if you would, avoid checking binary test data files into the
>> main repository. Use https://github.com/apache/arrow-testing if you
>> truly need to check in binary data -- something to look out for in
>> code reviews.
>>
>> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Jacques,
>> > Thanks for the clarifications. I think the distinction is useful.
>> >
>> > > If people want to write adapters for Arrow, I see that as useful but very
>> > > different than writing native implementations and we should try to create a
>> > > clear delineation between the two.
>> >
>> >
>> > What do you think about creating a "contrib" directory and moving the JDBC
>> > and AVRO adapters into it? We should also probably provide more description
>> > in pom.xml to make it clear for downstream consumers.
>> >
>> > We should probably come up with a name other than adapters for
>> > readers/writers ("converters"?) and use it in the directory structure for
>> > the existing Orc implementation?
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> >
>> > > As I read through your responses, I think it might be useful to talk about
>> > > adapters versus native Arrow readers/writers. An adapter is something that
>> > > adapts an existing API to produce and/or consume Arrow data. A native
>> > > reader/writer is something that understands the format directly and does not
>> > > have intermediate representations or APIs the data moves through beyond
>> > > those that need to be used to complete the work.
>> > >
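>> > > To make the distinction concrete, a rough Java sketch of the two shapes
>> > > (the interface names here are hypothetical, not existing Arrow APIs;
>> > > VectorSchemaRoot and Avro's GenericRecord are the only real types used):
>> > >
>> > >   import org.apache.arrow.vector.VectorSchemaRoot;
>> > >   import org.apache.avro.generic.GenericRecord;
>> > >
>> > >   // Adapter: the existing Avro API materializes each record first, and
>> > >   // the adapter copies that intermediate representation into vectors.
>> > >   interface AvroToArrowAdapter {
>> > >     void append(GenericRecord record, VectorSchemaRoot target);
>> > >   }
>> > >
>> > >   // Native reader: decodes the Avro binary format directly into Arrow
>> > >   // buffers, with no intermediate object model in between.
>> > >   interface AvroNativeReader extends AutoCloseable {
>> > >     boolean loadNextBatch(VectorSchemaRoot target) throws java.io.IOException;
>> > >   }
>> > >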
>> > > If people want to write adapters for Arrow, I see that as useful but very
>> > > different than writing native implementations and we should try to create a
>> > > clear delineation between the two.
>> > >
>> > > Further comments inline.
>> > >
>> > >
>> > >> Could you expand on what level of detail you would like to see in a design
>> > >> document?
>> > >>
>> > >
>> > > A couple of paragraphs seems sufficient: these are the goals of the
>> > > implementation, we target existing functionality X, it is an adapter or it
>> > > is a native impl, these are the expected memory and processing
>> > > characteristics, etc.  I've never been one for a huge amount of design, but
>> > > I've seen a number of recent patches appear where there is no upfront
>> > > discussion. Making sure that multiple people buy into a design is the best
>> > > way to ensure long-term maintenance and use.
>> > >
>> > >
>> > >> I think this should be optional (the same arguments below about
>> > >> predicates apply, so I won't repeat them).
>> > >>
>> > >
>> > > Per my comments above, maybe adapter versus native reader clarifies
>> > > things. For example, I've been working on a native Avro reader
>> > > implementation. It is little more than chicken scratch at this point, but
>> > > its goals, vision and design are very different from the adapter that is
>> > > being produced atm.
>> > >
>> > >
>> > >> Can you clarify the intent of this objective?  Is it mainly to tie in with
>> > >> the existing Java Arrow memory bookkeeping?  Performance?  Something
>> > >> else?
>> > >>
>> > >
>> > > Arrow is designed to be off-heap. If you have large variable amounts of
>> > > on-heap memory in an application, it starts to make it very hard to make
>> > > decisions about off-heap versus on-heap memory since those divisions are by
>> > > and large static in nature. It's fine for short lived applications but for
>> > > long lived applications, if you're working with a large amount of data, you
>> > > want to keep most of your memory in one pool. In the context of Arrow, this
>> > > is going to naturally be off-heap memory.
>> > >
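>> > > For reference, a minimal sketch of what keeping data in the off-heap pool
>> > > looks like with the Java allocator (standard arrow-memory/arrow-vector
>> > > classes; the class name, vector name and values are arbitrary):
>> > >
>> > >   import org.apache.arrow.memory.BufferAllocator;
>> > >   import org.apache.arrow.memory.RootAllocator;
>> > >   import org.apache.arrow.vector.IntVector;
>> > >
>> > >   public class OffHeapExample {
>> > >     public static void main(String[] args) {
>> > >       // The vector's buffers come from the allocator's off-heap (direct)
>> > >       // pool, so the allocator can account for and cap total memory use.
>> > >       try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>> > >            IntVector vector = new IntVector("values", allocator)) {
>> > >         vector.allocateNew(3);
>> > >         vector.set(0, 1);
>> > >         vector.set(1, 2);
>> > >         vector.set(2, 3);
>> > >         vector.setValueCount(3);
>> > >         System.out.println(vector.get(2));  // prints 3
>> > >       }  // closing releases the off-heap buffers back to the allocator
>> > >     }
>> > >   }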
>> > >
>> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
>> > >> situation.  Starting off with a known good implementation of conversion to
>> > >> Arrow can allow us both to profile hot-spots and provide a comparison of
>> > >> implementations to verify correctness.
>> > >>
>> > >
>> > > I'm not clear what message we're sending as a community if we produce
>> > > low-performance components. The whole point of Arrow is to increase
>> > > performance, not decrease it. I'm targeting good, not perfect. At the same
>> > > time, from my perspective, Arrow development should not be approached in
>> > > the same way that general Java app development should be. If we hold a high
>> > > standard, we'll have fewer total integrations initially but I think we'll
>> > > solve more real-world problems.
>> > >
>> > >> There is also the question of how widely adoptable we want Arrow libraries
>> > >> to be.
>> > >> It isn't surprising to me that Impala's Avro reader is an order of
>> > >> magnitude faster than the stock Java one.  As far as I know Impala's is a
>> > >> C++ implementation that does JIT with LLVM.  We could try to use it as a
>> > >> basis for converting to Arrow but I think this might limit adoption in some
>> > >> circumstances.  Some organizations/people might be hesitant to adopt the
>> > >> technology due to:
>> > >> 1.  Use of JNI.
>> > >> 2.  Use of LLVM to do JIT.
>> > >>
>> > >> It seems that as long as we have a reasonably general interface to
>> > >> data-sources we should be able to optimize/refactor aggressively when
>> > >> needed.
>> > >>
>> > >
>> > > This is somewhat the crux of the problem. It goes a little bit to who our
>> > > consuming audience is and what we're trying to deliver. I'll also say that
>> > > trying to build a high-quality implementation on top of a low-quality
>> > > implementation or library-based adapter is worse than starting from
>> > > scratch. I believe this is especially true in Java, where developers are
>> > > trained to trust HotSpot and that things will be good enough. That is great
>> > > in a web app but not in systems software where we (and I expect others)
>> > > will deploy Arrow.
>> > >
>> > >
>> > >> >    3. Propose a generalized "reader" interface as opposed to making each
>> > >> >       reader have a different way to package/integrate.
>> > >>
>> > >> This also seems like a good idea.  Is this something you were thinking of
>> > >> doing or just a proposal that someone in the community should take up
>> > >> before we get too many more implementations?
>> > >>
>> > >
>> > > I don't have something in mind and didn't have a plan to build something;
>> > > I just want to make sure we start getting consistent early, as opposed to
>> > > once we have a bunch of readers/adapters.
>> > >
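>> > > As a strawman only, such a generalized interface might be as small as the
>> > > following (hypothetical name and shape, loosely similar to the existing
>> > > IPC ArrowReader; nothing here is an agreed-upon API):
>> > >
>> > >   import java.io.IOException;
>> > >   import org.apache.arrow.vector.VectorSchemaRoot;
>> > >
>> > >   // A source-agnostic batch reader: every adapter or native reader would
>> > >   // expose its output the same way, one batch at a time into a reusable
>> > >   // VectorSchemaRoot.
>> > >   interface ArrowSourceReader extends AutoCloseable {
>> > >     VectorSchemaRoot getVectorSchemaRoot();
>> > >     boolean loadNextBatch() throws IOException;
>> > >   }
>> > >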
