Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
Agreed. I hope that I didn't come off as flippant with respect to performance. I was hoping to convey that I think focusing on performance before we have the semantics and high level design nailed down is not time well spent. I think the current design doesn't depend on the format, which is a goo

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Keith Kraus
> Personally, I do not care about the speed of IR processing right now. > Any non-trivial (and probably trivial too) computation done > by an IR consumer will dwarf the cost of IR processing. Of course, > we shouldn't prematurely pessimize either, but there's no reason > to spend time worrying abou

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Weston Pace
I believe you would need a JSON compatible version of the type system (including binary values) because you'd need to at least encode literals. However, I don't think that creating a human readable encoding of the Arrow type system is a bad thing in and of itself. We have tickets and get question

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jacob Quinn
> > I just thought of one other requirement: the format needs to support > arbitrary byte sequences. > Can you clarify why this is needed? Is it that custom_metadata maps should allow byte sequences as values? On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud wrote: > On Fri, Aug 13, 2021 at 11:43

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou wrote: > > On 13/08/2021 at 17:35, Phillip Cloud wrote: > > > >> I.e. make the ability to read and write by humans be more important than > >> speed of validation. > > > > I think I differ on whether the IR should be easy to read and write by > >

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Antoine Pitrou
On 13/08/2021 at 17:35, Phillip Cloud wrote: I.e. make the ability to read and write by humans be more important than speed of validation. I think I differ on whether the IR should be easy to read and write by humans. IR is going to be predominantly read and written by machines, though of

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
On Fri, Aug 13, 2021 at 8:03 AM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Hi, > > The requirements for the compute IR as I see it are: > > > > * Implementations in IR producer and consumer languages. > > * Strongly typed or the ability to easily validate a payload > > > > What abou

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Wes McKinney
On Fri, Aug 13, 2021 at 2:03 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Hi, > > The requirements for the compute IR as I see it are: > > > > * Implementations in IR producer and consumer languages. > > * Strongly typed or the ability to easily validate a payload > > > > What abou

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jorge Cardoso Leitão
Hi, The requirements for the compute IR as I see it are: > > * Implementations in IR producer and consumer languages. > * Strongly typed or the ability to easily validate a payload > What about: 1. easy to read and write by a large number of programming languages 2. easy to read and write by hum

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Phillip Cloud
On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > I agree with Antoine that we should weigh the pros and cons of flatbuffers > (or protobuf or thrift for that matter) over a more human-friendly, > simpler, format like json or MsgPack. I also struggle a bit t

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Jorge Cardoso Leitão
I agree with Antoine that we should weigh the pros and cons of flatbuffers (or protobuf or thrift for that matter) over a more human-friendly, simpler, format like json or MsgPack. I also struggle a bit to reason with the complexity of using flatbuffers for this. E.g. there is no async support for

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Antoine Pitrou
On 12/08/2021 at 15:05, Wes McKinney wrote: It seems that one adjacent problem here is how to make it simpler for third parties (especially ones that act as front end interfaces) to build and serialize/deserialize the IR structures with some kind of ready-to-go middleware library, written in

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Wes McKinney
On Thu, Aug 12, 2021 at 3:16 PM Neal Richardson wrote: > > > Maintain this "Arrow types and ComputeIR library" as an always > zero-dependency library to facilitate vendoring > > Would/should this hypothetical zero-dep, vendorable library also include > the IPC format? Or if you want to interact wi

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Neal Richardson
> Maintain this "Arrow types and ComputeIR library" as an always zero-dependency library to facilitate vendoring Would/should this hypothetical zero-dep, vendorable library also include the IPC format? Or if you want to interact with IPC in that case, the C data interface is the best/only option?

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Phillip Cloud
On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney wrote: > It seems that one adjacent problem here is how to make it simpler for > third parties (especially ones that act as front end interfaces) to > build and serialize/deserialize the IR structures with some kind of > ready-to-go middleware library,

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Wes McKinney
It seems that one adjacent problem here is how to make it simpler for third parties (especially ones that act as front end interfaces) to build and serialize/deserialize the IR structures with some kind of ready-to-go middleware library, written in a language like C++. To do that, one would need t

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Andrew Lamb
I support the idea of an independent repo that has the arrow flatbuffers format definition files. My rationale is that the Rust implementation has a copy of the `format` directory [1] and potential drift worries me (a bit). Having a single source of truth for the format that is not part of the lar

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021, 19:05 Weston Pace wrote: > >> The benefit is that IR components don't interact much with > `flatbuffers` or > >> `flatc` directly. > >> > [...] > >> > >> One counter-proposal might be to just put the compute IR IDL in a > separate > >> repo, > >> but that isn't tenable becau

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Weston Pace
>> The benefit is that IR components don't interact much with `flatbuffers` or >> `flatc` directly. >> [...] >> >> One counter-proposal might be to just put the compute IR IDL in a separate >> repo, >> but that isn't tenable because the compute IR needs arrow's type information >> contained in `Sch

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 23:06, Phillip Cloud wrote: On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: On 11/08/2021 at 22:16, Phillip Cloud wrote: Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additiona

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:21 PM David Li wrote: > If the worry is public distribution (i.e. requiring all downstream > projects to also run flatc in their builds) we could perhaps ship a package > that just consists of the generated code (though that's definitely more > packaging burden, and won'

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: > > On 11/08/2021 at 22:16, Phillip Cloud wrote: > > > > Yeah, that is a drawback here, though I don't see needing to run flatc > as a > > major downside given the upside > > of not having to write additional code to move between formats. >

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 22:20, David Li wrote: If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help wh

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 22:16, Phillip Cloud wrote: Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additional code to move between formats. That's only an advantage if you already know how to read the Arrow IPC f

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread David Li
If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help when you're doing development against in-progres

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou wrote: > > On 11/08/2021 at 22:02, Phillip Cloud wrote: > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou > wrote: > > > >> > >> On 11/08/2021 at 21:56, Phillip Cloud wrote: > >>> I can see how that might be a bit circular. Let me start from th

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 22:02, Phillip Cloud wrote: On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: On 11/08/2021 at 21:56, Phillip Cloud wrote: I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: > > On 11/08/2021 at 21:56, Phillip Cloud wrote: > > I can see how that might be a bit circular. Let me start from the > > perspective of requirements. We want to be able to reuse the arrow's > types > > and schema, without having to write a

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 21:56, Phillip Cloud wrote: I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compu

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compute-IR. I think that leaves only flatbuffers as an o

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou wrote: > > > On 11/08/2021 at 21:39, Phillip Cloud wrote: > > The benefit is that IR components don't interact much with `flatbuffers` > or > > `flatc` directly. > > > [...] > > > > One counter-proposal might be to just put the compute IR IDL in a

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou
On 11/08/2021 at 21:39, Phillip Cloud wrote: The benefit is that IR components don't interact much with `flatbuffers` or `flatc` directly. [...] One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs arrow's ty

[DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
Hi all, I'd like to bring up an idea from a recent thread ([1]) about moving the `format/` directory out of the primary apache/arrow repository. I understand from that thread there are some concerns about using submodules, and I definitely sympathize with those concerns. In talking with David Li