I see what you're saying. I was thinking about span indices as they relate to data split across record batches. If you had a shared "reference" array, it could be treated like a dictionary: if span indices in different record batches all reference the same array, then that array could be sent once in a dictionary batch.
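To make the idea concrete, here is a rough sketch in plain Python. It does not use the Arrow libraries or the IPC format; lists stand in for arrays, and the names (`reference`, `batch_1`, `batch_2`, `resolve_spans`) are made up for illustration. It also assumes that `stop` is exclusive, which has not been settled in this thread.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, stop) with stop exclusive (an assumption).
Span = Tuple[int, int]


def resolve_spans(reference: List[str], spans: List[Span]) -> List[List[str]]:
    """Materialize each (start, stop) span as a slice of the shared reference array."""
    return [reference[start:stop] for start, stop in spans]


# Shared "reference" array, sent once (playing the role of a dictionary batch).
reference = ["a", "b", "c", "d", "e", "f"]

# Span indices split across two record batches; both index into `reference`.
batch_1 = [(0, 2), (1, 4)]
batch_2 = [(3, 6), (0, 1)]

resolved_1 = resolve_spans(reference, batch_1)
resolved_2 = resolve_spans(reference, batch_2)
```

Note that the spans in `batch_1` overlap, which is something List offsets cannot express, since each list element there runs from one offset to the next.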

On Wed, May 2, 2018 at 5:03 PM, Brian Hulette <brian.hule...@ccri.com> wrote:
> List also references another (data) array which can be a different size, but
> rather than requiring it to be represented with a second schema, we make it
> a child of the List type. We could do the same thing for a Span type, and
> give it a new type of buffer that contains start/stop indices rather than
> offsets.
>
> To Antoine's point, maybe there's not enough demand to justify defining this
> type right now. I definitely agree that it would be good to see an example
> dataset before adding something like this.
>
> Brian
>
>
> On 05/02/2018 03:54 PM, Wes McKinney wrote:
>>>
>>> Perhaps that could be an argument for making span a core logical type?
>>
>> I think if anything, this argues that it should not be. Because "span"
>> references another array, which can be a different size, you need two
>> schemas to be able to make sense of it.
>>
>> In either case, I would be interested to see what modifications would
>> be proposed to Schema.fbs and an example dataset described with such a
>> schema (that is a single array, instead of multiple -- i.e. a
>> non-composite representation).
>>
>> For the record, if there are sufficiently common "composite" data
>> representations, I don't see a problem with developing community
>> standards based on the building blocks we already have
>>
>> - Wes
>>
>> On Wed, May 2, 2018 at 3:38 PM, Brian Hulette <brian.hule...@ccri.com>
>> wrote:
>>>
>>> If this were accomplished at the application level, how would it work
>>> with the IPC formats? I'd think you'd need to have two separate files (or
>>> streams), since array 1 and array 2 will be different lengths. Perhaps
>>> that could be an argument for making span a core logical type?
>>>
>>> Brian
>>>
>>>
>>>
>>> On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
>>>>
>>>> On Wed, 2 May 2018 10:12:37 -0400
>>>> Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>
>>>>> It sounds like the "span" type could be implemented as a composite of
>>>>> multiple Arrow arrays / schemas:
>>>>>
>>>>> array 1 (data)
>>>>>   any schema
>>>>>
>>>>> array 2 (view)
>>>>>   struct <
>>>>>     start: int64,
>>>>>     stop: int64
>>>>>   >
>>>>>
>>>>> Unless I'm missing something, this feels like an application-level
>>>>> concern rather than something that needs to be addressed in the
>>>>> columnar format / metadata.
>>>>
>>>> Well, couldn't the same theoretically be said about list arrays?
>>>> In the end, I suppose it all depends whether there's enough demand to
>>>> make it a core logical type inside Arrow, rather than something people
>>>> write custom code for in their application.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>
>>>
>