Thanks for getting the discussion started, Micah! I'm +1 on this change and also slightly prefer Option 1. As Antoine mentions, there doesn't seem to be a clear benefit from Option 2, unless we also want to support 8- or 16-bit indices in the future, which seems unlikely. So going with Option 1 is OK, I think.

Best,
Philipp.

On Thu, Apr 11, 2019 at 7:06 AM Antoine Pitrou <anto...@python.org> wrote:

>
> On 11/04/2019 at 10:52, Micah Kornfield wrote:
> > ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets
> > to Lists, Strings and binary data types.
> >
> > Philipp started an implementation for the large list type [3] and I hacked
> > together a potentially viable Java implementation [4]
> >
> > I'd like to kick off the discussion for getting these types voted on. I'm
> > coupling them together because I think there are design considerations for
> > how we evolve Schema.fbs
> >
> > There are two proposed options:
> > 1. The current PR proposal, which adds a new type LargeList:
> >   // List with 64-bit offsets
> >   table LargeList {}
> >
> > 2. As François suggested, it might be cleaner to parameterize List with
> > offset width. I suppose something like:
> >
> > table List {
> >   // only 32 bit and 64 bit is supported.
> >   bitWidth: int = 32;
> > }
> >
> > I think Option 2 is cleaner and potentially better long-term, but I think
> > it breaks forward compatibility of the existing arrow libraries. If we
> > proceed with Option 2, I would advocate making the change to Schema.fbs all
> > at once for all types (assuming we think that 64-bit offsets are desirable
> > for all types) along with future compatibility checks to avoid multiple
> > releases where future compatibility is broken (by broken I mean the
> > inability to detect that an implementation is receiving data it can't
> > read). What are people's thoughts on this?
>
> I think Option 1 is ok. Making List / String / Binary parameterizable
> doesn't bring anything *concretely*, since the types will not be
> physically interchangeable. The cost of breaking compatibility should
> be offset by a compelling benefit, which doesn't seem to exist here.
>
> Of course, implementations are free to refactor their internals to avoid
> code duplication (for example the C++ ListBuilder and LargeListBuilder
> classes could be instances of a BaseListBuilder<IndexType> generic type)...
>
> Regards
>
> Antoine.
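
To make Antoine's last point a bit more concrete, here is a rough sketch of how a single builder template parameterized on the offset type could back both the 32-bit and 64-bit list variants. This is only an illustration under my own assumptions, not the actual Arrow C++ code: the members `Append`, `length`, `offsets` and `values` are hypothetical, and the child values are hard-coded to `int64_t` to keep it short.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical sketch, NOT the Arrow C++ API: one builder template that is
// instantiated once per offset width.
template <typename OffsetType>
class BaseListBuilder {
  static_assert(std::is_same<OffsetType, int32_t>::value ||
                    std::is_same<OffsetType, int64_t>::value,
                "only 32-bit and 64-bit offsets are supported");

 public:
  // The offsets buffer always starts with a single 0 entry.
  BaseListBuilder() : offsets_{0} {}

  // Append one list slot containing `values` and record its end offset.
  void Append(const std::vector<int64_t>& values) {
    const uint64_t new_end =
        static_cast<uint64_t>(values_.size()) + values.size();
    const uint64_t max_end =
        static_cast<uint64_t>(std::numeric_limits<OffsetType>::max());
    if (new_end > max_end) {
      throw std::overflow_error("child length exceeds offset capacity");
    }
    values_.insert(values_.end(), values.begin(), values.end());
    offsets_.push_back(static_cast<OffsetType>(new_end));
  }

  // Number of list slots appended so far.
  int64_t length() const { return static_cast<int64_t>(offsets_.size()) - 1; }

  const std::vector<OffsetType>& offsets() const { return offsets_; }
  const std::vector<int64_t>& values() const { return values_; }

 private:
  std::vector<OffsetType> offsets_;  // length() + 1 entries
  std::vector<int64_t> values_;      // flattened child values
};

// The two concrete builders share all logic and differ only in offset width.
using ListBuilder = BaseListBuilder<int32_t>;
using LargeListBuilder = BaseListBuilder<int64_t>;
```

The two aliases share one code path, yet their offsets buffers have different element widths (4 vs. 8 bytes), which is why the resulting arrays would still not be physically interchangeable even if the schema were parameterized.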