Thanks for the clarification, Antoine, very insightful. I'd also vote for keeping the existing model, for consistency.
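To illustrate why model 1 seems simpler from a kernel author's point of view: with distinct logical types, dispatching on the logical type id is enough to know the physical offset width. Below is a rough sketch only, and it assumes the LargeList type from Philipp's PR (referenced as [3] further down) eventually lands in C++ with a LargeListArray class and a Type::LARGE_LIST id mirroring the existing ListArray / Type::LIST; none of that exists yet, so treat those names as placeholders:

    // Sketch only: LargeListArray / Type::LARGE_LIST are assumed to mirror
    // the existing ListArray / Type::LIST, per the proposal in [3].
    #include <arrow/api.h>

    // Under model 1, the logical type id alone determines the physical
    // offset width, so a kernel can dispatch with a plain switch.
    int64_t SliceLength(const arrow::Array& arr, int64_t i) {
      switch (arr.type()->id()) {
        case arrow::Type::LIST: {  // 32-bit offsets
          const auto& list = static_cast<const arrow::ListArray&>(arr);
          return list.value_offset(i + 1) - list.value_offset(i);
        }
        case arrow::Type::LARGE_LIST: {  // 64-bit offsets (proposed)
          const auto& list = static_cast<const arrow::LargeListArray&>(arr);
          return list.value_offset(i + 1) - list.value_offset(i);
        }
        default:
          return -1;  // not a list-like array
      }
    }

Under model 2 the same kernel would additionally have to inspect which physical layout backs the logical type before touching the offsets buffer.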
On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi,
>
> I am not Jacques, but I will try to give my own point of view on this.
>
> The distinction between logical and physical types can be modelled in
> two different ways:
>
> 1) a physical type can denote several logical types, but a logical type
> can only have a single physical representation. This is currently the
> Arrow model.
>
> 2) a physical type can denote several logical types, and a logical type
> can also be denoted by several physical types. This is the Parquet model.
>
> (theoretically, there are two other possible models, but they are not
> very interesting to consider, since they don't seem to cater to concrete
> use cases)
>
> Model 1 is obviously more restrictive, while model 2 is more flexible.
> Model 2 could be said to be "higher level"; you see something similar if
> you compare Python's and C++'s typing systems. On the other hand, model 1
> provides a potentially simpler programming model for implementors of
> low-level kernels, as you can simply query the logical type of your data
> and you automatically know its physical type.
>
> The model chosen for Arrow is ingrained in its API. If we want to
> change the model, we'd better do it wholesale (implying probably a large
> refactoring and a significant number of unavoidable regressions) to
> avoid subjecting users to a confusing middle point.
>
> Also, as a side note, "convertibility" between different types can be
> a hairy subject... Having strict boundaries between types avoids being
> dragged into it too early.
>
> To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> should be a distinct logical type from List (resp. Binary), the same way
> Int64 is a distinct logical type from Int32.
>
> Regards
>
> Antoine.
>
>
> Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > Hello,
> >
> > I would like to understand where we stand on logical types and
> > physical types. As I understand it, this proposal is for the physical
> > representation.
> >
> > In the context of an execution engine, the concept of logical types
> > becomes more important, as two physical representations might have the
> > same semantic values, e.g. LargeList and List where all values fit in
> > 32 bits. A more complex example would be an Integer array and a
> > dictionary array whose values are integers.
> >
> > Is this only relevant for an execution engine? What about the (C++)
> > Array.Equals method and related comparison methods? This also touches
> > on the subject of type equality, e.g. dictionaries with different but
> > compatible encodings.
> >
> > Jacques, knowing that you worked on Parquet (which follows this model)
> > and Dremio, what is your opinion?
> >
> > François
> >
> > Some related tickets:
> > - https://jira.apache.org/jira/browse/ARROW-554
> > - https://jira.apache.org/jira/browse/ARROW-1741
> > - https://jira.apache.org/jira/browse/ARROW-3144
> > - https://jira.apache.org/jira/browse/ARROW-4097
> > - https://jira.apache.org/jira/browse/ARROW-5052
> >
> >
> > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> >> offsets to Lists, Strings and binary data types.
> >>
> >> Philipp started an implementation for the large list type [3], and I
> >> hacked together a potentially viable Java implementation [4].
> >>
> >> I'd like to kick off the discussion for getting these types voted on.
> >> I'm coupling them together because I think there are design
> >> considerations for how we evolve Schema.fbs.
> >>
> >> There are two proposed options:
> >>
> >> 1. The current PR proposal, which adds a new type LargeList:
> >>
> >>   // List with 64-bit offsets
> >>   table LargeList {}
> >>
> >> 2. As François suggested, it might be cleaner to parameterize List
> >> with the offset width. I suppose something like:
> >>
> >>   table List {
> >>     // only 32 bit and 64 bit are supported.
> >>     bitWidth: int = 32;
> >>   }
> >>
> >> I think Option 2 is cleaner and potentially better long-term, but I
> >> think it breaks forward compatibility for the existing Arrow
> >> libraries. If we proceed with Option 2, I would advocate making the
> >> change to Schema.fbs all at once for all types (assuming we think
> >> that 64-bit offsets are desirable for all types), along with
> >> forward-compatibility checks, to avoid multiple releases where
> >> forward compatibility is broken (by broken I mean the inability to
> >> detect that an implementation is receiving data it can't read).
> >> What are people's thoughts on this?
> >>
> >> Also, any other concerns with adding these types?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://issues.apache.org/jira/browse/ARROW-4810
> >> [2] https://issues.apache.org/jira/browse/ARROW-750
> >> [3] https://github.com/apache/arrow/pull/3848
> >> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
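One more note below the quote, on François's Array.Equals question: under the current model, List and LargeList would simply be distinct logical types, so type equality (and hence Array::Equals) treats them as different even when every offset happens to fit in 32 bits. A minimal sketch, assuming a large_list() type factory analogous to the existing list() one; that factory is part of the proposal, not today's C++ API:

    // Sketch: distinct logical types compare unequal, regardless of whether
    // the data would also be representable with 32-bit offsets.
    #include <arrow/api.h>
    #include <iostream>

    int main() {
      std::shared_ptr<arrow::DataType> t32 = arrow::list(arrow::int32());
      // large_list() is hypothetical here, mirroring list() per the proposal.
      std::shared_ptr<arrow::DataType> t64 = arrow::large_list(arrow::int32());
      std::cout << t32->ToString() << " equals " << t64->ToString() << "? "
                << std::boolalpha << t32->Equals(t64) << std::endl;  // false
      return 0;
    }

Any cross-type comparison or conversion would then live in compute kernels rather than in the type system, which fits Antoine's point about keeping strict boundaries between types.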