Thanks for the clarification, Antoine, very insightful. I'd also vote for keeping the existing model, for consistency.
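To illustrate why model 1 seems simpler from a kernel author's point of view: with distinct logical types, dispatching on the logical type id is enough to know the physical offset width. Below is a rough sketch only, and it assumes the LargeList type from Philipp's PR (referenced as [3] further down) eventually lands in C++ with a LargeListArray class and a Type::LARGE_LIST id mirroring the existing ListArray / Type::LIST; none of that exists yet, so treat those names as placeholders:

    // Sketch only: LargeListArray / Type::LARGE_LIST are assumed to mirror
    // the existing ListArray / Type::LIST, per the proposal in [3].
    #include <arrow/api.h>

    // Under model 1, the logical type id alone determines the physical
    // offset width, so a kernel can dispatch with a plain switch.
    int64_t SliceLength(const arrow::Array& arr, int64_t i) {
      switch (arr.type()->id()) {
        case arrow::Type::LIST: {  // 32-bit offsets
          const auto& list = static_cast<const arrow::ListArray&>(arr);
          return list.value_offset(i + 1) - list.value_offset(i);
        }
        case arrow::Type::LARGE_LIST: {  // 64-bit offsets (proposed)
          const auto& list = static_cast<const arrow::LargeListArray&>(arr);
          return list.value_offset(i + 1) - list.value_offset(i);
        }
        default:
          return -1;  // not a list-like array
      }
    }

Under model 2 the same kernel would additionally have to inspect which physical layout backs the logical type before touching the offsets buffer.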
On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi,
>
> I am not Jacques, but I will try to give my own point of view on this.
>
> The distinction between logical and physical types can be modelled in
> two different ways:
>
> 1) a physical type can denote several logical types, but a logical type
> can only have a single physical representation. This is currently the
> Arrow model.
>
> 2) a physical type can denote several logical types, and a logical type
> can also be denoted by several physical types. This is the Parquet model.
>
> (theoretically, there are two other possible models, but they are not
> very interesting to consider, since they don't seem to cater to concrete
> use cases)
>
> Model 1 is obviously more restrictive, while model 2 is more flexible.
> Model 2 could be said to be "higher level"; you see something similar if
> you compare Python's and C++'s typing systems. On the other hand, model 1
> provides a potentially simpler programming model for implementors of
> low-level kernels, as you can simply query the logical type of your data
> and you automatically know its physical type.
>
> The model chosen for Arrow is ingrained in its API. If we want to
> change the model, we'd better do it wholesale (implying probably a large
> refactoring and a significant number of unavoidable regressions) to
> avoid subjecting users to a confusing middle point.
>
> Also, as a side note, "convertibility" between different types can be
> a hairy subject... Having strict boundaries between types avoids being
> dragged into it too early.
>
> To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> should be a distinct logical type from List (resp. Binary), the same way
> Int64 is a distinct logical type from Int32.
>
> Regards
>
> Antoine.
>
>
> Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > Hello,
> >
> > I would like to understand where we stand on logical types and
> > physical types. As I understand it, this proposal is for the physical
> > representation.
> >
> > In the context of an execution engine, the concept of logical types
> > becomes more important, as two physical representations might have the
> > same semantic values, e.g. LargeList and List where all values fit in
> > 32 bits. A more complex example would be an Integer array and a
> > dictionary array whose values are integers.
> >
> > Is this only relevant for an execution engine? What about the (C++)
> > Array.Equals method and related comparison methods? This also touches
> > on the subject of type equality, e.g. dictionaries with different but
> > compatible encodings.
> >
> > Jacques, knowing that you worked on Parquet (which follows this model)
> > and Dremio, what is your opinion?
> >
> > François
> >
> > Some related tickets:
> > - https://jira.apache.org/jira/browse/ARROW-554
> > - https://jira.apache.org/jira/browse/ARROW-1741
> > - https://jira.apache.org/jira/browse/ARROW-3144
> > - https://jira.apache.org/jira/browse/ARROW-4097
> > - https://jira.apache.org/jira/browse/ARROW-5052
> >
> >
> > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> >> offsets to Lists, Strings and binary data types.
> >>
> >> Philipp started an implementation for the large list type [3], and I
> >> hacked together a potentially viable Java implementation [4].
> >>
> >> I'd like to kick off the discussion for getting these types voted on.
> >> I'm coupling them together because I think there are design
> >> considerations for how we evolve Schema.fbs.
> >>
> >> There are two proposed options:
> >>
> >> 1. The current PR proposal, which adds a new type LargeList:
> >>
> >>   // List with 64-bit offsets
> >>   table LargeList {}
> >>
> >> 2. As François suggested, it might be cleaner to parameterize List
> >> with the offset width. I suppose something like:
> >>
> >>   table List {
> >>     // only 32 bit and 64 bit are supported.
> >>     bitWidth: int = 32;
> >>   }
> >>
> >> I think Option 2 is cleaner and potentially better long-term, but I
> >> think it breaks forward compatibility for the existing Arrow
> >> libraries. If we proceed with Option 2, I would advocate making the
> >> change to Schema.fbs all at once for all types (assuming we think
> >> that 64-bit offsets are desirable for all types), along with
> >> forward-compatibility checks, to avoid multiple releases where
> >> forward compatibility is broken (by broken I mean the inability to
> >> detect that an implementation is receiving data it can't read).
> >> What are people's thoughts on this?
> >>
> >> Also, any other concerns with adding these types?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://issues.apache.org/jira/browse/ARROW-4810
> >> [2] https://issues.apache.org/jira/browse/ARROW-750
> >> [3] https://github.com/apache/arrow/pull/3848
> >> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
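One more note below the quote, on François's Array.Equals question: under the current model, List and LargeList would simply be distinct logical types, so type equality (and hence Array::Equals) treats them as different even when every offset happens to fit in 32 bits. A minimal sketch, assuming a large_list() type factory analogous to the existing list() one; that factory is part of the proposal, not today's C++ API:

    // Sketch: distinct logical types compare unequal, regardless of whether
    // the data would also be representable with 32-bit offsets.
    #include <arrow/api.h>
    #include <iostream>

    int main() {
      std::shared_ptr<arrow::DataType> t32 = arrow::list(arrow::int32());
      // large_list() is hypothetical here, mirroring list() per the proposal.
      std::shared_ptr<arrow::DataType> t64 = arrow::large_list(arrow::int32());
      std::cout << t32->ToString() << " equals " << t64->ToString() << "? "
                << std::boolalpha << t32->Equals(t64) << std::endl;  // false
      return 0;
    }

Any cross-type comparison or conversion would then live in compute kernels rather than in the type system, which fits Antoine's point about keeping strict boundaries between types.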