Thanks Jacques. I'm ok dropping the fixed width proposal for now and revisiting it at a later point. I'll start a thread later today to break off the discussion on adding string/binary as a primitive type.
-Micah

On Wed, Jul 13, 2016 at 7:49 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> On Tue, Jul 12, 2016 at 10:42 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> Two questions come to mind.
>> 1. Is it useful to have fixed width with list types exclusive of
>> binary types?
>
> I think "useful" isn't a strong enough reason to add more types. It seems
> like a fairly rare occurrence and thus a premature optimization. (I could be
> convinced otherwise with more evidence). I propose we avoid adding types
> unless there are present use cases that people need to solve something. For
> example, if the Hive guys are in the process of adopting Arrow and this
> becomes a big memory/cpu issue for them. (I think the other memory/cpu
> benefits of Arrow would make this highly unlikely for at least a year or
> two.)
>
> There are a number of specializations that will come in time but I worry
> that if we grow the types too wide (especially initially), everyone is only
> going to support a subset of types and then we're going to have the same
> challenges of incompatibility. Once we have two or three users who all are
> working against variable width types and complaining about the overhead, it
> seems like we are sure to build the right thing and avoid bit rot (something
> that we (I) learned the hard way by adding all the types under the sun early
> in the Drill ValueVectors construction).
>
>>
>> 2. Should binary/string types have their own separate memory
>> layout/be a primitive type?
>
> I'm happy to cover this on a separate thread. My main argument is that the
> most basic types most people need come in this order from my experience:
>
> Int
> String
> Float
> Decimal
> Binary
>
> Note that I'm not focused on width here, just generally "what people use".
> So I think a string comes second in priority and ease of use/approachability
> necessitate this as a first class concept.
> This is beyond the fact that it has specialized rules that are separate
> from a List<Byte>.
>
>>
>> IMO, I think the answer to 1 is yes. Another example of a
>> use-case where this is handy is for the outputs of the aggregate
>> functions "histogram_numeric" and "percentile_approx" in Apache Hive
>> [1].
>>
>> For #2, I'm still not sure I see a clear benefit or harm either
>> way. The benefit of having their own type is that, by definition, you
>> don't need to worry about ill-formed arrays (e.g. having a byte
>> declared null). The potential cost is more code to deal with the
>> additional types (although we end up paying this cost a little bit
>> even if we treat everything as a list).
>>
>> Jacques, can you elaborate more on where you see harm in the reduction?
>> If we can agree on the first question, it might pay to handle the
>> discussion of bytes/string as a primitive type on a separate thread (I
>> think it got lost previously due to many issues surfaced in the same
>> e-mail and a lack of time to do a Google Hangout. I apologize for
>> that).
>>
>> Thanks,
>> Micah
>>
>> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>>
>> On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> > Completely in support of fixed bit width types. Just thinking that it
>> > shouldn't be done by using a list.
>> >
>> > Not sure how the two are orthogonal. What am I missing?
>> >
>> > On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> I think it would be good to revisit that discussion. This is somewhat
>> >> orthogonal -- i.e. having a fixed-width binary type that does not have
>> >> an accompanying list of n + 1 offsets.
>> >>
>> >> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> >> > I was further reflecting on the previous discussion on lists and
>> >> > binary/utf8.
>> >> > I think that treating strings (binary or utf8) as lists is too
>> >> > much of a reduction. This seems like a good example of how they are
>> >> > treated differently (beyond the previously discussed
>> >> > not-possible-nullability). As such I'm -1 on this change and would
>> >> > prefer if we go back and further review the concept of treating a
>> >> > string of bits or bytes as a "primitive" type.
>> >> >
>> >> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >
>> >> >> I'm +1 on this. I've seen fixed-width strings and other things in
>> >> >> many different contexts. I would say that fixed-width binary is
>> >> >> probably the primary use case, but you could imagine casting int96
>> >> >> data to fixed_list<3, int32>
>> >> >>
>> >> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >> > This came up in a code review a while ago, but what do people
>> >> >> > think of adding a fixed width list type to the memory layout spec?
>> >> >> >
>> >> >> > This would have the same layout as the current list type. Instead
>> >> >> > of having a separate offset buffer to determine location and length
>> >> >> > of each list, the length would be stored as part of metadata and
>> >> >> > offsets would be calculated using multiplication instead of lookups.
>> >> >> >
>> >> >> > One use case for this is an easy mapping to the
>> >> >> > "FIXED_LEN_BYTE_ARRAY" in Parquet.
>> >> >> >
>> >> >> > If people like the idea I can file a JIRA and update the current
>> >> >> > layout.md.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > -Micah
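
For readers following the thread, here is a minimal Python sketch of the difference described in the original proposal: a variable-width list finds each element through a buffer of n + 1 offsets, while a fixed-width list keeps a single length in metadata and computes positions by multiplication. The function names, the `LIST_SIZE` constant, and the sample buffers are hypothetical and only for illustration; this is not the Arrow implementation or its API.

```python
from typing import List, Tuple


def variable_width_slot(offsets: List[int], i: int) -> Tuple[int, int]:
    """Return (start, end) of list i by looking it up in the n + 1 offsets buffer."""
    return offsets[i], offsets[i + 1]


def fixed_width_slot(list_size: int, i: int) -> Tuple[int, int]:
    """Return (start, end) of list i using only the list size stored in metadata."""
    start = i * list_size
    return start, start + list_size


def read_variable(values: bytes, offsets: List[int], i: int) -> bytes:
    """Slice element i out of the child buffer via an offsets lookup."""
    start, end = variable_width_slot(offsets, i)
    return values[start:end]


def read_fixed(values: bytes, list_size: int, i: int) -> bytes:
    """Slice element i out of the child buffer via multiplication, no offsets buffer."""
    start, end = fixed_width_slot(list_size, i)
    return values[start:end]


# Hypothetical data: three 4-byte values stored back to back in one child buffer.
values = b"abcdefghijkl"
offsets = [0, 4, 8, 12]  # variable-width layout: n + 1 offsets for n = 3 lists
LIST_SIZE = 4            # fixed-width layout: one length kept in type metadata

for i in range(3):
    # Both layouts address the same bytes; only the way the position is found differs.
    print(i, read_variable(values, offsets, i), read_fixed(values, LIST_SIZE, i))
```

When every element has the same length, the offsets buffer carries no extra information, which is the overhead the fixed-width proposal aimed to remove.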