On Tue, Jul 12, 2016 at 10:42 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Two questions come to mind. > 1. Is it useful to have fixed width with list types exclusive of > binary types? > I think "useful" isn't a strong enough reason to add more types. It seems like a fairly rare occurrence and thus a premature optimization. (I could be convinced otherwise with more evidence). I propose we avoid adding types unless there are present use cases that people need to solve something. For example, if the Hive guys are in the process of adopting Arrow and this becomes a big memory/cpu issue for them. (I think the other memory/cpu benefits of Arrow would make this highly unlikely for at least a year or two.) There are a number of specializations that will come in time but I worry that if we grow the types too wide (especially initially), everyone is only going to support a subset of types and then we're going to have the same challenges of incompatibility. Once we have two or three users who all are working against variable width types and complaining about the overhead, it seems like we are sure to build the right thing and avoid bit rot (something that we (I) learned the hard way by adding all the types under the sun early in the Drill ValueVectors construction). > 2. Should binary/string types have their own separate memory > layout/be a primitive type? > I'm happy to cover this on a separate thread. My main argument is that the most basic types most people need come in this order from my experience: Int String Float Decimal Binary Note that I'm not focused on width here, just generally "what people use". So I think a string comes second in priority and ease of use/approachability necessitate this as a first class concept. This is beyond the fact that it has specialized rules that are separate from a List<Byte>. > > IMO, I think I think the answer to 1 is yes. Another example of a > use-case where this is handy is for the outputs of the aggregate > functions "histogram_numeric" and "percentile_approx" in Apache Hive > [1]. > > For #2, I'm still not sure I see the a clear benefit or harm either > way. The benefit of having there own type, is by definition, you > don't need to worry about ill formed arrays (e.g. having a byte > declared null). The potential cost is more code to deal with the > additional types (although we end up paying this cost a little bit > even if we treat everything as a list). > > Jacques can you elaborate more on where you see harm in the reduction? > If we can agree on the first question, it might pay to handle the > discussion of bytes/string as a primitive type on a separate thread (I > think it got lost previously due to many issues surfaced in the same > e-mail and a lack of time to do a google hangout. I apologize for > that). > > Thanks, > Micah > > [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > > On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <jacq...@apache.org> > wrote: > > Completely in support of fixed bit width types. Just thinking that it > > shouldn't be done by using a list. > > > > Not sure how the two are orthogonal. What am I missing? > > > > On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <wesmck...@gmail.com> > wrote: > >> > >> I think it would be good to revisit that discussion. This is somewhat > >> orthogonal -- i.e. having a fixed-width binary type that does not have > >> an accompanying list of n + 1 offsets. > >> > >> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <jacq...@apache.org> > >> wrote: > >> > I was further reflecting on the previous discussion on lists and > >> > binary/utf8. I think that treating strings (binary or utf8) as lists > is > >> > too > >> > much of reduction. This seems like a good example of how they are > >> > treated > >> > differently (beyond the previously discussed > not-possible-nullability). > >> > As > >> > such I'm -1 on this change and would prefer if we go back and further > >> > review the concept of treating a string of bits, or bytes as a > >> > "primitive" > >> > type. > >> > > >> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <wesmck...@gmail.com> > >> > wrote: > >> > > >> >> I'm +1 on this. I've seen fixed-width strings and other things in > many > >> >> different contexts. I would say that fixed-width binary is probably > >> >> the primary use case, but you could imaging casting int96 data to > >> >> fixed_list<3, int32> > >> >> > >> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield > >> >> <emkornfi...@gmail.com> > >> >> wrote: > >> >> > This came up in a code review a while ago, but what do people think > >> >> > of > >> >> > adding a fixed width list type to the memory layout spec. > >> >> > > >> >> > This would have the same layout as the current list type. Instead > of > >> >> > having a separate offset buffer to determine location and length of > >> >> > each list, the length would be stored as part of metadata and > offsets > >> >> > would be calculated using multiplication instead of lookups. > >> >> > > >> >> > One use case for this is an easy mapping to the > >> >> > "FIXED_LEN_BYTE_ARRAY" > >> >> > in parquet. > >> >> > > >> >> > If people like the idea I can file a JIRA and update the current > >> >> layout.md. > >> >> > > >> >> > Thanks, > >> >> > -Micah > >> >> > > > > >