Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Wes McKinney
I opened https://issues.apache.org/jira/browse/ARROW-1785 https://issues.apache.org/jira/browse/ARROW-1786 I can take the liberty of removing the metadata per ARROW-1785 in the next few days if there are no objections. We will want to add documentation to indicate which buffers must accompany eac

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Li Jin
However, this is currently broken in the Java refactor branch. I am fixing this in https://issues.apache.org/jira/browse/ARROW-1779 On Thu, Nov 9, 2017 at 12:32 PM, Li Jin wrote: > If null count is 0, the Java library sets the validity vectors to all 1s. > > https://github.com/apache/arrow/blob/mast

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Li Jin
If null count is 0, the Java library sets the validity vectors to all 1s. https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java#L61 On Thu, Nov 9, 2017 at 12:23 PM, Wes McKinney wrote: > Yep, see https://github.com/apache/arrow/blob/master/

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Wes McKinney
Yep, see https://github.com/apache/arrow/blob/master/format/Layout.md#null-bitmaps "Arrays having a 0 null count may choose to not allocate the null bitmap." I do not know what the Java library will do in the event of 0 null count and 0-length validity bitmap -- in theory this should be accounte
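
For illustration only, here is a minimal Python sketch of how a reader might account for a missing null bitmap. The function name `is_valid` and its parameters are hypothetical and do not come from any Arrow library; the sketch assumes that a 0 null count or a 0-length validity buffer means every slot is valid, and that bits are numbered from the least significant bit of each byte.

```python
def is_valid(validity: bytes, null_count: int, i: int) -> bool:
    """Return True if slot i holds a non-null value (hypothetical helper)."""
    # Assumption: with a 0 null count the producer may have sent no bitmap
    # at all, so a 0-length buffer is read as "every slot is valid".
    if null_count == 0 or len(validity) == 0:
        return True
    # Least-significant-bit numbering: slot i lives in byte i // 8, bit i % 8.
    return bool(validity[i >> 3] & (1 << (i & 7)))
```

Under these assumptions, a producer that writes a bitmap of all 1s and a producer that omits the bitmap are read identically.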

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Brian Hulette
Ah! It didn't occur to me that a producer could just send a length-0 buffer since the reader implementations should ignore it anyway. I don't mind the 16 byte cost of the metadata - I was referring to the bloat of a 100% valid vector, which could be substantial. Part of me wants to argue that
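
As a rough sketch of the size concern, the validity bitmap needs one bit per slot, so its size grows with the vector length while the buffer metadata stays fixed. The helper name `validity_bitmap_bytes` is hypothetical, and the 8-byte padding default is an assumption made for this example; implementations may pad differently.

```python
def validity_bitmap_bytes(length: int, alignment: int = 8) -> int:
    """Bytes needed for a validity bitmap covering `length` slots.

    `alignment` is an assumed padding multiple, not a value taken from
    this thread.
    """
    raw = (length + 7) // 8  # one bit per slot, rounded up to whole bytes
    # Round up to the next multiple of `alignment`.
    return (raw + alignment - 1) // alignment * alignment
```

For a vector of one million values this gives 125000 bytes of all-1 bits, against the fixed 16 bytes of buffer metadata mentioned above.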

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Wes McKinney
> So I'll go after the other validity vector - maybe producers should be > allowed to omit the validity vector in the index? I just think if the goal is > to reduce bloat then redundant validity vectors seems like a logical place to > trim. Well, the cost of the additional buffer metadata is on

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-09 Thread Brian Hulette
Good point. It's a nice feature of the format that a dictionary batch and a record batch with a single column look exactly the same when they represent the same logical type. So I'll go after the other validity vector - maybe producers should be allowed to omit the validity vector in the index

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Wes McKinney
Dictionary batches simply wrap a record batch with one “column”. There should be no difference (e.g. buffer layouts are the same) between the code handling the data in a dictionary and the code handling a normal record batch. In general, a dictionary may contain a null. On Wed, Nov 8, 2017 at 4:05 PM Bri
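
To make the point about nulls concrete, here is a small Python sketch of decoding a dictionary-encoded column. It is a simplification under stated assumptions: null slots are modelled as `None` rather than with validity bitmaps, and `decode_dictionary` is a hypothetical name, not an Arrow API.

```python
from typing import List, Optional, Sequence


def decode_dictionary(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[Optional[str]],
) -> List[Optional[str]]:
    """Expand dictionary indices into values (hypothetical helper)."""
    out: List[Optional[str]] = []
    for idx in indices:
        if idx is None:
            # The index slot itself is null.
            out.append(None)
        else:
            # The dictionary entry may also be null, so None can
            # come from either side.
            out.append(dictionary[idx])
    return out
```

For example, `decode_dictionary([0, None, 2, 1], ["a", None, "b"])` yields a null at position 1 because the index is null, and a null at position 3 because dictionary entry 1 is null.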

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Brian Hulette
Agreed, that sounds like a great solution to this problem - the layout information is redundant and it doesn't make sense to include it in every schema. Although I would argue we should write down exactly what buffers are supposed to go on the wire in the dictionary batches (i.e. value vector

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Wes McKinney
Per Jacques' comment in ARROW-1693 https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812, I think we should remove the buffer layout from the metadata. It would be a good idea to do this for

[DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-06 Thread Brian Hulette
We've been having some integration issues with reading Dictionary Vectors in the JS implementation - our current implementation can read arrow files and streams generated by Java, but not by C++. Most of this discussion is captured in ARROW-1693 [1]. It looks like ultimately the issue is that