Thank you for catching this inconsistency with the memory representation -- I created https://issues.apache.org/jira/browse/ARROW-62 to track the issue so we can make a decision (and IMHO we should amend the format documents to match the Java memory representation).
On Wed, Mar 9, 2016 at 11:31 PM, Daniel Robinson <danrobinson...@gmail.com> wrote:

> Thanks, Wes. For hashing, my thought was that you could just hash the null bitmask concatenated with the values array (this should be doable in constant space for ordinary one-pass hash functions). If nulls are zeroed out, this will guarantee that equal primitive arrays hash to the same value. But if the empty slots in the values array are undefined and could have any value, this method would give you false negatives, and you would need to do something more complex and slower.
>
> I was also curious about how this is treated in Drill ValueVectors. The example illustration in the introductory explanation at https://drill.apache.org/docs/value-vectors/ uses 0 for nulled values, for whatever that's worth.
>
> This is a tangent, but I noticed Drill (and thus, for the moment, Java Arrow: https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L369) uses 0 in the null bitmask to represent null values, while you use the NumPy MaskedArray convention of having 1 represent null values. Is there an advantage?
>
> On Wednesday, March 9, 2016, Wes McKinney <w...@cloudera.com> wrote:
>
>> hey Dan,
>>
>> You bring up a good point. It's unclear whether this would make hashing much less complex (if you ignore the null bits when computing the hash function, then it probably would, especially on repeated fields; we'd need to do some experiments to see if there are cases where skipping the null bits would yield bad hashes. I'm not sure).
>>
>> For the memory comparison question, I'd like to hear from the Drill team on their experience with ValueVectors and how they handle the "null" space in arrays. In the case of list / repeated values, you *do* need to initialize the null offset slot to the previous value so that offset computations always work.
>> For example, the list<int32> array
>>
>> [[0, 1, 2], null, null, [4, 5, 6]]
>>
>> would need to have
>>
>> nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0
>> offsets: 0 3 3 3 6
>> values: 0 1 2 4 5 6
>>
>> - Wes
>>
>> On Mon, Mar 7, 2016 at 4:08 PM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>>
>> > Am I correct that under the current draft spec, the values in meaningless slots (such as slots corresponding to a 1 in an associated null bitmask, or unused slots in a sparse union) are undefined?
>> >
>> > If so, might it be worth considering requiring that they be zeroed out? This would add some time overhead to array building (for sparse unions, memory could be zeroed out when it is allocated; for nullable arrays, it may be more efficient to do so when nulls are added), but it would remove most of the complexity from hash calculations and equality comparisons. (See ARROW-32 and ARROW-38.) For example, comparison of nullable primitive arrays could be done with 2 calls to memcmp. As a bonus, it would speed up operations for which null and 0 are equivalent or near-equivalent (such as sum, or sort on unsigned integers).
>> >
>> > Other than the overhead, one potential downside is that it would make it impossible to just add a null_bitmask over an existing array without copying all of the memory. Perhaps this would best be done with a separately defined masking vector, which would not require that all masked values be set to 0 (and would thus require more complex hashing algorithms).