Thank you for catching this inconsistency with the memory
representation -- I created
https://issues.apache.org/jira/browse/ARROW-62 to track the issue so
we can make a decision (and IMHO we should amend the format documents
to match the Java memory representation).

On Wed, Mar 9, 2016 at 11:31 PM, Daniel Robinson
<danrobinson...@gmail.com> wrote:
> Thanks, Wes. For hashing, my thought was that you could just hash the null
> bitmask concatenated with the values array (this should be doable in
> constant space for ordinary one-pass hash functions). If nulls are zeroed
> out, this will guarantee that equal primitive arrays hash to the same
> value. But if the empty slots in the value array are undefined and could
> have any value, this method would give you false negatives, and you would
> need to do something more complex and slower.
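>
> To make that concrete, here is a rough sketch (my own illustration with a
> hypothetical streaming hasher -- FNV-1a here just for simplicity, not an
> existing Arrow API) of hashing the bitmask and values buffers in one pass
> with constant extra space:
>
>     #include <cstddef>
>     #include <cstdint>
>
>     // Hypothetical streaming hasher: feed bytes incrementally, then finish.
>     struct StreamingHasher {
>       uint64_t state = 14695981039346656037ULL;  // FNV-1a offset basis
>       void Update(const uint8_t* data, size_t len) {
>         for (size_t i = 0; i < len; ++i) {
>           state ^= data[i];
>           state *= 1099511628211ULL;             // FNV-1a prime
>         }
>       }
>       uint64_t Finish() const { return state; }
>     };
>
>     // Hash a nullable primitive array as (null bitmask ++ values). Only
>     // well-defined across equal arrays if null slots in "values" are zeroed.
>     uint64_t HashNullablePrimitive(const uint8_t* null_bitmask,
>                                    size_t bitmask_bytes,
>                                    const uint8_t* values, size_t value_bytes) {
>       StreamingHasher hasher;
>       hasher.Update(null_bitmask, bitmask_bytes);
>       hasher.Update(values, value_bytes);
>       return hasher.Finish();
>     }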
>
> I was also curious about how this is treated in Drill ValueVectors. The
> example illustration in the introductory explanation at
> https://drill.apache.org/docs/value-vectors/ uses 0 for nulled values, for
> whatever that's worth.
>
> This is a tangent, but I noticed Drill (and thus for the moment Java Arrow:
> https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L369)
> uses 0 in the null bitmask to represent null values, while you use the NumPy
> MaskedArray convention of having 1 represent null values. Is there an
> advantage?
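>
> For reference, the difference is just in how a slot's bit is read; a rough
> sketch of the two conventions (my own illustration, not code from either
> implementation):
>
>     #include <cstddef>
>     #include <cstdint>
>
>     // Drill-style "bits" vector: bit == 1 means the slot is set (non-null),
>     // so a 0 bit marks a null value.
>     inline bool IsNullDrillStyle(const uint8_t* bits, size_t i) {
>       return ((bits[i / 8] >> (i % 8)) & 1) == 0;
>     }
>
>     // NumPy MaskedArray-style mask: bit == 1 means the slot is masked (null).
>     inline bool IsNullMaskedStyle(const uint8_t* bits, size_t i) {
>       return ((bits[i / 8] >> (i % 8)) & 1) == 1;
>     }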
>
> On Wednesday, March 9, 2016, Wes McKinney <w...@cloudera.com> wrote:
>
>> hey Dan,
>>
>> You bring up a good point. It's unclear whether this would make
>> hashing much less complex. If you ignore the null bits when computing
>> the hash function, it probably would, especially on repeated fields,
>> but we'd need to do some experiments to see whether there are cases
>> where skipping the null bits would yield bad hashes. I'm not sure.
>>
>> For the memory comparison question, I'd like to hear from the Drill
>> team on their experience with ValueVectors and how they handle the
>> "null" space in arrays. In the case of list / repeated values, you
>> *do* need to initialize the null offset slot to the previous value so
>> that offset computations always work. For example, the list<int32>
>> array
>>
>> [[0, 1, 2], null, null, [4, 5, 6]]
>>
>> would need to have
>>
>> nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0
>> offsets: 0 3 3 3 6
>> values 0 1 2 4 5 6
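>>
>> To see why, note that a reader computes each slot's length from adjacent
>> offsets, so repeating the previous offset makes null slots come out with
>> length 0 and no special-casing is needed. A small illustration (my own
>> sketch, not Arrow code):
>>
>>     #include <cstdint>
>>     #include <cstdio>
>>
>>     int main() {
>>       // Offsets for [[0, 1, 2], null, null, [4, 5, 6]]: null slots repeat
>>       // the previous offset, so length(i) = offsets[i + 1] - offsets[i]
>>       // is always well-defined (and 0 for null slots).
>>       const int32_t offsets[] = {0, 3, 3, 3, 6};
>>       for (int i = 0; i < 4; ++i) {
>>         std::printf("slot %d: offset=%d length=%d\n", i, offsets[i],
>>                     offsets[i + 1] - offsets[i]);
>>       }
>>       return 0;
>>     }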
>>
>> - Wes
>>
>> On Mon, Mar 7, 2016 at 4:08 PM, Daniel Robinson
>> <danrobinson...@gmail.com> wrote:
>> > Am I correct that under the current draft spec, the values in meaningless
>> > slots (such as slots corresponding to a 1 in an associated null bitmask, or
>> > unused slots in a sparse union) are undefined?
>> >
>> > If so, might it be worth considering requiring that they be zeroed out?
>> > This would add some time overhead to array building (for sparse unions,
>> > memory could be zeroed out when it is allocated; for nullable arrays, it
>> > may be more efficient to do so when nulls are added), but would remove
>> most
>> > of the complexity from hash calculations and equality comparisons. (See
>> > ARROW-32 and ARROW-38.)  For example, comparison of nullable primitive
>> > arrays could be done with 2 calls to memcmp.  As a bonus, it would speed
>> up
>> > operations for which null and 0 are equivalent or near-equivalent (such
>> as
>> > sum, sort on unsigned integers, ).
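>> >
>> > A minimal sketch of that two-memcmp comparison (illustrative only; it
>> > assumes both arrays have the same length and type, and that null slots
>> > in the value buffers are zeroed):
>> >
>> >     #include <cstddef>
>> >     #include <cstdint>
>> >     #include <cstring>
>> >
>> >     // One memcmp over the null bitmasks, one over the value buffers.
>> >     bool NullablePrimitiveEquals(const uint8_t* bitmask_a,
>> >                                  const uint8_t* values_a,
>> >                                  const uint8_t* bitmask_b,
>> >                                  const uint8_t* values_b,
>> >                                  size_t bitmask_bytes, size_t value_bytes) {
>> >       return std::memcmp(bitmask_a, bitmask_b, bitmask_bytes) == 0 &&
>> >              std::memcmp(values_a, values_b, value_bytes) == 0;
>> >     }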
>> >
>> > Other than the overhead, one potential downside is that it would make it
>> > impossible to just add a null_bitmask over an existing array without
>> > copying all of the memory. Perhaps this would best be done with a
>> > separately defined masking vector, which would not require that all
>> > masked values be set to 0 (and would thus require more complex hashing
>> > algorithms).
>>
