TP Boudreau created ARROW-5618:
----------------------------------

             Summary: Using deprecated Int96 storage for timestamps triggers 
integer overflow in some cases
                 Key: ARROW-5618
                 URL: https://issues.apache.org/jira/browse/ARROW-5618
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: TP Boudreau


When storing Arrow timestamps in Parquet files using the Int96 storage format, 
certain combinations of array lengths and validity bitmasks cause an integer 
overflow error on read.  It's not immediately clear whether the Arrow/Parquet 
writer is storing zeroes when it should be storing positive values, or whether 
the reader is inappropriately calculating a nanoseconds value from zeroed 
inputs (perhaps missing the null bit flag).  It is also not immediately clear 
why only columns of certain lengths seem to be affected.

Probably the quickest way to reproduce this undefined behavior is to alter the 
existing unit test UseDeprecatedInt96 (in file 
.../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling its 
column lengths (repeating the same values), followed by 'make unittest' using 
clang-7 with sanitizers enabled.  (Here's a patch applicable to current master 
that changes the test as described: [1]; I used the following cmake command to 
build my environment: [2].)  You should get a log something like [3].  If 
requested, I'll see if I can put together a stand-alone minimal test case that 
induces the behavior.

The quick hack at [4] prevents the integer overflows, but it is included only 
to confirm the proximate cause of the bug: the Julian days field of the Int96 
value appears to be zero when a strictly positive number is expected.
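
To illustrate why a zero Julian day overflows, here is a minimal sketch of an 
Int96-to-nanoseconds conversion with explicit range checks.  It is not the 
Arrow/Parquet reader code: the names Int96Sketch and Int96ToUnixNanos are 
hypothetical, and the layout (nanoseconds-of-day in the first 8 bytes, Julian 
day in the last 4) and the Unix epoch Julian day of 2440588 are assumptions 
stated for the example.  With a Julian day of zero, the day offset is -2440588, 
and multiplying that by the nanoseconds per day gives a magnitude of roughly 
2.1e20, well outside the int64 range of about 9.2e18.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for an Int96 timestamp: three 32-bit words, where
// words 0-1 hold the nanoseconds within the day and word 2 the Julian day.
struct Int96Sketch {
  uint32_t value[3];
};

// Assumed constants for this sketch.
constexpr int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Returns nanoseconds since the Unix epoch, or std::nullopt when the result
// would not fit in int64_t (instead of invoking signed-overflow UB).
std::optional<int64_t> Int96ToUnixNanos(const Int96Sketch& v) {
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
  constexpr int64_t kMin = std::numeric_limits<int64_t>::min();

  const uint64_t nanos_of_day =
      (static_cast<uint64_t>(v.value[1]) << 32) | v.value[0];
  if (nanos_of_day >= static_cast<uint64_t>(kNanosPerDay)) return std::nullopt;

  // A zeroed Julian day yields days == -2440588 here.
  const int64_t days = static_cast<int64_t>(v.value[2]) - kJulianDayOfUnixEpoch;
  if (days > kMax / kNanosPerDay || days < kMin / kNanosPerDay) {
    return std::nullopt;  // days * kNanosPerDay would overflow
  }

  const int64_t day_nanos = days * kNanosPerDay;
  const int64_t n = static_cast<int64_t>(nanos_of_day);
  if (day_nanos > kMax - n) return std::nullopt;  // the final add would overflow
  return day_nanos + n;
}
```

Under these assumptions, an all-zero Int96 (as a null slot might contain) 
returns std::nullopt rather than overflowing, while a Julian day of 2440588 
with zero nanoseconds maps to 0.  The sketch requires C++17 for std::optional.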

I've assigned the issue to myself and I'll start looking into the root cause of 
this.

[1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
[2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
[3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
[4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
