I think allowing out of range values would be a mistake.  Doesn't a change
in the metadata version indicates a non backwards compatible change?

On Tuesday, July 14, 2020, Antoine Pitrou <anto...@python.org> wrote:

>
> I think Micah is right.  Also, it seems (from checking the source) that
> the Flatbuffers verifier doesn't check that enums are in range, so we
> may possibly allow out-of-range values and interpret them as "highest
> supported version".
>
> Regards
>
> Antoine.
>
>
> Le 14/07/2020 à 00:53, Micah Kornfield a écrit :
> > To clarify on UBSAN and enums.  My understanding is:
> >
> > enum A { a = 1, b =2, c = 3};
> > class enum B : int16_t { a = 1, b = 2, c = 3};
> >
> > A a = static_cast<A>(4); // UB
> > B b = static_cast<B>(4); // Not UB.  Declaring the holding type makes
> this
> > allowable.
> >
> >
> > On Mon, Jul 13, 2020 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Please see [1].  I ran this arrow-ipc-read-write-test with UBSAN enabled
> >> and it passed (this isn't my normal dev environment so please double
> check).
> >>
> >>
> >> https://github.com/emkornfield/arrow/commit/
> 7fbd0fb95f7ea164284720428c7974b87b4b2443
> >>
> >> On Mon, Jul 13, 2020 at 3:12 PM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >>> I think this might be more complicated, let me see if i can write a
> test
> >>> that demonstrates what I'm talking about.
> >>>
> >>> On Mon, Jul 13, 2020 at 3:10 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>>
> >>>> Here's a patch that does the check
> >>>>
> >>>>
> >>>> https://github.com/wesm/arrow/commit/5bfdb4255a66a4ec62b1c36ba07682
> fad47df9a7
> >>>>
> >>>> Here is a serialized schema that uses a V6 version
> >>>>
> >>>>
> >>>> https://drive.google.com/file/d/1GiWh5yKXdMaLRWU5K4cnGW2ilybF0
> LF_/view?usp=sharing
> >>>>
> >>>> See in action
> >>>> https://gist.github.com/wesm/f9621a626d56491b0bd6c8a131acf518
> >>>>
> >>>> This seems hacky to me, but maybe it's OK?
> >>>>
> >>>> On Mon, Jul 13, 2020 at 4:53 PM Wes McKinney <wesmck...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> On Mon, Jul 13, 2020 at 4:43 PM Micah Kornfield <
> emkornfi...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> We don't have any test cases that have a future metadata version. I
> >>>>>>> made a branch where I added V6 and wrote an IPC message, then found
> >>>>>>> that I was unable to determine that it was out of bounds
> (presumably
> >>>>>>> UBSAN would error, though, but we need a runtime error outside of
> >>>>>>> ASAN/UBSAN).
> >>>>>>
> >>>>>> To clarify I don't think UBSAN will error on the existing generated
> >>>> code on future versions. I believe we had issues with parquet because
> the
> >>>> enums did not have an explicit type (compare [1] to [2]) .  The
> version
> >>>> check needs to be done in our code (comparing against MAX [3]).
> >>>>>>
> >>>>>> Does that align with your expectations?  So we don't get this for
> >>>> free, but I'm not sure I understand why this is difficult?
> >>>>>
> >>>>> If the metadata version comes through as the int16_t value 5
> >>>>> (currently 4 == V5), how do you get to a runtime error? The generated
> >>>>> Flatbuffers code is doing a static_cast of 5 to the enum which is UB.
> >>>>> Maybe I just don't know what I'm doing. It does not appear to be
> >>>>> possible to obtain the raw int16_t value without doing some kind of
> >>>>> hacking (e.g. reinterpret_cast of Message* to flatbuffers::Table* and
> >>>>> using GetField<int16_t>(VT_VERSION, 0))
> >>>>>
> >>>>> I can make a binary file that uses the currently non-existent V6 so
> >>>>> you can try to detect it and
> >>>>> raise an error
> >>>>>
> >>>>>
> >>>>>
> >>>>>> [1]
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/
> generated/parquet_types.h#L26
> >>>>>> [2]
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/
> generated/Schema_generated.h#L91
> >>>>>> [3]
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/
> generated/Schema_generated.h#L109
> >>>>>>
> >>>>>> On Mon, Jul 13, 2020 at 2:34 PM Wes McKinney <wesmck...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> On Mon, Jul 13, 2020 at 4:31 PM Micah Kornfield <
> >>>> emkornfi...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> That static cast is currently undefined behavior.
> >>>>>>>>
> >>>>>>>> Is ubsan reporting this?  When looking into the feature enum I
> >>>> tried to
> >>>>>>>> understand if that was valid. At the time I read the C++ spec* if
> >>>> the enum
> >>>>>>>> has an explicitly declared type, all values in that types range
> >>>> are
> >>>>>>>> supported.
> >>>>>>>
> >>>>>>> We don't have any test cases that have a future metadata version. I
> >>>>>>> made a branch where I added V6 and wrote an IPC message, then found
> >>>>>>> that I was unable to determine that it was out of bounds
> (presumably
> >>>>>>> UBSAN would error, though, but we need a runtime error outside of
> >>>>>>> ASAN/UBSAN).
> >>>>>>>
> >>>>>>>> The generated enums provide  a "max" [1] value that should be
> >>>> comparable
> >>>>>>>> against.
> >>>>>>>> <
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/
> generated/Schema_generated.h#L109
> >>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> * I am not a C++ lawyer
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/
> generated/Schema_generated.h#L109
> >>>>>>>>
> >>>>>>>> On Mon, Jul 13, 2020 at 2:19 PM Wes McKinney <wesmck...@gmail.com
> >
> >>>> wrote:
> >>>>>>>>
> >>>>>>>>> I've discovered while working on ARROW-9399 that it is very
> >>>> difficult
> >>>>>>>>> with the Flatbuffers API in C++ to detect a MetadataVersion [1]
> >>>> that
> >>>>>>>>> is higher than the current version.
> >>>>>>>>>
> >>>>>>>>> For example, suppose that 3 or 4 years from now we move from
> >>>> version
> >>>>>>>>> V5 to version V6. The generated Flatbuffers code looks like this
> >>>>>>>>>
> >>>>>>>>> org::apache::arrow::flatbuf::MetadataVersion version() const {
> >>>>>>>>>   return
> >>>>>>>>>
> >>>> static_cast<org::apache::arrow::flatbuf::MetadataVersion>(GetField<
> int16_t>(VT_VERSION,
> >>>>>>>>> 0));
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> That static cast is currently undefined behavior.
> >>>>>>>>>
> >>>>>>>>> One way to deal with this would be to add placeholder future
> >>>> versions
> >>>>>>>>> (e.g. V6 and V7).
> >>>>>>>>>
> >>>>>>>>> Another backward-and-forward-compatible option would be to
> >>>> return the
> >>>>>>>>> version as a short (int16_t) rather than the enum value, which
> >>>> is
> >>>>>>>>> subject to this brittleness.
> >>>>>>>>>
> >>>>>>>>> Either way unfortunately I think adding forward compatibility
> >>>> checks
> >>>>>>>>> is out of scope for 1.0.0 and the risk is low since we don't
> >>>>>>>>> anticipate bumping the version anytime soon.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Wes
> >>>>>>>>>
> >>>>>>>>> [1]:
> >>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
> >>>>>>>>>
> >>>>
> >>>
> >
>

Reply via email to