I think allowing out of range values would be a mistake. Doesn't a change in the metadata version indicates a non backwards compatible change?
On Tuesday, July 14, 2020, Antoine Pitrou <anto...@python.org> wrote: > > I think Micah is right. Also, it seems (from checking the source) that > the Flatbuffers verifier doesn't check that enums are in range, so we > may possibly allow out-of-range values and interpret them as "highest > supported version". > > Regards > > Antoine. > > > Le 14/07/2020 à 00:53, Micah Kornfield a écrit : > > To clarify on UBSAN and enums. My understanding is: > > > > enum A { a = 1, b =2, c = 3}; > > class enum B : int16_t { a = 1, b = 2, c = 3}; > > > > A a = static_cast<A>(4); // UB > > B b = static_cast<B>(4); // Not UB. Declaring the holding type makes > this > > allowable. > > > > > > On Mon, Jul 13, 2020 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > >> Please see [1]. I ran this arrow-ipc-read-write-test with UBSAN enabled > >> and it passed (this isn't my normal dev environment so please double > check). > >> > >> > >> https://github.com/emkornfield/arrow/commit/ > 7fbd0fb95f7ea164284720428c7974b87b4b2443 > >> > >> On Mon, Jul 13, 2020 at 3:12 PM Micah Kornfield <emkornfi...@gmail.com> > >> wrote: > >> > >>> I think this might be more complicated, let me see if i can write a > test > >>> that demonstrates what I'm talking about. > >>> > >>> On Mon, Jul 13, 2020 at 3:10 PM Wes McKinney <wesmck...@gmail.com> > wrote: > >>> > >>>> Here's a patch that does the check > >>>> > >>>> > >>>> https://github.com/wesm/arrow/commit/5bfdb4255a66a4ec62b1c36ba07682 > fad47df9a7 > >>>> > >>>> Here is a serialized schema that uses a V6 version > >>>> > >>>> > >>>> https://drive.google.com/file/d/1GiWh5yKXdMaLRWU5K4cnGW2ilybF0 > LF_/view?usp=sharing > >>>> > >>>> See in action > >>>> https://gist.github.com/wesm/f9621a626d56491b0bd6c8a131acf518 > >>>> > >>>> This seems hacky to me, but maybe it's OK? > >>>> > >>>> On Mon, Jul 13, 2020 at 4:53 PM Wes McKinney <wesmck...@gmail.com> > >>>> wrote: > >>>>> > >>>>> On Mon, Jul 13, 2020 at 4:43 PM Micah Kornfield < > emkornfi...@gmail.com> > >>>> wrote: > >>>>>>> > >>>>>>> We don't have any test cases that have a future metadata version. I > >>>>>>> made a branch where I added V6 and wrote an IPC message, then found > >>>>>>> that I was unable to determine that it was out of bounds > (presumably > >>>>>>> UBSAN would error, though, but we need a runtime error outside of > >>>>>>> ASAN/UBSAN). > >>>>>> > >>>>>> To clarify I don't think UBSAN will error on the existing generated > >>>> code on future versions. I believe we had issues with parquet because > the > >>>> enums did not have an explicit type (compare [1] to [2]) . The > version > >>>> check needs to be done in our code (comparing against MAX [3]). > >>>>>> > >>>>>> Does that align with your expectations? So we don't get this for > >>>> free, but I'm not sure I understand why this is difficult? > >>>>> > >>>>> If the metadata version comes through as the int16_t value 5 > >>>>> (currently 4 == V5), how do you get to a runtime error? The generated > >>>>> Flatbuffers code is doing a static_cast of 5 to the enum which is UB. > >>>>> Maybe I just don't know what I'm doing. It does not appear to be > >>>>> possible to obtain the raw int16_t value without doing some kind of > >>>>> hacking (e.g. reinterpret_cast of Message* to flatbuffers::Table* and > >>>>> using GetField<int16_t>(VT_VERSION, 0)) > >>>>> > >>>>> I can make a binary file that uses the currently non-existent V6 so > >>>>> you can try to detect it and > >>>>> raise an error > >>>>> > >>>>> > >>>>> > >>>>>> [1] > >>>> https://github.com/apache/arrow/blob/master/cpp/src/ > generated/parquet_types.h#L26 > >>>>>> [2] > >>>> https://github.com/apache/arrow/blob/master/cpp/src/ > generated/Schema_generated.h#L91 > >>>>>> [3] > >>>> https://github.com/apache/arrow/blob/master/cpp/src/ > generated/Schema_generated.h#L109 > >>>>>> > >>>>>> On Mon, Jul 13, 2020 at 2:34 PM Wes McKinney <wesmck...@gmail.com> > >>>> wrote: > >>>>>>> > >>>>>>> On Mon, Jul 13, 2020 at 4:31 PM Micah Kornfield < > >>>> emkornfi...@gmail.com> wrote: > >>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> That static cast is currently undefined behavior. > >>>>>>>> > >>>>>>>> Is ubsan reporting this? When looking into the feature enum I > >>>> tried to > >>>>>>>> understand if that was valid. At the time I read the C++ spec* if > >>>> the enum > >>>>>>>> has an explicitly declared type, all values in that types range > >>>> are > >>>>>>>> supported. > >>>>>>> > >>>>>>> We don't have any test cases that have a future metadata version. I > >>>>>>> made a branch where I added V6 and wrote an IPC message, then found > >>>>>>> that I was unable to determine that it was out of bounds > (presumably > >>>>>>> UBSAN would error, though, but we need a runtime error outside of > >>>>>>> ASAN/UBSAN). > >>>>>>> > >>>>>>>> The generated enums provide a "max" [1] value that should be > >>>> comparable > >>>>>>>> against. > >>>>>>>> < > >>>> https://github.com/apache/arrow/blob/master/cpp/src/ > generated/Schema_generated.h#L109 > >>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> * I am not a C++ lawyer > >>>>>>>> > >>>>>>>> [1] > >>>>>>>> > >>>> https://github.com/apache/arrow/blob/master/cpp/src/ > generated/Schema_generated.h#L109 > >>>>>>>> > >>>>>>>> On Mon, Jul 13, 2020 at 2:19 PM Wes McKinney <wesmck...@gmail.com > > > >>>> wrote: > >>>>>>>> > >>>>>>>>> I've discovered while working on ARROW-9399 that it is very > >>>> difficult > >>>>>>>>> with the Flatbuffers API in C++ to detect a MetadataVersion [1] > >>>> that > >>>>>>>>> is higher than the current version. > >>>>>>>>> > >>>>>>>>> For example, suppose that 3 or 4 years from now we move from > >>>> version > >>>>>>>>> V5 to version V6. The generated Flatbuffers code looks like this > >>>>>>>>> > >>>>>>>>> org::apache::arrow::flatbuf::MetadataVersion version() const { > >>>>>>>>> return > >>>>>>>>> > >>>> static_cast<org::apache::arrow::flatbuf::MetadataVersion>(GetField< > int16_t>(VT_VERSION, > >>>>>>>>> 0)); > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> That static cast is currently undefined behavior. > >>>>>>>>> > >>>>>>>>> One way to deal with this would be to add placeholder future > >>>> versions > >>>>>>>>> (e.g. V6 and V7). > >>>>>>>>> > >>>>>>>>> Another backward-and-forward-compatible option would be to > >>>> return the > >>>>>>>>> version as a short (int16_t) rather than the enum value, which > >>>> is > >>>>>>>>> subject to this brittleness. > >>>>>>>>> > >>>>>>>>> Either way unfortunately I think adding forward compatibility > >>>> checks > >>>>>>>>> is out of scope for 1.0.0 and the risk is low since we don't > >>>>>>>>> anticipate bumping the version anytime soon. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Wes > >>>>>>>>> > >>>>>>>>> [1]: > >>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 > >>>>>>>>> > >>>> > >>> > > >