Re: [C++] Failing constructors and internal state

Edmon Begoli Sun, 10 Mar 2019 16:29:09 -0700

Do you guys have an example somewhere of this validated vs. unvalidated
code, and suspected performance impacts, and has anyone benchmarked any of
this?




On Sun, Mar 10, 2019 at 5:45 PM Wes McKinney <[email protected]> wrote:

> I think having consistent methods for both validated and unvalidated
> construction is a good idea. Being fairly passionate about
> microperformance, I don't think we should penalize responsible users
> of unsafe/unvalidated APIs (e.g. by taking them away and replacing
> them with variants featuring unavoidable computation), this can
> partially be handled through developer documentation (which, ah, we
> will need to write more of)
>
> On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <[email protected]>
> wrote:
> >
> > I agree there should always be a path to avoid the validation but I
> think there should also be an easy way to have validation included and a
> clear way to tell the difference.  IMO, having strong naming convention so
> callers can tell the difference, and code reviewers can focus more on less
> safe method usage, is important.  I will help new-comers to the project
> write safer code.  Which can either be refactored or called out in code
> review for performance issues.  It also provides a cue for all developers
> to consider if they are meeting the necessary requirements when using less
> safe methods.
> >
> > A straw-man proposal for naming conventions:
> > - Constructors are always unvalidated (should still have appropriate
> DCHECKs)
> > - "Make" calls are always unvalidated (should still have appropriate
> DHCECKs)
> > - "MakeValidated" ensures proper structural validation occur (but not
> data is validation).
> > - "MakeSanitized" ensures proper structural and data is validations occur
> >
> > As noted above it might only pay to refactor a small amount of current
> usage to the safer APIs.
> >
> > We could potentially go even further down the rabbit hole and try to
> define standard for a Hungarian notation [1] to make it more obvious what
> invariants are expected for a particular data-structure variable (I'm
> actually -.5 on this).
> >
> > As a personal bias, I would rather have slower code that has lower risk
> of crashing in production than faster code that does.  Obviously, there is
> a tradeoff here, and the ideal is faster code that won't segfault.
> >
> > Thoughts?
> >
> > -Micah
> >
> > [1]
> https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/
> >
> > On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <[email protected]>
> wrote:
> >>
> >> hi folks,
> >>
> >> I think some issues are being conflated here, so let me try to dig
> >> through them. Let's first look at the two cited bugs that were fixed,
> >> if I have this right:
> >>
> >> * ARROW-4766: root cause dereferencing a null pointer
> >> * ARROW-4774: root cause unsanitized Python user input
> >>
> >> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
> >> since we currently allow for null buffers (the object, not the pointer
> >> inside) in ArrayData. This has been discussed on the mailing list in
> >> the past; "sanitizing" ArrayData to be free of null objects would be
> >> expensive and my general attitude in the C++ library is that we should
> >> not be in the business of paying extra CPU cycles for the 1-5% case
> >> when it is unneeded in the 95-99% of cases. We have DCHECK assertions
> >> to check these issues in debug builds while avoiding the costs in
> >> release builds. In the case of checking edge cases in computational
> >> kernels, suffice to say that we should check every kernel on length-0
> >> input with null buffers to make sure this case is properly handled
> >>
> >> In the case of ARROW-4774, we should work at the language binding
> >> interface to make sure we have convenient "validating" constructors
> >> that check user input for common problems. This can prevent the
> >> duplication of this code across the binding layers (GLib, Python, R,
> >> MATLAB, etc.)
> >>
> >>  On the specific 4 steps mentioned by Francois, here are my thoughts:
> >>
> >> 1. Having StatusOr would be useful, but this is a programming
> convenience
> >>
> >> 2. There are a couple purposes of the static factory methods that
> >> exist now, like Table::Make and RecordBatch::Make. One of the reasons
> >> that I added them initially was because of the implicit constructor
> >> behavior of std::vector inside a call to std::make_shared. If you have
> >> a std::vector<T> argument in a class's constructor, then
> >> std::make_shared<Klass>(..., {...}) will not result in the initializer
> >> list constructing the std::vector<T>. This means some awkwardness like
> >> having to assign a std::vector<T> lvalue and _then_ pass that to
> >> std::make_shared<Klass>(..., vector_arg, ...).
> >>
> >> I do not agree with refactoring these methods to use "validating"
> >> constructors. Users of these C++ APIs should know what their
> >> requirements are, and we provide in some cases a Validate() to spend
> >> the extra cycles to assert preconditions. Therefore:
> >>
> >> 3. -1 on this
> >> 4. -1 also
> >>
> >> Thanks
> >> Wes
> >>
> >> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <[email protected]>
> wrote:
> >> >
> >> > HI François,
> >> > This sounds great.  I would hope that as part of this we document the
> >> > invariants (and possible sharp edges like zero length/all null and no
> null
> >> > Arrays).
> >> >
> >> > Is your intent to allow languages binding to the C++ library go
> through the
> >> > new API or will they still be able to use the "dangerous" ones?
> >> >
> >> > -Micah
> >> >
> >> > On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
> >> > [email protected]> wrote:
> >> >
> >> > > Greetings,
> >> > >
> >> > > I noted that the current C++ API permits constructing core objects
> breaking
> >> > > said classes invariants. The following recent issues were affected
> by this:
> >> > >
> >> > > - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
> >> > > - ARROW-4774: segfault due to invalid Table with columns of
> different
> >> > > length
> >> > >
> >> > > Consumers often assumes that the objects respect the invariants,
> e.g. by
> >> > > dereferencing `array_data->buffers[i]->data()` or
> >> > > `Array::GetValue(table.n_rows - 1)` .
> >> > >
> >> > > Sample of classes which requires invariant in the constructor:
> >> > > - ArrayData/Array: number and size of buffers depending on type
> >> > > - ChunkedArray: Arrays of same type
> >> > > - Column: same as ChunkedArray and Field must match array's type
> >> > > - RecordBatch: number of columns and schema must match, columns of
> same
> >> > > size
> >> > > - Table: columns must be of same size
> >> > >
> >> > > Some classes provide static factory methods, notably:
> >> > > - Most `shared_ptr<T> *::Make` methods, but they lack the Status
> return
> >> > > type to indicate
> >> > >   failure, the consumer can at least check for nullptr
> >> > > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a
> good
> >> > > candidate to follow
> >> > >
> >> > > I suspect that mis-usage is only going to grow with the number of
> users and
> >> > > language that binds to C++. I would like to propose a plan to
> tackle for
> >> > > the
> >> > > 0.14.0 release
> >> > >
> >> > > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by
> minimizing
> >> > >    output parameters.
> >> > > 2. Refactor normalized factory methods for each core object
> (ArrayData,
> >> > >    ChunkedArray, Column, RecordBatch, Table)
> >> > >     - Common naming, I suggest we stay with `Make`.
> >> > >     - Common return type, `StatusOr<shared_ptr<T>>`
> >> > > 3. Refactor existing Make methods to use new methods but preserve
> original
> >> > >    signature by losing error message, on top of marking them
> deprecated.
> >> > > 4. Mark non-validating constructors as deprecated and ideailly make
> every
> >> > > "dangerous" constructor non-public.
> >> > >
> >> > > We'd give 1-2 release for consumers to stop using the deprecated
> >> > > methods/constructors.
> >> > >
> >> > > François
> >> > >
>

Re: [C++] Failing constructors and internal state

Reply via email to