I think having consistent methods for both validated and unvalidated
construction is a good idea. Being fairly passionate about
microperformance, I don't think we should penalize responsible users
of unsafe/unvalidated APIs (e.g. by taking them away and replacing
them with variants featuring unavoidable computation), this can
partially be handled through developer documentation (which, ah, we
will need to write more of)

On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I agree there should always be a path to avoid the validation but I think 
> there should also be an easy way to have validation included and a clear way 
> to tell the difference.  IMO, having strong naming convention so callers can 
> tell the difference, and code reviewers can focus more on less safe method 
> usage, is important.  I will help new-comers to the project write safer code. 
>  Which can either be refactored or called out in code review for performance 
> issues.  It also provides a cue for all developers to consider if they are 
> meeting the necessary requirements when using less safe methods.
>
> A straw-man proposal for naming conventions:
> - Constructors are always unvalidated (should still have appropriate DCHECKs)
> - "Make" calls are always unvalidated (should still have appropriate DHCECKs)
> - "MakeValidated" ensures proper structural validation occur (but not data is 
> validation).
> - "MakeSanitized" ensures proper structural and data is validations occur
>
> As noted above it might only pay to refactor a small amount of current usage 
> to the safer APIs.
>
> We could potentially go even further down the rabbit hole and try to define 
> standard for a Hungarian notation [1] to make it more obvious what invariants 
> are expected for a particular data-structure variable (I'm actually -.5 on 
> this).
>
> As a personal bias, I would rather have slower code that has lower risk of 
> crashing in production than faster code that does.  Obviously, there is a 
> tradeoff here, and the ideal is faster code that won't segfault.
>
> Thoughts?
>
> -Micah
>
> [1] https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/
>
> On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi folks,
>>
>> I think some issues are being conflated here, so let me try to dig
>> through them. Let's first look at the two cited bugs that were fixed,
>> if I have this right:
>>
>> * ARROW-4766: root cause dereferencing a null pointer
>> * ARROW-4774: root cause unsanitized Python user input
>>
>> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
>> since we currently allow for null buffers (the object, not the pointer
>> inside) in ArrayData. This has been discussed on the mailing list in
>> the past; "sanitizing" ArrayData to be free of null objects would be
>> expensive and my general attitude in the C++ library is that we should
>> not be in the business of paying extra CPU cycles for the 1-5% case
>> when it is unneeded in the 95-99% of cases. We have DCHECK assertions
>> to check these issues in debug builds while avoiding the costs in
>> release builds. In the case of checking edge cases in computational
>> kernels, suffice to say that we should check every kernel on length-0
>> input with null buffers to make sure this case is properly handled
>>
>> In the case of ARROW-4774, we should work at the language binding
>> interface to make sure we have convenient "validating" constructors
>> that check user input for common problems. This can prevent the
>> duplication of this code across the binding layers (GLib, Python, R,
>> MATLAB, etc.)
>>
>>  On the specific 4 steps mentioned by Francois, here are my thoughts:
>>
>> 1. Having StatusOr would be useful, but this is a programming convenience
>>
>> 2. There are a couple purposes of the static factory methods that
>> exist now, like Table::Make and RecordBatch::Make. One of the reasons
>> that I added them initially was because of the implicit constructor
>> behavior of std::vector inside a call to std::make_shared. If you have
>> a std::vector<T> argument in a class's constructor, then
>> std::make_shared<Klass>(..., {...}) will not result in the initializer
>> list constructing the std::vector<T>. This means some awkwardness like
>> having to assign a std::vector<T> lvalue and _then_ pass that to
>> std::make_shared<Klass>(..., vector_arg, ...).
>>
>> I do not agree with refactoring these methods to use "validating"
>> constructors. Users of these C++ APIs should know what their
>> requirements are, and we provide in some cases a Validate() to spend
>> the extra cycles to assert preconditions. Therefore:
>>
>> 3. -1 on this
>> 4. -1 also
>>
>> Thanks
>> Wes
>>
>> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > HI François,
>> > This sounds great.  I would hope that as part of this we document the
>> > invariants (and possible sharp edges like zero length/all null and no null
>> > Arrays).
>> >
>> > Is your intent to allow languages binding to the C++ library go through the
>> > new API or will they still be able to use the "dangerous" ones?
>> >
>> > -Micah
>> >
>> > On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
>> > fsaintjacq...@gmail.com> wrote:
>> >
>> > > Greetings,
>> > >
>> > > I noted that the current C++ API permits constructing core objects 
>> > > breaking
>> > > said classes invariants. The following recent issues were affected by 
>> > > this:
>> > >
>> > > - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
>> > > - ARROW-4774: segfault due to invalid Table with columns of different
>> > > length
>> > >
>> > > Consumers often assumes that the objects respect the invariants, e.g. by
>> > > dereferencing `array_data->buffers[i]->data()` or
>> > > `Array::GetValue(table.n_rows - 1)` .
>> > >
>> > > Sample of classes which requires invariant in the constructor:
>> > > - ArrayData/Array: number and size of buffers depending on type
>> > > - ChunkedArray: Arrays of same type
>> > > - Column: same as ChunkedArray and Field must match array's type
>> > > - RecordBatch: number of columns and schema must match, columns of same
>> > > size
>> > > - Table: columns must be of same size
>> > >
>> > > Some classes provide static factory methods, notably:
>> > > - Most `shared_ptr<T> *::Make` methods, but they lack the Status return
>> > > type to indicate
>> > >   failure, the consumer can at least check for nullptr
>> > > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good
>> > > candidate to follow
>> > >
>> > > I suspect that mis-usage is only going to grow with the number of users 
>> > > and
>> > > language that binds to C++. I would like to propose a plan to tackle for
>> > > the
>> > > 0.14.0 release
>> > >
>> > > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by 
>> > > minimizing
>> > >    output parameters.
>> > > 2. Refactor normalized factory methods for each core object (ArrayData,
>> > >    ChunkedArray, Column, RecordBatch, Table)
>> > >     - Common naming, I suggest we stay with `Make`.
>> > >     - Common return type, `StatusOr<shared_ptr<T>>`
>> > > 3. Refactor existing Make methods to use new methods but preserve 
>> > > original
>> > >    signature by losing error message, on top of marking them deprecated.
>> > > 4. Mark non-validating constructors as deprecated and ideailly make every
>> > > "dangerous" constructor non-public.
>> > >
>> > > We'd give 1-2 release for consumers to stop using the deprecated
>> > > methods/constructors.
>> > >
>> > > François
>> > >

Reply via email to