hi folks,

I think some issues are being conflated here, so let me try to dig
through them. Let's first look at the two cited bugs that were fixed,
if I have this right:

* ARROW-4766: root cause dereferencing a null pointer
* ARROW-4774: root cause unsanitized Python user input

None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
since we currently allow for null buffers (the object, not the pointer
inside) in ArrayData. This has been discussed on the mailing list in
the past; "sanitizing" ArrayData to be free of null objects would be
expensive and my general attitude in the C++ library is that we should
not be in the business of paying extra CPU cycles for the 1-5% case
when it is unneeded in the 95-99% of cases. We have DCHECK assertions
to check these issues in debug builds while avoiding the costs in
release builds. In the case of checking edge cases in computational
kernels, suffice to say that we should check every kernel on length-0
input with null buffers to make sure this case is properly handled

In the case of ARROW-4774, we should work at the language binding
interface to make sure we have convenient "validating" constructors
that check user input for common problems. This can prevent the
duplication of this code across the binding layers (GLib, Python, R,
MATLAB, etc.)

 On the specific 4 steps mentioned by Francois, here are my thoughts:

1. Having StatusOr would be useful, but this is a programming convenience

2. There are a couple purposes of the static factory methods that
exist now, like Table::Make and RecordBatch::Make. One of the reasons
that I added them initially was because of the implicit constructor
behavior of std::vector inside a call to std::make_shared. If you have
a std::vector<T> argument in a class's constructor, then
std::make_shared<Klass>(..., {...}) will not result in the initializer
list constructing the std::vector<T>. This means some awkwardness like
having to assign a std::vector<T> lvalue and _then_ pass that to
std::make_shared<Klass>(..., vector_arg, ...).

I do not agree with refactoring these methods to use "validating"
constructors. Users of these C++ APIs should know what their
requirements are, and we provide in some cases a Validate() to spend
the extra cycles to assert preconditions. Therefore:

3. -1 on this
4. -1 also

Thanks
Wes

On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> HI François,
> This sounds great.  I would hope that as part of this we document the
> invariants (and possible sharp edges like zero length/all null and no null
> Arrays).
>
> Is your intent to allow languages binding to the C++ library go through the
> new API or will they still be able to use the "dangerous" ones?
>
> -Micah
>
> On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > Greetings,
> >
> > I noted that the current C++ API permits constructing core objects breaking
> > said classes invariants. The following recent issues were affected by this:
> >
> > - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
> > - ARROW-4774: segfault due to invalid Table with columns of different
> > length
> >
> > Consumers often assumes that the objects respect the invariants, e.g. by
> > dereferencing `array_data->buffers[i]->data()` or
> > `Array::GetValue(table.n_rows - 1)` .
> >
> > Sample of classes which requires invariant in the constructor:
> > - ArrayData/Array: number and size of buffers depending on type
> > - ChunkedArray: Arrays of same type
> > - Column: same as ChunkedArray and Field must match array's type
> > - RecordBatch: number of columns and schema must match, columns of same
> > size
> > - Table: columns must be of same size
> >
> > Some classes provide static factory methods, notably:
> > - Most `shared_ptr<T> *::Make` methods, but they lack the Status return
> > type to indicate
> >   failure, the consumer can at least check for nullptr
> > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good
> > candidate to follow
> >
> > I suspect that mis-usage is only going to grow with the number of users and
> > language that binds to C++. I would like to propose a plan to tackle for
> > the
> > 0.14.0 release
> >
> > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by minimizing
> >    output parameters.
> > 2. Refactor normalized factory methods for each core object (ArrayData,
> >    ChunkedArray, Column, RecordBatch, Table)
> >     - Common naming, I suggest we stay with `Make`.
> >     - Common return type, `StatusOr<shared_ptr<T>>`
> > 3. Refactor existing Make methods to use new methods but preserve original
> >    signature by losing error message, on top of marking them deprecated.
> > 4. Mark non-validating constructors as deprecated and ideailly make every
> > "dangerous" constructor non-public.
> >
> > We'd give 1-2 release for consumers to stop using the deprecated
> > methods/constructors.
> >
> > François
> >

Reply via email to