Greetings,

I noted that the current C++ API permits constructing core objects breaking
said classes invariants. The following recent issues were affected by this:

- ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
- ARROW-4774: segfault due to invalid Table with columns of different length

Consumers often assumes that the objects respect the invariants, e.g. by
dereferencing `array_data->buffers[i]->data()` or
`Array::GetValue(table.n_rows - 1)` .

Sample of classes which requires invariant in the constructor:
- ArrayData/Array: number and size of buffers depending on type
- ChunkedArray: Arrays of same type
- Column: same as ChunkedArray and Field must match array's type
- RecordBatch: number of columns and schema must match, columns of same size
- Table: columns must be of same size

Some classes provide static factory methods, notably:
- Most `shared_ptr<T> *::Make` methods, but they lack the Status return
type to indicate
  failure, the consumer can at least check for nullptr
- `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good
candidate to follow

I suspect that mis-usage is only going to grow with the number of users and
language that binds to C++. I would like to propose a plan to tackle for the
0.14.0 release

1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by minimizing
   output parameters.
2. Refactor normalized factory methods for each core object (ArrayData,
   ChunkedArray, Column, RecordBatch, Table)
    - Common naming, I suggest we stay with `Make`.
    - Common return type, `StatusOr<shared_ptr<T>>`
3. Refactor existing Make methods to use new methods but preserve original
   signature by losing error message, on top of marking them deprecated.
4. Mark non-validating constructors as deprecated and ideailly make every
"dangerous" constructor non-public.

We'd give 1-2 release for consumers to stop using the deprecated
methods/constructors.

François

Reply via email to