Greetings,
I noted that the current C++ API permits constructing core objects breaking
said classes invariants. The following recent issues were affected by this:
- ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
- ARROW-4774: segfault due to invalid Table with columns of different length
Consumers often assumes that the objects respect the invariants, e.g. by
dereferencing `array_data->buffers[i]->data()` or
`Array::GetValue(table.n_rows - 1)` .
Sample of classes which requires invariant in the constructor:
- ArrayData/Array: number and size of buffers depending on type
- ChunkedArray: Arrays of same type
- Column: same as ChunkedArray and Field must match array's type
- RecordBatch: number of columns and schema must match, columns of same size
- Table: columns must be of same size
Some classes provide static factory methods, notably:
- Most `shared_ptr<T> *::Make` methods, but they lack the Status return
type to indicate
failure, the consumer can at least check for nullptr
- `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good
candidate to follow
I suspect that mis-usage is only going to grow with the number of users and
language that binds to C++. I would like to propose a plan to tackle for the
0.14.0 release
1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by minimizing
output parameters.
2. Refactor normalized factory methods for each core object (ArrayData,
ChunkedArray, Column, RecordBatch, Table)
- Common naming, I suggest we stay with `Make`.
- Common return type, `StatusOr<shared_ptr<T>>`
3. Refactor existing Make methods to use new methods but preserve original
signature by losing error message, on top of marking them deprecated.
4. Mark non-validating constructors as deprecated and ideailly make every
"dangerous" constructor non-public.
We'd give 1-2 release for consumers to stop using the deprecated
methods/constructors.
François