Greetings, I noted that the current C++ API permits constructing core objects breaking said classes invariants. The following recent issues were affected by this:
- ARROW-4766: segfault due to invalid ArrayData with nullptr buffer - ARROW-4774: segfault due to invalid Table with columns of different length Consumers often assumes that the objects respect the invariants, e.g. by dereferencing `array_data->buffers[i]->data()` or `Array::GetValue(table.n_rows - 1)` . Sample of classes which requires invariant in the constructor: - ArrayData/Array: number and size of buffers depending on type - ChunkedArray: Arrays of same type - Column: same as ChunkedArray and Field must match array's type - RecordBatch: number of columns and schema must match, columns of same size - Table: columns must be of same size Some classes provide static factory methods, notably: - Most `shared_ptr<T> *::Make` methods, but they lack the Status return type to indicate failure, the consumer can at least check for nullptr - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a good candidate to follow I suspect that mis-usage is only going to grow with the number of users and language that binds to C++. I would like to propose a plan to tackle for the 0.14.0 release 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by minimizing output parameters. 2. Refactor normalized factory methods for each core object (ArrayData, ChunkedArray, Column, RecordBatch, Table) - Common naming, I suggest we stay with `Make`. - Common return type, `StatusOr<shared_ptr<T>>` 3. Refactor existing Make methods to use new methods but preserve original signature by losing error message, on top of marking them deprecated. 4. Mark non-validating constructors as deprecated and ideailly make every "dangerous" constructor non-public. We'd give 1-2 release for consumers to stop using the deprecated methods/constructors. François