Do you guys have an example somewhere of this validated vs. unvalidated code, and suspected performance impacts, and has anyone benchmarked any of this?
On Sun, Mar 10, 2019 at 5:45 PM Wes McKinney <wesmck...@gmail.com> wrote: > I think having consistent methods for both validated and unvalidated > construction is a good idea. Being fairly passionate about > microperformance, I don't think we should penalize responsible users > of unsafe/unvalidated APIs (e.g. by taking them away and replacing > them with variants featuring unavoidable computation), this can > partially be handled through developer documentation (which, ah, we > will need to write more of) > > On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > I agree there should always be a path to avoid the validation but I > think there should also be an easy way to have validation included and a > clear way to tell the difference. IMO, having strong naming convention so > callers can tell the difference, and code reviewers can focus more on less > safe method usage, is important. I will help new-comers to the project > write safer code. Which can either be refactored or called out in code > review for performance issues. It also provides a cue for all developers > to consider if they are meeting the necessary requirements when using less > safe methods. > > > > A straw-man proposal for naming conventions: > > - Constructors are always unvalidated (should still have appropriate > DCHECKs) > > - "Make" calls are always unvalidated (should still have appropriate > DHCECKs) > > - "MakeValidated" ensures proper structural validation occur (but not > data is validation). > > - "MakeSanitized" ensures proper structural and data is validations occur > > > > As noted above it might only pay to refactor a small amount of current > usage to the safer APIs. > > > > We could potentially go even further down the rabbit hole and try to > define standard for a Hungarian notation [1] to make it more obvious what > invariants are expected for a particular data-structure variable (I'm > actually -.5 on this). > > > > As a personal bias, I would rather have slower code that has lower risk > of crashing in production than faster code that does. Obviously, there is > a tradeoff here, and the ideal is faster code that won't segfault. > > > > Thoughts? > > > > -Micah > > > > [1] > https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/ > > > > On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmck...@gmail.com> > wrote: > >> > >> hi folks, > >> > >> I think some issues are being conflated here, so let me try to dig > >> through them. Let's first look at the two cited bugs that were fixed, > >> if I have this right: > >> > >> * ARROW-4766: root cause dereferencing a null pointer > >> * ARROW-4774: root cause unsanitized Python user input > >> > >> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT > >> since we currently allow for null buffers (the object, not the pointer > >> inside) in ArrayData. This has been discussed on the mailing list in > >> the past; "sanitizing" ArrayData to be free of null objects would be > >> expensive and my general attitude in the C++ library is that we should > >> not be in the business of paying extra CPU cycles for the 1-5% case > >> when it is unneeded in the 95-99% of cases. We have DCHECK assertions > >> to check these issues in debug builds while avoiding the costs in > >> release builds. In the case of checking edge cases in computational > >> kernels, suffice to say that we should check every kernel on length-0 > >> input with null buffers to make sure this case is properly handled > >> > >> In the case of ARROW-4774, we should work at the language binding > >> interface to make sure we have convenient "validating" constructors > >> that check user input for common problems. This can prevent the > >> duplication of this code across the binding layers (GLib, Python, R, > >> MATLAB, etc.) > >> > >> On the specific 4 steps mentioned by Francois, here are my thoughts: > >> > >> 1. Having StatusOr would be useful, but this is a programming > convenience > >> > >> 2. There are a couple purposes of the static factory methods that > >> exist now, like Table::Make and RecordBatch::Make. One of the reasons > >> that I added them initially was because of the implicit constructor > >> behavior of std::vector inside a call to std::make_shared. If you have > >> a std::vector<T> argument in a class's constructor, then > >> std::make_shared<Klass>(..., {...}) will not result in the initializer > >> list constructing the std::vector<T>. This means some awkwardness like > >> having to assign a std::vector<T> lvalue and _then_ pass that to > >> std::make_shared<Klass>(..., vector_arg, ...). > >> > >> I do not agree with refactoring these methods to use "validating" > >> constructors. Users of these C++ APIs should know what their > >> requirements are, and we provide in some cases a Validate() to spend > >> the extra cycles to assert preconditions. Therefore: > >> > >> 3. -1 on this > >> 4. -1 also > >> > >> Thanks > >> Wes > >> > >> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> > > >> > HI François, > >> > This sounds great. I would hope that as part of this we document the > >> > invariants (and possible sharp edges like zero length/all null and no > null > >> > Arrays). > >> > > >> > Is your intent to allow languages binding to the C++ library go > through the > >> > new API or will they still be able to use the "dangerous" ones? > >> > > >> > -Micah > >> > > >> > On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques < > >> > fsaintjacq...@gmail.com> wrote: > >> > > >> > > Greetings, > >> > > > >> > > I noted that the current C++ API permits constructing core objects > breaking > >> > > said classes invariants. The following recent issues were affected > by this: > >> > > > >> > > - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer > >> > > - ARROW-4774: segfault due to invalid Table with columns of > different > >> > > length > >> > > > >> > > Consumers often assumes that the objects respect the invariants, > e.g. by > >> > > dereferencing `array_data->buffers[i]->data()` or > >> > > `Array::GetValue(table.n_rows - 1)` . > >> > > > >> > > Sample of classes which requires invariant in the constructor: > >> > > - ArrayData/Array: number and size of buffers depending on type > >> > > - ChunkedArray: Arrays of same type > >> > > - Column: same as ChunkedArray and Field must match array's type > >> > > - RecordBatch: number of columns and schema must match, columns of > same > >> > > size > >> > > - Table: columns must be of same size > >> > > > >> > > Some classes provide static factory methods, notably: > >> > > - Most `shared_ptr<T> *::Make` methods, but they lack the Status > return > >> > > type to indicate > >> > > failure, the consumer can at least check for nullptr > >> > > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a > good > >> > > candidate to follow > >> > > > >> > > I suspect that mis-usage is only going to grow with the number of > users and > >> > > language that binds to C++. I would like to propose a plan to > tackle for > >> > > the > >> > > 0.14.0 release > >> > > > >> > > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by > minimizing > >> > > output parameters. > >> > > 2. Refactor normalized factory methods for each core object > (ArrayData, > >> > > ChunkedArray, Column, RecordBatch, Table) > >> > > - Common naming, I suggest we stay with `Make`. > >> > > - Common return type, `StatusOr<shared_ptr<T>>` > >> > > 3. Refactor existing Make methods to use new methods but preserve > original > >> > > signature by losing error message, on top of marking them > deprecated. > >> > > 4. Mark non-validating constructors as deprecated and ideailly make > every > >> > > "dangerous" constructor non-public. > >> > > > >> > > We'd give 1-2 release for consumers to stop using the deprecated > >> > > methods/constructors. > >> > > > >> > > François > >> > > >