Thanks, Wes.

"By the way, we need more help with systematic and automated
benchmarking so we can use commit-by-commit numbers to assist in our
decision making."

I can certainly help with that.

Pretty stretched right now, but I will start thinking about it, maybe
adding some design thoughts.

I will start with the JIRA, and if you can, please add your thoughts to it,
what you think should be benchmarked and how.

On Sun, Mar 10, 2019 at 11:20 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Edmon,
>
> Here's an example of a function that does some schema validation:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.cc#L450
>
> The issue is less about the magnitude of the cost and more of a
> software engineering question about layering of concerns. Consider two
> code paths:
>
> * Path A: Eventually constructs tables from input known to be valid
> (validated by some other procedure)
> * Path B: Constructs tables for user input
>
> I don't believe that Path A should have to pay for the additional
> validation requirements that Path B has. Overly defensive coding
> practices can also create a false sense of security re: unit testing
> edge cases.
>
> By the way, we need more help with systematic and automated
> benchmarking so we can use commit-by-commit numbers to assist in our
> decision making.
>
> - Wes
>
> On Sun, Mar 10, 2019 at 6:29 PM Edmon Begoli <ebeg...@berkeley.edu> wrote:
> >
> > Do you guys have an example somewhere of this validated vs. unvalidated
> > code, and suspected performance impacts, and has anyone benchmarked any
> of
> > this?
> >
> >
> >
> > On Sun, Mar 10, 2019 at 5:45 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > I think having consistent methods for both validated and unvalidated
> > > construction is a good idea. Being fairly passionate about
> > > microperformance, I don't think we should penalize responsible users
> > > of unsafe/unvalidated APIs (e.g. by taking them away and replacing
> > > them with variants featuring unavoidable computation), this can
> > > partially be handled through developer documentation (which, ah, we
> > > will need to write more of)
> > >
> > > On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com
> >
> > > wrote:
> > > >
> > > > I agree there should always be a path to avoid the validation but I
> > > think there should also be an easy way to have validation included and
> a
> > > clear way to tell the difference.  IMO, having strong naming
> convention so
> > > callers can tell the difference, and code reviewers can focus more on
> less
> > > safe method usage, is important.  I will help new-comers to the project
> > > write safer code.  Which can either be refactored or called out in code
> > > review for performance issues.  It also provides a cue for all
> developers
> > > to consider if they are meeting the necessary requirements when using
> less
> > > safe methods.
> > > >
> > > > A straw-man proposal for naming conventions:
> > > > - Constructors are always unvalidated (should still have appropriate
> > > DCHECKs)
> > > > - "Make" calls are always unvalidated (should still have appropriate
> > > DHCECKs)
> > > > - "MakeValidated" ensures proper structural validation occur (but not
> > > data is validation).
> > > > - "MakeSanitized" ensures proper structural and data is validations
> occur
> > > >
> > > > As noted above it might only pay to refactor a small amount of
> current
> > > usage to the safer APIs.
> > > >
> > > > We could potentially go even further down the rabbit hole and try to
> > > define standard for a Hungarian notation [1] to make it more obvious
> what
> > > invariants are expected for a particular data-structure variable (I'm
> > > actually -.5 on this).
> > > >
> > > > As a personal bias, I would rather have slower code that has lower
> risk
> > > of crashing in production than faster code that does.  Obviously,
> there is
> > > a tradeoff here, and the ideal is faster code that won't segfault.
> > > >
> > > > Thoughts?
> > > >
> > > > -Micah
> > > >
> > > > [1]
> > >
> https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/
> > > >
> > > > On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >>
> > > >> hi folks,
> > > >>
> > > >> I think some issues are being conflated here, so let me try to dig
> > > >> through them. Let's first look at the two cited bugs that were
> fixed,
> > > >> if I have this right:
> > > >>
> > > >> * ARROW-4766: root cause dereferencing a null pointer
> > > >> * ARROW-4774: root cause unsanitized Python user input
> > > >>
> > > >> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
> > > >> since we currently allow for null buffers (the object, not the
> pointer
> > > >> inside) in ArrayData. This has been discussed on the mailing list in
> > > >> the past; "sanitizing" ArrayData to be free of null objects would be
> > > >> expensive and my general attitude in the C++ library is that we
> should
> > > >> not be in the business of paying extra CPU cycles for the 1-5% case
> > > >> when it is unneeded in the 95-99% of cases. We have DCHECK
> assertions
> > > >> to check these issues in debug builds while avoiding the costs in
> > > >> release builds. In the case of checking edge cases in computational
> > > >> kernels, suffice to say that we should check every kernel on
> length-0
> > > >> input with null buffers to make sure this case is properly handled
> > > >>
> > > >> In the case of ARROW-4774, we should work at the language binding
> > > >> interface to make sure we have convenient "validating" constructors
> > > >> that check user input for common problems. This can prevent the
> > > >> duplication of this code across the binding layers (GLib, Python, R,
> > > >> MATLAB, etc.)
> > > >>
> > > >>  On the specific 4 steps mentioned by Francois, here are my
> thoughts:
> > > >>
> > > >> 1. Having StatusOr would be useful, but this is a programming
> > > convenience
> > > >>
> > > >> 2. There are a couple purposes of the static factory methods that
> > > >> exist now, like Table::Make and RecordBatch::Make. One of the
> reasons
> > > >> that I added them initially was because of the implicit constructor
> > > >> behavior of std::vector inside a call to std::make_shared. If you
> have
> > > >> a std::vector<T> argument in a class's constructor, then
> > > >> std::make_shared<Klass>(..., {...}) will not result in the
> initializer
> > > >> list constructing the std::vector<T>. This means some awkwardness
> like
> > > >> having to assign a std::vector<T> lvalue and _then_ pass that to
> > > >> std::make_shared<Klass>(..., vector_arg, ...).
> > > >>
> > > >> I do not agree with refactoring these methods to use "validating"
> > > >> constructors. Users of these C++ APIs should know what their
> > > >> requirements are, and we provide in some cases a Validate() to spend
> > > >> the extra cycles to assert preconditions. Therefore:
> > > >>
> > > >> 3. -1 on this
> > > >> 4. -1 also
> > > >>
> > > >> Thanks
> > > >> Wes
> > > >>
> > > >> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > HI François,
> > > >> > This sounds great.  I would hope that as part of this we document
> the
> > > >> > invariants (and possible sharp edges like zero length/all null
> and no
> > > null
> > > >> > Arrays).
> > > >> >
> > > >> > Is your intent to allow languages binding to the C++ library go
> > > through the
> > > >> > new API or will they still be able to use the "dangerous" ones?
> > > >> >
> > > >> > -Micah
> > > >> >
> > > >> > On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
> > > >> > fsaintjacq...@gmail.com> wrote:
> > > >> >
> > > >> > > Greetings,
> > > >> > >
> > > >> > > I noted that the current C++ API permits constructing core
> objects
> > > breaking
> > > >> > > said classes invariants. The following recent issues were
> affected
> > > by this:
> > > >> > >
> > > >> > > - ARROW-4766: segfault due to invalid ArrayData with nullptr
> buffer
> > > >> > > - ARROW-4774: segfault due to invalid Table with columns of
> > > different
> > > >> > > length
> > > >> > >
> > > >> > > Consumers often assumes that the objects respect the invariants,
> > > e.g. by
> > > >> > > dereferencing `array_data->buffers[i]->data()` or
> > > >> > > `Array::GetValue(table.n_rows - 1)` .
> > > >> > >
> > > >> > > Sample of classes which requires invariant in the constructor:
> > > >> > > - ArrayData/Array: number and size of buffers depending on type
> > > >> > > - ChunkedArray: Arrays of same type
> > > >> > > - Column: same as ChunkedArray and Field must match array's type
> > > >> > > - RecordBatch: number of columns and schema must match, columns
> of
> > > same
> > > >> > > size
> > > >> > > - Table: columns must be of same size
> > > >> > >
> > > >> > > Some classes provide static factory methods, notably:
> > > >> > > - Most `shared_ptr<T> *::Make` methods, but they lack the Status
> > > return
> > > >> > > type to indicate
> > > >> > >   failure, the consumer can at least check for nullptr
> > > >> > > - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is
> a
> > > good
> > > >> > > candidate to follow
> > > >> > >
> > > >> > > I suspect that mis-usage is only going to grow with the number
> of
> > > users and
> > > >> > > language that binds to C++. I would like to propose a plan to
> > > tackle for
> > > >> > > the
> > > >> > > 0.14.0 release
> > > >> > >
> > > >> > > 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by
> > > minimizing
> > > >> > >    output parameters.
> > > >> > > 2. Refactor normalized factory methods for each core object
> > > (ArrayData,
> > > >> > >    ChunkedArray, Column, RecordBatch, Table)
> > > >> > >     - Common naming, I suggest we stay with `Make`.
> > > >> > >     - Common return type, `StatusOr<shared_ptr<T>>`
> > > >> > > 3. Refactor existing Make methods to use new methods but
> preserve
> > > original
> > > >> > >    signature by losing error message, on top of marking them
> > > deprecated.
> > > >> > > 4. Mark non-validating constructors as deprecated and ideailly
> make
> > > every
> > > >> > > "dangerous" constructor non-public.
> > > >> > >
> > > >> > > We'd give 1-2 release for consumers to stop using the deprecated
> > > >> > > methods/constructors.
> > > >> > >
> > > >> > > François
> > > >> > >
> > >
>

Reply via email to