Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

Andrew Lamb Thu, 27 May 2021 03:45:45 -0700

I am not opposed to a new repo.

However, I believe that the largest barrier to the community really getting
their heads around / evaluating arrow2 is its sheer size. -92k +57k isn't
something I am likely to get my head around in any level of detail until I
actively work with it for a while.


The best way to get community input, I think, is to start the process of
getting arrow2 into arrow-rs via PRs. While splitting it up into multiple
PRs somehow is likely not easy and would require lots more work, starting
to get this work into arrow-rs in smaller chunks would be the ideal outcome
in my opinion.

Therefore, I don't see any benefit to a new repo -- I think a branch in
arrow-rs (or a fork) would work just as well. But again, I am not opposed
to a new repo either.

Andrew

On Wed, May 26, 2021 at 3:47 AM Fernando Herrera <
fernando.j.herr...@gmail.com> wrote:

> Thanks Jorge for the update and the continuous development on a
> safer version of arrow.
>
> I would like to give my support for option 3 as well. IMHO it will give
> arrow2 the exposure it needs to be considered by a wider set of
> users. This exposure will open the possibility to receive more
> participation regarding missing features required to integrate
> arrow2 into datafusion and ballista.
>
> It will also give peace of mind to arrow users that arrow2 will
> follow the apache way, meaning that its development will be stable,
> supported and community driven. I have dealt with this issue myself,
> where I have suggested the use of arrow2 for new projects, only
> to be discarded because of the impression that it isn't supported
> by the apache community; even after seeing the advantages the
> project presents.
>
>
> Fernando
>
> On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to offer an update on this. I continued to work heavily on
> > this hypothesis on separate repos [1, 2], as this required a ground-up
> > refactor.
> >
> > Following is the current status, and at the end some options that we can
> > think about.
> >
> > TL;DR: I would like to gauge the community's interest in making arrow2 and
> > parquet2 experimental repos in Apache Arrow. IMO they are safer, faster,
> > more maintainable and equally compatible with both the arrow spec and
> > parquet spec.
> >
> > # Specification and interoperability
> >
> > IPC:
> >
> > All integration tests that 4.1.0 runs pass against apache/arrow@master.
> > Furthermore, it passes the following tests that 4.1.0 does not:
> > * big endian IPC producer and consumer
> > * decimal128
> > * interval
> > * nested_large_offsets
> >
> > FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.
> >
> > Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0
> > are passing or are skipped.
> >
> > parquet: arrow2 tests against parquet files generated by pyarrow under
> > different configurations:
> > * physical and logical types
> > * page versions
> > * repetition levels
> > * dictionary encoding
> >
> > # Safety
> >
> > * arrow2 addresses all security vulnerabilities (all our 20+ issues labeled
> > with "security" [3] and more that are currently not tracked in any issue)
> > and unsafety issues. In particular,
> > * all tests pass under MIRI checks
> > * all unsafe APIs are marked as unsafe
> > * parquet2 does not use unsafe
> >
> > # Maintenance
> >
> > * arrow + parquet has a total of 56k+36k LOC, excluding headers
> > * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
> > * arrow2 coverage is 76%, arrow is 88%
> > * parquet2 is "unsafe"-free
> >
> > # Features
> >
> > Non-spec wise (e.g. compute, utils, parquet), the crate has about 90% of
> > all the features in 4.0. What is missing:
> > * nested read and write parquet (lists and structs)
> > * some new features added since 4.0.0
> >
> > OTOH, it has the following additional features:
> >
> > * API to read CSV in parallel
> > * API to read parquet in parallel
> > * checked_X, saturating_X, overflowing_X operations (i.e. non-panic
> > versions of add, subtract, etc.)
> > * arithmetic ops over dates, timestamps and durations
> > * display for every logical type
> > * more cast operations
> > * merge-sort kernel
> > * vectorized hashing
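
To illustrate the checked_X, saturating_X and overflowing_X item in the list above, here is a minimal sketch over plain slices that relies only on the Rust standard library's integer methods. The function names and slice-based signatures are assumptions made for this example and are not arrow2's actual API.

```rust
/// Element-wise checked addition: `None` marks a slot whose sum overflowed,
/// much like a null slot would in an array.
fn checked_add(lhs: &[i32], rhs: &[i32]) -> Vec<Option<i32>> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal length");
    lhs.iter().zip(rhs).map(|(a, b)| a.checked_add(*b)).collect()
}

/// Element-wise saturating addition: overflow clamps to i32::MIN or i32::MAX.
fn saturating_add(lhs: &[i32], rhs: &[i32]) -> Vec<i32> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal length");
    lhs.iter().zip(rhs).map(|(a, b)| a.saturating_add(*b)).collect()
}

/// Element-wise overflowing addition: returns the wrapped sums together with
/// one flag per slot telling whether that slot overflowed.
fn overflowing_add(lhs: &[i32], rhs: &[i32]) -> (Vec<i32>, Vec<bool>) {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal length");
    lhs.iter().zip(rhs).map(|(a, b)| a.overflowing_add(*b)).unzip()
}
```

None of the three variants panics on overflow; they differ only in how the overflowing slot is reported (as a missing value, a clamped value, or a wrapped value plus a flag).
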
> >
> > # Performance
> >
> > * 3-15x faster reading and writing parquet, and APIs to read them in
> > parallel (see [4])
> > * faster IPC read
> > * arithmetic and the like are about the same performance as arrow 4.0
> > compiled with SIMD (available in nightly), ~1.5x faster without SIMD and
> > nightly.
> > * Some kernels degrade by about 20% due to bounds checks (e.g. boolean
> > "take" in arrow 4.1 allows out-of-bounds reads and is thus faster).
> > * Sort and filter ~2x faster. See [5,6]
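
The bounds-check trade-off mentioned in the list above can be sketched with a toy "take" over slices. This is only an illustration under assumed names (`take_checked`, `take_unchecked`); it is not the kernel from arrow or arrow2.

```rust
/// Safe take: every index is bounds-checked. Returns `None` if any index is
/// out of range, so an invalid index can never cause an out-of-bounds read.
fn take_checked(values: &[i32], indices: &[usize]) -> Option<Vec<i32>> {
    indices.iter().map(|&i| values.get(i).copied()).collect()
}

/// Unchecked take: skips the per-index bounds check and is therefore marked
/// `unsafe`, which moves the obligation to the caller.
///
/// # Safety
/// Every element of `indices` must be strictly less than `values.len()`.
unsafe fn take_unchecked(values: &[i32], indices: &[usize]) -> Vec<i32> {
    indices
        .iter()
        .map(|&i| unsafe { *values.get_unchecked(i) })
        .collect()
}
```

The checked version pays for one comparison per index. The unchecked version avoids that cost, but it is only sound when the function itself is marked `unsafe` and callers uphold the documented precondition.
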
> >
> > # Interoperability with DataFusion
> >
> > I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same
> > result and has about the same performance (datafusion can perform
> > out-of-bounds reads from arrow...); without group bys, it is ~2x faster.
> >
> > # Process / Community
> >
> > This is a "big chunk of code" type of situation developed in an external
> > repo. I tried to keep folks informed of the status and what was being done;
> > the mono-repo at the time was really difficult to cope with. With this
> > said:
> >
> > * I proposed an experimental repo mechanism so that we can conduct this
> > type of activity (now merged [8])
> > * I am not merging PRs that introduce new major APIs, so that the community
> > can weigh in
> >
> > # Licensing
> >
> > All code is licensed under MIT and Apache; contributors are required to
> > accept any of these as part of their contributions. I can work to handle
> > this part with the incubator / ASF ahead of any potential code movement.
> >
> > -----------------
> >
> > Main question: what do we do?
> >
> > Some ideas:
> >
> > 1. PR all this to the arrow-rs repo
> > 2. push this to a branch in arrow-rs
> > 3. move this to an experimental repo within ASF and work on it until we
> > have feature parity (e.g. read and write nested types to/from parquet), and
> > then apply 1 or 2
> > 4. do nothing
> >
> > Concerns with option 1:
> > * development will continue to happen outside ASF
> > * no easy way to collaborate: issue tracking for this code is outside ASF
> > * no easy way for the community to weigh in on changes to the API prior
> > to merge
> > Concerns with option 2:
> > * issue tracking about branches is confusing, especially in creating change
> > logs
> > * PRs to branches are confusing
> > * no easy way for the community to weigh in on changes to the API prior
> > to merge
> > Concerns with option 3:
> > * it would be the first experimental repo, thus some risk
> > Concerns with option 4:
> > * for the time being this would continue to be a project independent of
> > Apache Arrow
> > * I would release 0.1 to crates.io as "arrow-safe" or something, as there is
> > demand for it.
> >
> > I would be in favor of option 3 for the following reason: I do not think it
> > is useful to PR a -92k +57k change without giving the community extensive
> > time to evaluate, contribute, and time for a proper discussion. In this
> > context, my idea here was to give some time for the ideas to mature within
> > ASF, and only then even consider a switch.
> >
> > Thanks,
> > Jorge
> >
> > [1] https://github.com/jorgecarleitao/arrow2
> > [2] https://github.com/jorgecarleitao/parquet2
> > [3] https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
> > [4] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
> > [5] https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
> > [6] https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
> > [7] https://github.com/apache/arrow-datafusion/pull/68
> > [8] https://issues.apache.org/jira/browse/ARROW-12643
> >
> >
> > On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Over the past 4 months, I have been growing more and more frustrated by
> > > the amount of undefined behaviour that I am finding and fixing in the Rust
> > > implementation. I would like to open a broader discussion of the problem
> > > in light of our current knowledge and what Rust enables, as well as offer
> > > a solution to the bigger problem.
> > >
> > > Just to give you a gist of the seriousness of the issue, the following
> > > currently compiles, runs, and is undefined behavior in Rust:
> > >
> > > let buffer = Buffer::from(&[0i32, 2i32]);
> > > let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> > > let array = Float64Array::from(Arc::new(data));
> > > println!("{:?}", array.value(1));
> > >
> > > I would like to propose a major refactor of the crate around physical
> > > traits, Buffer, MutableBuffer and ArrayData to make our code type-safe at
> > > compile time, thereby preventing things like the example above from
> > > happening again.
> > >
> > > So far, I was able to reproduce all core features of the arrow crate
> > > (nested types, dynamic typing, FFI, memory alignment, performance) by using
> > > `Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and
> > > RawPointer altogether.
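
A minimal sketch of why a `Buffer<T: NativeType>` closes the hole shown in the snippet above: once the element type is part of the buffer's Rust type, an array of one type cannot be built over a buffer of another. Everything below (the trait body, `from_slice`, `PrimitiveArray`) is a simplified assumption for illustration and not the actual arrow2 design.

```rust
use std::sync::Arc;

/// Hypothetical marker trait for fixed-width physical types.
trait NativeType: Copy + std::fmt::Debug + 'static {}
impl NativeType for i32 {}
impl NativeType for i64 {}
impl NativeType for f64 {}

/// Shared, immutable buffer whose element type is tracked by the compiler.
#[derive(Debug, Clone)]
struct Buffer<T: NativeType> {
    data: Arc<[T]>,
}

impl<T: NativeType> Buffer<T> {
    fn from_slice(values: &[T]) -> Self {
        Buffer { data: Arc::from(values) }
    }

    fn len(&self) -> usize {
        self.data.len()
    }

    /// Bounds-checked access; never reads past the end of the buffer.
    fn get(&self, index: usize) -> Option<T> {
        self.data.get(index).copied()
    }
}

/// Array over a typed buffer: length and element type both come from the
/// buffer, so they cannot disagree with the underlying memory.
#[derive(Debug, Clone)]
struct PrimitiveArray<T: NativeType> {
    values: Buffer<T>,
}

impl<T: NativeType> PrimitiveArray<T> {
    fn new(values: Buffer<T>) -> Self {
        PrimitiveArray { values }
    }

    fn len(&self) -> usize {
        self.values.len()
    }

    fn value(&self, index: usize) -> Option<T> {
        self.values.get(index)
    }
}

// The analogue of the snippet above is rejected at compile time:
// let bad: PrimitiveArray<f64> = PrimitiveArray::new(Buffer::from_slice(&[0i32, 2i32]));
// error[E0308]: mismatched types
```

In this sketch there is no way to claim a length of 10 over two `i32` values or to read them as `f64`, because neither the length nor the type is supplied separately from the data.
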
> > >
> > > Safety-wise, it significantly limits the usage of `unsafe` in higher-level
> > > APIs, it has a single transmute (the bit chunk iterator one), and a
> > > guaranteed safe public API (which is not the case in our master, as shown
> > > above).
> > >
> > > Performance wise, it yields a 1.3x improvement over the current master
> > > (after this fix <https://github.com/apache/arrow/pull/9301> of UB on the
> > > take kernel, 1.7x prior to it) for the `take` kernel for primitives. I
> > > should have other major performance improvements.
> > >
> > > API wise, it simplifies the traits that we have for memory layout as well
> > > as the handling of bitmaps, offsets, etc.
> > >
> > > The proposal is drafted as a README
> > > <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a
> > > repo that I created specifically for this from the ground up, and the full
> > > set of changes is in a PR
> > > <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can view
> > > and comment on it. I haven't made any PR to master because this is too
> > > large to track as a diff against master, and that is beside the point
> > > anyway.
> > >
> > > I haven't ported most of the crate as I only tried the non-trivial
> > > features (memory layout, bitmaps, FFI, dynamic typing, nested types).
> > >
> > > I would highly appreciate your thoughts about it.
> > >
> > > Best,
> > > Jorge
> > >
> > >
> >
>
