I think that, given the size and scope of the work, there's a stronger argument for having an IP clearance for this code (as compared with python-datafusion).
On Thu, May 27, 2021 at 5:45 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> I am not opposed to a new repo.
>
> However, I believe that the largest barrier to the community really getting their heads around / evaluating arrow2 is its sheer size. A -92k +57k diff isn't something I am likely to get my head around in any level of detail until I actively work with it for a while.
>
> The best way to get community input, I think, is to start the process of getting arrow2 into arrow-rs via PRs. While splitting it up into multiple PRs somehow is likely not easy and would require lots more work, starting to get this work into arrow-rs in smaller chunks would be the ideal outcome in my opinion.
>
> Therefore, I don't see any benefit to a new repo -- I think a branch in arrow-rs (or a fork) would work just as well. But again, I am not opposed to a new repo either.
>
> Andrew
>
> On Wed, May 26, 2021 at 3:47 AM Fernando Herrera <fernando.j.herr...@gmail.com> wrote:
> >
> > Thanks Jorge for the update and the continuous development of a safer version of arrow.
> >
> > I would like to give my support to option 3 as well. IMHO it will give arrow2 the exposure it needs to be considered by a wider set of users. This exposure will open the possibility of receiving more participation on the missing features required to integrate arrow2 with datafusion and ballista.
> >
> > It will also give arrow users peace of mind that arrow2 will follow the Apache way, meaning that its development will be stable, supported and community driven. I have dealt with this issue myself: I have suggested the use of arrow2 for new projects, only for it to be discarded because of the impression that it isn't supported by the Apache community, even after seeing the advantages the project presents.
> >
> > Fernando
> >
> > On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I would like to offer an update on this. I continued to work heavily on this hypothesis in separate repos [1, 2], as it required a ground-up refactor.
> > >
> > > Following is the current status, and at the end some options that we can think about.
> > >
> > > TL;DR: I would like to gauge the community's interest in making arrow2 and parquet2 experimental repos in Apache Arrow. IMO they are safer, faster, more maintainable and equally compatible with both the arrow spec and the parquet spec.
> > >
> > > # Specification and interoperability
> > >
> > > IPC:
> > > All integration tests that 4.1.0 runs pass against apache/arrow@master. Furthermore, it passes the following tests that 4.1.0 does not:
> > > * big endian IPC producer and consumer
> > > * decimal128
> > > * interval
> > > * nested_large_offsets
> > >
> > > FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.
> > >
> > > Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0 are passing or are skipped.
> > >
> > > parquet: arrow2 tests against parquet files generated by pyarrow under different configurations:
> > > * physical and logical types
> > > * page versions
> > > * repetition levels
> > > * dictionary encoding
> > >
> > > # Safety
> > >
> > > * arrow2 addresses all security vulnerabilities (all our 20+ issues labeled with "security" [3] and more currently not encapsulated in any issue) and unsafety issues.
> > > In particular:
> > > * all tests pass under MIRI checks
> > > * all unsafe APIs are marked as unsafe
> > > * parquet2 does not use unsafe
> > >
> > > # Maintenance
> > >
> > > * arrow + parquet has a total of 56k+36k LOC, excluding headers
> > > * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
> > > * arrow2 coverage is 76%, arrow's is 88%
> > > * parquet2 is "unsafe"-free
> > >
> > > # Features
> > >
> > > Non-spec wise (e.g. compute, utils, parquet), the crate has about 90% of all the features in 4.0. What is missing:
> > > * nested parquet read and write (lists and structs)
> > > * some new features added since 4.0.0
> > >
> > > OTOH, it has the following additional features:
> > > * API to read CSV in parallel
> > > * API to read parquet in parallel
> > > * checked_X, saturating_X, overflowing_X operations (i.e. non-panicking versions of add, subtract, etc.)
> > > * arithmetic ops over dates, timestamps and durations
> > > * display for every logical type
> > > * more cast operations
> > > * merge-sort kernel
> > > * vectorized hashing
> > >
> > > # Performance
> > >
> > > * 3-15x faster reading and writing parquet, plus APIs to read it in parallel (see [4])
> > > * faster IPC read
> > > * arithmetics and the like have about the same performance as arrow 4.0 compiled with SIMD (available in nightly), and are ~1.5x faster without SIMD and nightly
> > > * some kernels degrade by about 20% due to bound checks (e.g. boolean "take" in arrow 4.1 allows out-of-bound reads and is thus faster)
> > > * sort and filter are ~2x faster; see [5, 6]
> > >
> > > # Interoperability with DataFusion
> > >
> > > I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same result and has about the same performance (datafusion can perform out-of-bound reads from arrow...); without group bys, it is ~2x faster.
> > >
> > > # Process / Community
> > >
> > > This is a "big chunk of code" type of situation developed in an external repo. I tried to keep folks informed of the status and what was being done; the mono-repo at the time was really difficult to cope with. With this said:
> > >
> > > * I proposed an experimental repo mechanism so that we can conduct this type of activity (now merged [8])
> > > * I am not merging PRs that introduce major new APIs, so that the community can weigh in
> > >
> > > # Licensing
> > >
> > > All code is licensed under MIT and Apache; contributors are required to accept either of these as part of their contributions. I can work to handle this part with the incubator / ASF ahead of any potential code movement.
> > >
> > > -----------------
> > >
> > > Main question: what do we do?
> > >
> > > Some ideas:
> > >
> > > 1. PR all this to the arrow-rs repo
> > > 2. push this to a branch in arrow-rs
> > > 3. move this to an experimental repo within ASF, work on it until we have feature parity (e.g. read and write nested types to/from parquet), and then apply 1 or 2
> > > 4. do nothing
> > >
> > > Concerns with option 1:
> > > * development will continue to happen outside ASF
> > > * no easy way to collaborate: issue tracking for this code is outside ASF
> > > * no easy way for the community to weigh in on changes to the API prior to merge
> > >
> > > Concerns with option 2:
> > > * issue tracking against branches is confusing, especially when creating changelogs
> > > * PRs against branches are confusing
> > > * no easy way for the community to weigh in on changes to the API prior to merge
> > >
> > > Concerns with option 3:
> > > * it would be the first experimental repo, thus some risk
> > >
> > > Concerns with option 4:
> > > * for the time being this would continue to be a project independent of Apache Arrow
> > > * I would release 0.1 to crates.io as "arrow-safe" or something, as there is demand for it
> > >
> > > I would be in favor of option 3 for the following reason: I do not think it is useful to PR a -92k +57k change without giving the community extensive time to evaluate, contribute, and have a proper discussion. In this context, my idea here was to give the ideas some time to mature within ASF, and only then even consider a switch.
> > >
> > > Thanks,
> > > Jorge
> > >
> > > [1] https://github.com/jorgecarleitao/arrow2
> > > [2] https://github.com/jorgecarleitao/parquet2
> > > [3] https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
> > > [4] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
> > > [5] https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
> > > [6] https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
> > > [7] https://github.com/apache/arrow-datafusion/pull/68
> > > [8] https://issues.apache.org/jira/browse/ARROW-12643
> > >
> > > On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Over the past 4 months, I have been growing more and more frustrated by the amount of undefined behaviour that I am finding and fixing in the Rust implementation. I would like to open a broader discussion about the problem in light of our current knowledge and what Rust enables, as well as offer a solution to the bigger problem.
> > > >
> > > > Just to give you a gist of the seriousness of the issue, the following currently compiles, runs, and is undefined behavior in Rust:
> > > >
> > > > let buffer = Buffer::from(&[0i32, 2i32]);
> > > > let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> > > > let array = Float64Array::from(Arc::new(data));
> > > > println!("{:?}", array.value(1));
> > > >
> > > > I would like to propose a major refactor of the crate around the physical traits, Buffer, MutableBuffer and ArrayData, to make our code type-safe at compile time, thereby preventing things like the example above from happening again.
> > > >
> > > > So far, I was able to reproduce all core features of the arrow crate (nested types, dynamic typing, FFI, memory alignment, performance) by using `Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and RawPointer altogether.
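[Editor's note: a minimal sketch of what such a typed-buffer design could look like, with hypothetical, simplified names rather than the actual arrow2 API. The point it illustrates is that, once the native type is a type parameter of the buffer, the mismatched i32/Float64 construction from the example above becomes a compile-time error instead of undefined behavior.]

    // Hypothetical, simplified sketch of a typed buffer; not the actual arrow2 API.
    use std::fmt::Debug;

    /// Stand-in for a `NativeType`-style trait: the fixed-size primitive types.
    trait NativeType: Copy + Debug {}
    impl NativeType for i32 {}
    impl NativeType for f64 {}

    /// A buffer whose element type is known at compile time.
    #[derive(Debug)]
    struct Buffer<T: NativeType> {
        data: Vec<T>,
    }

    impl<T: NativeType> Buffer<T> {
        fn from_slice(slice: &[T]) -> Self {
            Buffer { data: slice.to_vec() }
        }

        /// Bound-checked access: no out-of-bounds reads are possible from safe code.
        fn get(&self, i: usize) -> Option<&T> {
            self.data.get(i)
        }
    }

    /// An array here is just a typed buffer (a real design would also carry
    /// validity bitmaps, offsets, etc.).
    #[derive(Debug)]
    struct PrimitiveArray<T: NativeType> {
        values: Buffer<T>,
    }

    fn main() {
        let buffer = Buffer::from_slice(&[0i32, 2i32]);
        // This is a PrimitiveArray<i32>; building a PrimitiveArray<f64> from it
        // would not compile, unlike the untyped ArrayData construction above.
        let array = PrimitiveArray { values: buffer };
        println!("{:?}", array.values.get(1));
    }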
> > > > Safety-wise, it significantly limits the usage of `unsafe` in the higher-level APIs, it has a single transmute (the bit-chunk iterator one), and a guaranteed safe public API (which is not the case in our master, as shown above).
> > > >
> > > > Performance-wise, it yields a 1.3x improvement over the current master (after this fix <https://github.com/apache/arrow/pull/9301> of UB on the take kernel; 1.7x prior to it) for the `take` kernel for primitives. I expect other major performance improvements as well.
> > > >
> > > > API-wise, it simplifies the traits that we have for the memory layout as well as the handling of bitmaps, offsets, etc.
> > > >
> > > > The proposal is drafted as a README <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a repo that I created specifically for this from the ground up, and the full set of changes is in a PR <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can view and comment on it. I haven't made any PR to master because this is too large to track as a diff against master, and that is beside the point anyway.
> > > >
> > > > I haven't ported most of the crate, as I only tried the non-trivial features (memory layout, bitmaps, FFI, dynamic typing, nested types).
> > > >
> > > > I would highly appreciate your thoughts about it.
> > > >
> > > > Best,
> > > > Jorge
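[Editor's note: the "guaranteed safe public API" point above — bound-checked access by default, with any unchecked path only reachable through a function explicitly marked `unsafe` — could look roughly like the following. This is a hypothetical sketch with illustrative names and signatures, not the actual crate API.]

    // Hypothetical sketch of "safe by default" accessors; not the actual arrow2 API.
    struct Int32Array {
        values: Vec<i32>,
    }

    impl Int32Array {
        /// Safe, bound-checked access: an out-of-range index panics instead of
        /// reading arbitrary memory.
        fn value(&self, i: usize) -> i32 {
            self.values[i]
        }

        /// Unchecked access still exists for hot loops, but it is marked `unsafe`
        /// so callers must explicitly opt in and uphold the invariant.
        ///
        /// # Safety
        /// `i` must be less than `self.values.len()`.
        unsafe fn value_unchecked(&self, i: usize) -> i32 {
            *self.values.get_unchecked(i)
        }
    }

    fn main() {
        let array = Int32Array { values: vec![1, 2, 3] };
        println!("{}", array.value(1)); // checked
        // SAFETY: index 2 is within bounds (len == 3).
        println!("{}", unsafe { array.value_unchecked(2) });
    }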