Thanks Jorge for the update and the continued development of a
safer version of arrow.

I would like to voice my support for option 3 as well. IMHO it will give
arrow2 the exposure it needs to be considered by a wider set of
users. This exposure will open the door to more participation on the
missing features required to integrate arrow2 into datafusion and
ballista.

It will also give peace of mind to arrow users that arrow2 will
follow the apache way, meaning that its development will be stable,
supported and community driven. I have run into this issue myself:
I have suggested using arrow2 for new projects, only to have it
dismissed because of the impression that it isn't supported by the
apache community, even after the advantages of the project were
clear.


Fernando

On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> I would like to offer an update on this. I continued to work heavily on
> this hypothesis on separate repos [1, 2], as this required a ground-up
> refactor.
>
> Following is the current status, and at the end some options that we can
> think about.
>
> TL;DR: I would like to gauge the community's interest in making arrow2 and
> parquet2 experimental repos in Apache Arrow. IMO they are safer, faster,
> more maintainable and equally compatible with both the arrow spec and
> parquet spec.
>
> # Specification and interoperability
>
> IPC:
>
> All integration tests that 4.1.0 runs pass against apache/arrow@master.
> Furthermore, it passes the following tests that 4.1.0 does not:
> * big endian IPC producer and consumer
> * decimal128
> * interval
> * nested_large_offsets
>
> FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.
>
> Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0
> pass or are skipped.
>
> parquet: arrow2 tests against parquet files generated by pyarrow under
> different configurations:
> * physical and logical types
> * page versions
> * repetition levels
> * dictionary encoding
>
> # Safety
>
> * arrow2 addresses all known security vulnerabilities (all 20+ issues
> labeled with "security" [3], plus more not yet captured in any issue) and
> unsafety issues. In particular,
> * all tests pass under MIRI checks
> * all unsafe APIs are marked as unsafe
> * parquet2 does not use unsafe
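The "all unsafe APIs are marked as unsafe" point can be sketched in plain Rust. This is an illustrative pattern, not arrow2's actual API: a bound-checked accessor on the safe surface, paired with an explicitly `unsafe` unchecked sibling whose precondition the caller must acknowledge in writing.

```rust
/// Illustrative array type (not arrow2's actual API).
struct PrimitiveArray {
    values: Vec<i64>,
}

impl PrimitiveArray {
    /// Safe accessor: panics on an out-of-bound index instead of
    /// reading arbitrary memory.
    fn value(&self, i: usize) -> i64 {
        self.values[i]
    }

    /// Unchecked accessor: the caller must uphold `i < self.values.len()`,
    /// and the compiler forces them to write `unsafe` to acknowledge it.
    unsafe fn value_unchecked(&self, i: usize) -> i64 {
        *self.values.get_unchecked(i)
    }
}

fn main() {
    let array = PrimitiveArray { values: vec![1, 2, 3] };
    assert_eq!(array.value(1), 2);
    // Safety: 2 < 3, so the unchecked read is in bounds.
    assert_eq!(unsafe { array.value_unchecked(2) }, 3);
    println!("ok");
}
```

The design point is that no combination of safe calls can trigger undefined behavior; UB requires an explicit `unsafe` block that documents its precondition.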
>
> # Maintenance
>
> * arrow + parquet has a total of 56k+36k LOC, excluding headers
> * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
> * arrow2 coverage is 76%, arrow is 88%
> * parquet2 is "unsafe"-free
>
> # Features
>
> Non-spec-wise (e.g. compute, utils, parquet), the crate has about 90% of
> the features in 4.0. What is missing:
> * nested read and write parquet (lists and structs)
> * missing some new features since 4.0.0
>
> OTOH, it has the following additional features:
>
> * API to read CSV in parallel
> * API to read parquet in parallel
> * checked_X, saturating_X, overflowing_X operations (i.e. non-panic
> versions of add, subtract, etc.)
> * arithmetics ops over dates, timestamps and durations
> * display for every logical type
> * more cast operations
> * merge-sort kernel
> * vectorized hashing
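On the checked_X / saturating_X / overflowing_X point: the element-wise semantics can be sketched with std's scalar primitives. The function names here are illustrative, not arrow2's exact kernel API; the point is that overflow is reported or clamped instead of panicking.

```rust
/// `None` marks slots that overflowed, akin to a null in a result array.
fn checked_add(lhs: &[i32], rhs: &[i32]) -> Vec<Option<i32>> {
    lhs.iter().zip(rhs).map(|(a, b)| a.checked_add(*b)).collect()
}

/// Overflow clamps to i32::MAX / i32::MIN instead of wrapping or panicking.
fn saturating_add(lhs: &[i32], rhs: &[i32]) -> Vec<i32> {
    lhs.iter().zip(rhs).map(|(a, b)| a.saturating_add(*b)).collect()
}

fn main() {
    let a = [1, i32::MAX];
    let b = [2, 1];
    assert_eq!(checked_add(&a, &b), vec![Some(3), None]);
    assert_eq!(saturating_add(&a, &b), vec![3, i32::MAX]);
    println!("ok");
}
```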
>
> # Performance
>
> * 3-15x faster reading and writing parquet, and APIs to read it in parallel
> (see [4])
> * faster IPC read
> * arithmetics and the like are about the same performance as arrow 4.0
> compiled with SIMD (available in nightly), and ~1.5x faster without SIMD
> and nightly.
> * Some kernels degrade by about 20% due to bound checks (e.g. boolean
> "take" in arrow 4.1 allows out of bound reads and is thus faster).
> * Sort and filter ~2x faster. See [5,6]
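To illustrate the bound-check trade-off mentioned for the boolean "take" kernel (a hypothetical simplification, not the actual kernel): the checked version validates every index, which costs a branch per element but makes out-of-bound reads impossible.

```rust
/// Gather `values` at `indices`; returns None on any out-of-bound index
/// instead of reading past the end of the slice.
fn take_checked(values: &[bool], indices: &[usize]) -> Option<Vec<bool>> {
    indices.iter().map(|&i| values.get(i).copied()).collect()
}

fn main() {
    let values = [true, false, true];
    assert_eq!(take_checked(&values, &[2, 0]), Some(vec![true, true]));
    // An out-of-bound index is rejected rather than silently read.
    assert_eq!(take_checked(&values, &[3]), None);
    println!("ok");
}
```

An unchecked variant would skip the `get` and index directly, which is where the ~20% gap comes from.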
>
> # Interoperability with DataFusion
>
> I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same
> result and has about the same performance (datafusion can perform
> out-of-bound reads from arrow...); without group-bys, it is ~2x faster.
>
> # Process / Community
>
> This is a "big chunk of code" type of situation developed in an external
> repo. I tried to keep folks informed of the status and what was being done;
> the mono-repo at the time was really difficult to cope with. With this
> said:
>
> * I proposed an experimental repo mechanism so that we can conduct this
> type of activity (now merged [8])
> * I am not merging PRs that introduce major new APIs, so that the community
> can weigh in
>
> # Licensing
>
> All code is licensed under MIT and Apache; contributors are required to
> accept any of these as part of their contributions. I can work to handle
> this part with the incubator / ASF ahead of any potential code movement.
>
> -----------------
>
> Main question: what do we do?
>
> Some ideas:
>
> 1. PR all this to the arrow-rs repo
> 2. push this to a branch in arrow-rs
> 3. move this to an experimental repo within ASF and work on it until we
> have feature parity (e.g. read and write nested types to/from parquet), and
> then apply 1 or 2
> 4. do nothing
>
> Concerns with option 1:
> * development will continue to happen outside ASF
> * no easy way to collaborate: issue tracking for this code outside ASF
> * no easy way for the community to weigh in on changes to the API prior
> to merge
> Concerns with option 2:
> * issue tracking across branches is confusing, especially when creating
> change logs
> * PRs to branches are confusing
> * no easy way for the community to weigh in on changes to the API prior
> to merge
> Concerns with option 3:
> * it would be the first experimental repo, thus some risk
> Concerns with option 4:
> * for the time being this would continue to be a project independent of
> Apache Arrow
> * I would release 0.1 to crates.io as "arrow-safe" or something similar,
> as there is demand for it
>
> I would be in favor of option 3 for the following reason: I do not think it
> is useful to PR a -92k +57k change without giving the community extensive
> time to evaluate it, contribute, and have a proper discussion. In this
> context, my idea is to give the ideas some time to mature within ASF, and
> only then even consider a switch.
>
> Thanks,
> Jorge
>
> [1] https://github.com/jorgecarleitao/arrow2
> [2] https://github.com/jorgecarleitao/parquet2
> [3]
>
> https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
> [4]
>
> https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
> [5]
>
> https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
> [6]
>
> https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
> [7] https://github.com/apache/arrow-datafusion/pull/68
> [8] https://issues.apache.org/jira/browse/ARROW-12643
>
>
> On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Over the past 4 months, I have been growing more and more frustrated by
> > the amount of undefined behaviour that I am finding and fixing in the
> > Rust implementation. I would like to open a discussion for a broader
> > overview of the problem in light of our current knowledge and what Rust
> > enables, as well as offer a solution to the bigger problem.
> >
> > Just to give you a gist of the seriousness of the issue, the following
> > currently compiles, runs, and is undefined behavior in Rust:
> >
> > let buffer = Buffer::from(&[0i32, 2i32]);
> > let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> > let array = Float64Array::from(Arc::new(data));
> > println!("{:?}", array.value(1));
> >
> > I would like to propose a major refactor of the crate around the physical
> > traits, Buffer, MutableBuffer and ArrayData, to make our code type-safe
> > at compile time, thereby preventing things like the example above from
> > happening again.
> >
> > So far, I was able to reproduce all core features of the arrow crate
> > (nested types, dynamic typing, FFI, memory alignment, performance) by
> > using `Buffer<T: NativeType>` instead of `Buffer` and removing
> > `ArrayData` and RawPointer altogether.
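The core idea of `Buffer<T: NativeType>` can be sketched as follows. The names and shapes here are illustrative, not the proposal's exact types: by parameterizing the buffer over its physical type, reinterpreting i32 bytes as f64, as in the undefined-behavior example above, no longer type-checks.

```rust
/// Marker trait for physical (native) types; illustrative, minimal version.
trait NativeType: Copy {}
impl NativeType for i32 {}
impl NativeType for f64 {}

/// A buffer that knows its element type at compile time.
struct Buffer<T: NativeType> {
    values: Vec<T>,
}

/// An array over a typed buffer; no untyped ArrayData indirection.
struct PrimitiveArray<T: NativeType> {
    buffer: Buffer<T>,
}

impl<T: NativeType> PrimitiveArray<T> {
    fn value(&self, i: usize) -> T {
        // Bound-checked; a length/type mismatch like the example above is
        // impossible because the buffer carries its own element type and
        // count.
        self.buffer.values[i]
    }
}

fn main() {
    let array = PrimitiveArray { buffer: Buffer { values: vec![0i32, 2] } };
    assert_eq!(array.value(1), 2);
    // Constructing a PrimitiveArray<f64> over this i32 buffer is now a
    // compile error, not runtime undefined behavior.
    println!("ok");
}
```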
> >
> > Safety-wise, it significantly limits the usage of `unsafe` in
> > higher-level APIs, it has a single transmute (the bit chunk iterator
> > one), and a guaranteed-safe public API (which is not the case in our
> > master, as shown above).
> >
> > Performance-wise, it yields a 1.3x improvement over the current master
> > (after this fix <https://github.com/apache/arrow/pull/9301> of UB in the
> > take kernel, 1.7x prior to it) for the `take` kernel for primitives. I
> > expect other major performance improvements as well.
> >
> > API wise, it simplifies the traits that we have for memory layout as well
> > as the handling of bitmaps, offsets, etc.
> >
> > The proposal is drafted as a README
> > <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a
> > repo that I created specifically for this from the ground up, and the
> > full set of changes are in a PR
> > <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can
> > view and comment on it. I haven't made a PR to master because this is
> > too large to track as a diff against master, and that is beside the
> > point anyway.
> >
> > I haven't ported most of the crate as I only tried the non-trivial
> > features (memory layout, bitmaps, FFI, dynamic typing, nested types).
> >
> > I would highly appreciate your thoughts about it.
> >
> > Best,
> > Jorge
> >
> >
>
