I am not opposed to a new repo. However I believe that the largest barrier to the community really getting their heads around / evaluating arrow2 is its sheer size. -92k +57k isn't something I am likely to get my head in any level of detail until I actively work with it for a while.
The best way to get community input, I think, is to start the process of getting arrow2 into arrow-rs via PRs. While splitting it up into multiple PRs somehow is likely not easy and would require lots more work, starting to get this work into arrow-rs in smaller chunks would be the ideal outcome in my opinion. Therefore, I don't see any benefit to a new repo -- I think a branch in arrow-rs (or a fork) would work just as well. But again, I am not opposed to a new repo either. Andrew On Wed, May 26, 2021 at 3:47 AM Fernando Herrera < fernando.j.herr...@gmail.com> wrote: > Thanks Jorge for the update and the continuous development on a > safer version of arrow. > > I would like to give my support for option 3 as well. IMHO it will give > arrow2 the exposition it needs to be considered by a wider set of > users. This exposition will open the possibility to receive more > participation regarding missing features required to integrate > arrow2 to datafusion and ballista. > > It will also give peace of mind to arrow users that arrow2 will > follow the apache way, meaning that its development will be stable, > supported and community driven. I have dealt with this issue myself, > where I have suggested the use of arrow2 for new projects, only > to be discarded because of the impression that it isn't supported > by the apache community; even after seeing the advantages the > project presents. > > > Fernando > > On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Hi, > > > > I would like to offer an update on this. I continued to work heavily on > > this hypothesis on separate repos [1, 2], as this required a ground-up > > refactor. > > > > Following is the current status, and at the end some options that we can > > think about. > > > > TL;DR: I would like to gauge the communities' interest in making arrow2 > and > > parquet2 experimental repos in Apache Arrow. IMO they are safer, faster, > > more maintainable and equally compatible with both the arrow spec and > > parquet spec. > > > > # Specification and interoperability > > > > IPC: > > > > All integration tests that in 4.1.0 runs pass against > apache/arrow@master. > > Furthermore, it passes the following tests that 4.1.0 does not: > > * big endian IPC producer and consumer > > * decimal128 > > * interval > > * nested_large_offsets > > > > FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0. > > > > Arrow-Flight: it is the same code; I am not sure the tests in 4.1.0 are > > passing or are skipped. > > > > parquet: arrow2 tests against parquet files generated by pyarrow under > > different configurations: > > * physical and logical types > > * page versions > > * repetition levels > > * dictionary encoding > > > > # Safety > > > > * arrow2 addresses all security vulnerabilities (all our +20 issues > labeled > > with "security" [3] and more currently not encapsulated in any issue) and > > unsafety issues. In particular, > > * all tests pass under MIRI checks > > * all unsafe APIs are marked as unsafe > > * parquet2 does not use unsafe > > > > # Maintenance > > > > * arrow + parquet has a total of 56k+36k LOC, excluding headers > > * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers > > * arrow2 coverage is 76%, arrow is 88% > > * parquet2 is "unsafe"-free > > > > # Features > > > > Non-spec wise (e.g. compute, utils, parquet), the crate has about 90% of > > all the features in 4.0. What is missing: > > * nested read and write parquet (lists and structs) > > * missing some new features since 4.0.0 > > > > OTOH, it has the following additional features: > > > > * API to read CSV in parallel > > * API to read parquet in parallel > > * checked_X, saturating_X, overflowing_X operations (i.e. non-panic > > versions of add, subtract, etc.) > > * arithmetics ops over dates, timestamps and durations > > * display for every logical type > > * more cast operations > > * merge-sort kernel > > * vectorized hashing > > > > # Performance > > > > * 3-15x reading and writing parquet, and APIs to read them in parallel > (see > > [4]) > > * faster IPC read > > * arithmetics and the like are about the same performance as arrow 4.0 > > compiled with SIMD (available in nightly), ~1.5x faster without SIMD and > > nightly. > > * Some kernels degrade by about 20% due to bound checks (e.g. boolean > > "take" in arrow 4.1 allows out of bound reads and is thus faster). > > * Sort and filter ~2x faster. See [5,6] > > > > # Interoperability with DataFusion > > > > I have an experimental PR [7] in DataFusion. The THPC1 yields the same > > result and has about the same performance (datafusion can perform > > out-of-bound reads from arrow...), without group bys, it is ~2x faster. > > > > # Process / Community > > > > This is a "big chunk of code" type of situation developed over an > external > > repo. I tried to keep folks informed of the status and what was being > done; > > the mono-repo at the time was really difficult to cope with. With this > > said: > > > > * I proposed an experimental repo mechanism so that we can conduct this > > type of activities (now merged [8]) > > * I am not merging PRs that introduce new major API, so that the > community > > can weight in > > > > # Licensing > > > > All code is licensed under MIT and Apache; contributors are required to > > accept any of these as part of their contributions. I can work to handle > > this part with the incubator / ASF ahead of any potential code movement. > > > > ----------------- > > > > Main question: what do we do? > > > > Some ideas: > > > > 1. PR all this to the arrow-rs repo > > 2. push this to a branch in arrow-rs > > 3. move this to an experimental repo within ASF and work on it until we > > have feature parity (e.g. read and write nested types to/from parquet), > and > > then apply 1 or 2 > > 4. do nothing > > > > Concerns with option 1: > > * development will continue to happen outside ASF > > * no easy way to collaborate: issue tracking for this code outside ASF > > * no easy way for the community to weight in over changes to the API > prior > > to merge > > Concerns with option 2: > > * issue tracking about branches is confusing, specially in creating > change > > logs > > * PRs to branches is confusing > > * no easy way for the community to weight in over changes to the API > prior > > to merge > > Concerns with option 3: > > * it would be the first experimental repo, thus some risk > > Concerns with option 4: > > * for the time being this would continue to be a project independent of > the > > Apache Arrow > > * I would release 0.1 to crate.io as "arrow-safe" or something, as there > > is > > demand for it. > > > > I would be in favor of option 3 for the following reason: I do not think > it > > is useful to PR a -92k +57k change without giving the community extensive > > time to evaluate, contribute, and time for a proper discussion. In this > > context, my idea here was to give some time for the ideas to mature > within > > ASF, and only then even consider a switch. > > > > Thanks, > > Jorge > > > > [1] https://github.com/jorgecarleitao/arrow2 > > [2] https://github.com/jorgecarleitao/parquet2 > > [3] > > > > > https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity > > [4] > > > > > https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045 > > [5] > > > > > https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2 > > [6] > > > > > https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043 > > [7] https://github.com/apache/arrow-datafusion/pull/68 > > [8] https://issues.apache.org/jira/browse/ARROW-12643 > > > > > > On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão < > > jorgecarlei...@gmail.com> wrote: > > > > > Hi, > > > > > > Over the past 4 months, I have been growing more and more frustrated by > > > the amount of undefined behaviour that I am finding and fixing on the > > Rust > > > implementation. I would like to open the discussion of a broader > overview > > > about the problem in light of our current knowledge and what Rust > enables > > > as well as offer a solution to the bigger problem. > > > > > > Just to give you a gist of the seriousness of the issue, the following > > > currently compiles, runs, and is undefined behavior in Rust: > > > > > > let buffer = Buffer::from(&[0i32, 2i32]);let data = > > ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);let > > array = Float64Array::from(Arc::new(data)); > > > println!("{:?}", array.value(1)); > > > > > > I would like to propose a major refactor of the crate around physical > > > traits, Buffer, MutableBuffer and ArrayData to make our code type-safe > at > > > compile time, thereby avoiding things like the example above from > > happening > > > again. > > > > > > So far, I was able to reproduce all core features of the arrow crate > > > (nested types, dynamic typing, FFI, memory alignment, performance) by > > using > > > `Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` > and > > > RawPointer altogether. > > > > > > Safety-wise, it significantly limits the usage of `unsafe` on higher > end > > > APIs, it has a single transmute (the bit chunk iterator one), and a > > > guaranteed safe public API (which is not the case in our master, as > shown > > > above). > > > > > > Performance wise, it yields a 1.3x improvement over the current master > > > (after this fix <https://github.com/apache/arrow/pull/9301> of UB on > the > > > take kernel, 1.7x prior to it) for the `take` kernel for primitives. I > > > should have other major performance improvements. > > > > > > API wise, it simplifies the traits that we have for memory layout as > well > > > as the handling of bitmaps, offsets, etc. > > > > > > The proposal is drafted as a README > > > <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on > a > > > repo that I created specifically for this from the ground up, and the > > full > > > set of changes are in a PR > > > <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can > > view > > > and comment on it. I haven't made any PR to master because this is too > > > large to track as a diff against master, and is beyond the point, > > anyways. > > > > > > I haven't ported most of the crate as I only tried the non-trivial > > > features (memory layout, bitmaps, FFI, dynamic typing, nested types). > > > > > > I would highly appreciate your thoughts about it. > > > > > > Best, > > > Jorge > > > > > > > > >