I think that, given the size and scope of the work, there's a stronger argument for having an IP clearance for this code (as compared with python-datafusion).
On Thu, May 27, 2021 at 5:45 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> I am not opposed to a new repo.
>
> However, I believe that the largest barrier to the community really getting their heads around / evaluating arrow2 is its sheer size. A -92k +57k diff isn't something I am likely to get my head around in any level of detail until I actively work with it for a while.
>
> The best way to get community input, I think, is to start the process of getting arrow2 into arrow-rs via PRs. While splitting it up into multiple PRs somehow is likely not easy and would require lots more work, starting to get this work into arrow-rs in smaller chunks would be the ideal outcome in my opinion.
>
> Therefore, I don't see any benefit to a new repo -- I think a branch in arrow-rs (or a fork) would work just as well. But again, I am not opposed to a new repo either.
>
> Andrew
>
> On Wed, May 26, 2021 at 3:47 AM Fernando Herrera <fernando.j.herr...@gmail.com> wrote:
> >
> > Thanks Jorge for the update and the continuous development of a safer version of arrow.
> >
> > I would like to give my support to option 3 as well. IMHO it will give arrow2 the exposure it needs to be considered by a wider set of users. This exposure will open the possibility of receiving more participation on the missing features required to integrate arrow2 with datafusion and ballista.
> >
> > It will also give arrow users peace of mind that arrow2 will follow the Apache way, meaning that its development will be stable, supported and community driven. I have dealt with this issue myself: I have suggested the use of arrow2 for new projects, only for it to be discarded because of the impression that it isn't supported by the Apache community, even after seeing the advantages the project presents.
> >
> > Fernando
> >
> > On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I would like to offer an update on this. I continued to work heavily on this hypothesis in separate repos [1, 2], as it required a ground-up refactor.
> > >
> > > Following is the current status, and at the end some options that we can think about.
> > >
> > > TL;DR: I would like to gauge the community's interest in making arrow2 and parquet2 experimental repos in Apache Arrow. IMO they are safer, faster, more maintainable and equally compatible with both the arrow spec and the parquet spec.
> > >
> > > # Specification and interoperability
> > >
> > > IPC:
> > > All integration tests that 4.1.0 runs pass against apache/arrow@master. Furthermore, it passes the following tests that 4.1.0 does not:
> > > * big endian IPC producer and consumer
> > > * decimal128
> > > * interval
> > > * nested_large_offsets
> > >
> > > FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.
> > >
> > > Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0 are passing or are skipped.
> > >
> > > parquet: arrow2 tests against parquet files generated by pyarrow under different configurations:
> > > * physical and logical types
> > > * page versions
> > > * repetition levels
> > > * dictionary encoding
> > >
> > > # Safety
> > >
> > > * arrow2 addresses all security vulnerabilities (all our 20+ issues labeled with "security" [3] and more currently not encapsulated in any issue) and unsafety issues.
> > > In particular:
> > > * all tests pass under MIRI checks
> > > * all unsafe APIs are marked as unsafe
> > > * parquet2 does not use unsafe
> > >
> > > # Maintenance
> > >
> > > * arrow + parquet has a total of 56k+36k LOC, excluding headers
> > > * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
> > > * arrow2 coverage is 76%, arrow's is 88%
> > > * parquet2 is "unsafe"-free
> > >
> > > # Features
> > >
> > > Non-spec wise (e.g. compute, utils, parquet), the crate has about 90% of all the features in 4.0. What is missing:
> > > * nested parquet read and write (lists and structs)
> > > * some new features added since 4.0.0
> > >
> > > OTOH, it has the following additional features:
> > > * API to read CSV in parallel
> > > * API to read parquet in parallel
> > > * checked_X, saturating_X, overflowing_X operations (i.e. non-panicking versions of add, subtract, etc.)
> > > * arithmetic ops over dates, timestamps and durations
> > > * display for every logical type
> > > * more cast operations
> > > * merge-sort kernel
> > > * vectorized hashing
> > >
> > > # Performance
> > >
> > > * 3-15x faster reading and writing parquet, plus APIs to read it in parallel (see [4])
> > > * faster IPC read
> > > * arithmetics and the like have about the same performance as arrow 4.0 compiled with SIMD (available in nightly), and are ~1.5x faster without SIMD and nightly
> > > * some kernels degrade by about 20% due to bound checks (e.g. boolean "take" in arrow 4.1 allows out-of-bound reads and is thus faster)
> > > * sort and filter are ~2x faster; see [5, 6]
> > >
> > > # Interoperability with DataFusion
> > >
> > > I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same result and has about the same performance (datafusion can perform out-of-bound reads from arrow...); without group bys, it is ~2x faster.
> > >
> > > # Process / Community
> > >
> > > This is a "big chunk of code" type of situation developed in an external repo. I tried to keep folks informed of the status and what was being done; the mono-repo at the time was really difficult to cope with. With this said:
> > >
> > > * I proposed an experimental repo mechanism so that we can conduct this type of activity (now merged [8])
> > > * I am not merging PRs that introduce major new APIs, so that the community can weigh in
> > >
> > > # Licensing
> > >
> > > All code is licensed under MIT and Apache; contributors are required to accept either of these as part of their contributions. I can work to handle this part with the incubator / ASF ahead of any potential code movement.
> > >
> > > -----------------
> > >
> > > Main question: what do we do?
> > >
> > > Some ideas:
> > >
> > > 1. PR all this to the arrow-rs repo
> > > 2. push this to a branch in arrow-rs
> > > 3. move this to an experimental repo within ASF, work on it until we have feature parity (e.g. read and write nested types to/from parquet), and then apply 1 or 2
> > > 4. do nothing
> > >
> > > Concerns with option 1:
> > > * development will continue to happen outside ASF
> > > * no easy way to collaborate: issue tracking for this code is outside ASF
> > > * no easy way for the community to weigh in on changes to the API prior to merge
> > >
> > > Concerns with option 2:
> > > * issue tracking against branches is confusing, especially when creating changelogs
> > > * PRs against branches are confusing
> > > * no easy way for the community to weigh in on changes to the API prior to merge
> > >
> > > Concerns with option 3:
> > > * it would be the first experimental repo, thus some risk
> > >
> > > Concerns with option 4:
> > > * for the time being this would continue to be a project independent of Apache Arrow
> > > * I would release 0.1 to crates.io as "arrow-safe" or something, as there is demand for it
> > >
> > > I would be in favor of option 3 for the following reason: I do not think it is useful to PR a -92k +57k change without giving the community extensive time to evaluate, contribute, and have a proper discussion. In this context, my idea here was to give the ideas some time to mature within ASF, and only then even consider a switch.
> > >
> > > Thanks,
> > > Jorge
> > >
> > > [1] https://github.com/jorgecarleitao/arrow2
> > > [2] https://github.com/jorgecarleitao/parquet2
> > > [3] https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
> > > [4] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
> > > [5] https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
> > > [6] https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
> > > [7] https://github.com/apache/arrow-datafusion/pull/68
> > > [8] https://issues.apache.org/jira/browse/ARROW-12643
> > >
> > > On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Over the past 4 months, I have been growing more and more frustrated by the amount of undefined behaviour that I am finding and fixing in the Rust implementation. I would like to open a broader discussion about the problem in light of our current knowledge and what Rust enables, as well as offer a solution to the bigger problem.
> > > >
> > > > Just to give you a gist of the seriousness of the issue, the following currently compiles, runs, and is undefined behavior in Rust:
> > > >
> > > > let buffer = Buffer::from(&[0i32, 2i32]);
> > > > let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> > > > let array = Float64Array::from(Arc::new(data));
> > > > println!("{:?}", array.value(1));
> > > >
> > > > I would like to propose a major refactor of the crate around the physical traits, Buffer, MutableBuffer and ArrayData, to make our code type-safe at compile time, thereby preventing things like the example above from happening again.
> > > >
> > > > So far, I was able to reproduce all core features of the arrow crate (nested types, dynamic typing, FFI, memory alignment, performance) by using `Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and RawPointer altogether.
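[Editor's note: a minimal sketch of what such a typed-buffer design could look like, with hypothetical, simplified names rather than the actual arrow2 API. The point it illustrates is that, once the native type is a type parameter of the buffer, the mismatched i32/Float64 construction from the example above becomes a compile-time error instead of undefined behavior.]

    // Hypothetical, simplified sketch of a typed buffer; not the actual arrow2 API.
    use std::fmt::Debug;

    /// Stand-in for a `NativeType`-style trait: the fixed-size primitive types.
    trait NativeType: Copy + Debug {}
    impl NativeType for i32 {}
    impl NativeType for f64 {}

    /// A buffer whose element type is known at compile time.
    #[derive(Debug)]
    struct Buffer<T: NativeType> {
        data: Vec<T>,
    }

    impl<T: NativeType> Buffer<T> {
        fn from_slice(slice: &[T]) -> Self {
            Buffer { data: slice.to_vec() }
        }

        /// Bound-checked access: no out-of-bounds reads are possible from safe code.
        fn get(&self, i: usize) -> Option<&T> {
            self.data.get(i)
        }
    }

    /// An array here is just a typed buffer (a real design would also carry
    /// validity bitmaps, offsets, etc.).
    #[derive(Debug)]
    struct PrimitiveArray<T: NativeType> {
        values: Buffer<T>,
    }

    fn main() {
        let buffer = Buffer::from_slice(&[0i32, 2i32]);
        // This is a PrimitiveArray<i32>; building a PrimitiveArray<f64> from it
        // would not compile, unlike the untyped ArrayData construction above.
        let array = PrimitiveArray { values: buffer };
        println!("{:?}", array.values.get(1));
    }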
> > > > Safety-wise, it significantly limits the usage of `unsafe` in the higher-level APIs, it has a single transmute (the bit-chunk iterator one), and a guaranteed safe public API (which is not the case in our master, as shown above).
> > > >
> > > > Performance-wise, it yields a 1.3x improvement over the current master (after this fix <https://github.com/apache/arrow/pull/9301> of UB on the take kernel; 1.7x prior to it) for the `take` kernel for primitives. I expect other major performance improvements as well.
> > > >
> > > > API-wise, it simplifies the traits that we have for the memory layout as well as the handling of bitmaps, offsets, etc.
> > > >
> > > > The proposal is drafted as a README <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a repo that I created specifically for this from the ground up, and the full set of changes is in a PR <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can view and comment on it. I haven't made any PR to master because this is too large to track as a diff against master, and that is beside the point anyway.
> > > >
> > > > I haven't ported most of the crate, as I only tried the non-trivial features (memory layout, bitmaps, FFI, dynamic typing, nested types).
> > > >
> > > > I would highly appreciate your thoughts about it.
> > > >
> > > > Best,
> > > > Jorge
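[Editor's note: the "guaranteed safe public API" point above — bound-checked access by default, with any unchecked path only reachable through a function explicitly marked `unsafe` — could look roughly like the following. This is a hypothetical sketch with illustrative names and signatures, not the actual crate API.]

    // Hypothetical sketch of "safe by default" accessors; not the actual arrow2 API.
    struct Int32Array {
        values: Vec<i32>,
    }

    impl Int32Array {
        /// Safe, bound-checked access: an out-of-range index panics instead of
        /// reading arbitrary memory.
        fn value(&self, i: usize) -> i32 {
            self.values[i]
        }

        /// Unchecked access still exists for hot loops, but it is marked `unsafe`
        /// so callers must explicitly opt in and uphold the invariant.
        ///
        /// # Safety
        /// `i` must be less than `self.values.len()`.
        unsafe fn value_unchecked(&self, i: usize) -> i32 {
            *self.values.get_unchecked(i)
        }
    }

    fn main() {
        let array = Int32Array { values: vec![1, 2, 3] };
        println!("{}", array.value(1)); // checked
        // SAFETY: index 2 is within bounds (len == 3).
        println!("{}", unsafe { array.value_unchecked(2) });
    }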