On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:
>
> Hi,
>
> Thanks a lot for your feedback. I agree with all the arguments put forward,
> including Andrew's point about the large change.
>
> I tried a gradual migration 4 months ago, but it was really difficult and
> I gave up. I estimate that the work involved is half the work of writing
> parquet2 and arrow2 in the first place. The internal dependency on
> ArrayData (the main culprit of the unsafe code) in arrow-rs is so
> pervasive that all core components need to be re-written from scratch
> (IPC, FFI, IO, array/transform/*, compute, SIMD). I personally do not
> have the motivation to do it, though.
>
> Jed, the public API changes are small for end users. A typical migration
> is shown in [1]. I agree that we can further reduce the change-set by
> keeping legacy interfaces available.
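>
> For a flavour of what a typical call-site change looks like (a rough
> sketch, not taken from the linked PR; exact constructors vary by
> version):
>
> // before, with arrow:
> use arrow::array::Int32Array;
> let array = Int32Array::from(vec![1, 2, 3]);
>
> // after, with arrow2 (Int32Array is an alias of PrimitiveArray<i32>):
> use arrow2::array::Int32Array;
> let array = Int32Array::from_slice(&[1, 2, 3]);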
>
> Andy, on my machine, the current benchmarks on query 1 yield:
>
> type                                               master (ms)   PR [2] arrow2+parquet2 (ms)
> memory (-m)                                        332.9         239.6
> load (initial time in -m with --format parquet)    5286.0        3043.0
> parquet format                                     1316.1        930.7
> tbl format                                         5297.3        5383.1
>
> i.e. I am observing some improvements. Queries with joins are still
> slower. The pruning of parquet row groups and pages based on statistics
> is not yet there; I am working on it.
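>
> The idea is the standard one: a row group can be skipped whenever its
> statistics prove that no row in it can match the predicate. A
> self-contained sketch (hypothetical types, not the parquet2 API), for a
> predicate `col > 10`:
>
> struct GroupStats { min: i32, max: i32 }
>
> /// indices of the row groups that may contain matches for `col > 10`
> fn groups_to_read(stats: &[GroupStats]) -> Vec<usize> {
>     stats
>         .iter()
>         .enumerate()
>         // keep a group only if its max proves a match is possible
>         .filter(|(_, s)| s.max > 10)
>         .map(|(i, _)| i)
>         .collect()
> }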
>
> I agree that this should go through IP clearance. I will start this
> process. My thinking would be to create two empty repos on apache/*, and
> create 2 PRs from the main branches of each of my repos to those repos, and
> only merge them once IP is cleared. Would that be a reasonable process, Wes?

This sounds plenty fine to me — I'm happy to assist with the IP
clearance process, having done it several times in the past. I don't
have an opinion about the names, but having experimental- in the name
sounds in line with the previous discussion we had about this.

> Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
>
> Best,
> Jorge
>
> [1]
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> [2] https://github.com/apache/arrow-datafusion/pull/68
>
>
> On Fri, May 28, 2021 at 5:22 AM Josh Taylor <joshuatayl...@gmail.com> wrote:
>
> > I played around with it. For my use case I really like the new way of
> > writing CSVs; it's much more obvious. I love the `read_stream_metadata`
> > function as well.
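> >
> > For anyone curious, my read loop is shaped roughly like this (from
> > memory, so module paths and signatures may be slightly off, and error
> > handling is elided; `path` is the file to read):
> >
> > use std::fs::File;
> > use std::io::BufReader;
> > use arrow2::io::ipc::read::{read_stream_metadata, StreamReader};
> >
> > // read the stream metadata once, then iterate the batches
> > let mut file = BufReader::new(File::open(path)?);
> > let metadata = read_stream_metadata(&mut file)?;
> > for batch in StreamReader::new(file, metadata) {
> >     let batch = batch?;
> >     // serialize `batch` to CSV with arrow2's io::csv::write helpers
> > }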
> >
> > I'm seeing a very slight speed improvement (~8ms) on my end, but I read
> > a bunch of files in a directory and spit out a CSV, so the bottleneck is
> > parsing lots of files; it's pretty quick per file, though.
> >
> > old:
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0  120224 bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1  123144 bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 159ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 160ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 158ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 160ms
> >
> > new:
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0  120224 bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1  123144 bytes took 1ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 157ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 152ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 154ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 154ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 152ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 153ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 155ms
> > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 153ms
> >
> > I'm going to chunk the dirs to speed up the reads and throw them into a
> > par iter.
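> >
> > Something like this with rayon (an untested sketch; `convert_to_csv` is
> > my own helper wrapping the read loop above):
> >
> > use rayon::prelude::*;
> >
> > // convert the files in parallel, one task per file
> > paths.par_iter().for_each(|path| convert_to_csv(path));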
> >
> > On Fri, 28 May 2021 at 09:09, Josh Taylor <joshuatayl...@gmail.com> wrote:
> >
> > > Hi!
> > >
> > > I've been using arrow/arrow-rs for a while now, my use case is to parse
> > > Arrow streaming files and convert them into CSV.
> > >
> > > Rust has been an absolutely fantastic tool for this; the performance
> > > is outstanding, and I have had no issues using it for my use case.
> > >
> > > I would be happy to test out the branch and let you know what the
> > > performance is like, since I was going to improve the current
> > > implementation that I have for the CSV writer; it takes a while for
> > > bigger datasets (multi-GB).
> > >
> > > Josh
> > >
> > >
> > > On Thu, 27 May 2021 at 22:49, Jed Brown <j...@jedbrown.org> wrote:
> > >
> > >> Andy Grove <andygrov...@gmail.com> writes:
> > >> >
> > >> > Looking at this purely from the DataFusion/Ballista point of view,
> > >> > what I would be interested in would be having a branch of DF that
> > >> > uses arrow2, and once that branch has all tests passing and can run
> > >> > queries with performance at least as good as the original arrow
> > >> > crate, then cut over.
> > >> >
> > >> > However, for developers using the arrow APIs directly, I don't see
> > >> > an easy path. We either try to gradually PR the changes in (which
> > >> > seems really hard given that there are significant changes to APIs
> > >> > and internal data structures), or we port some portion of the
> > >> > existing tests over to arrow2 and then make that the official crate
> > >> > once all tests pass.
> > >>
> > >> How feasible would it be to make a legacy module in arrow2 that would
> > >> enable (some large subset of) existing arrow users to try arrow2 after
> > >> adjusting their use statements? (That is, implement the public-facing
> > >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> > >> make it easier to test with DataFusion/Ballista and external users of
> > >> the current arrow crate, then cut over and let those packages update
> > >> incrementally from legacy to modern arrow2.
> > >>
> > >> I think it would be okay to tolerate some performance degradation when
> > >> working through these legacy interfaces, so long as there was confidence
> > >> that modernizing the callers would recover the performance (as tests
> > >> have been showing).
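> > >>
> > >> Concretely, I picture something shaped like this (every name below is
> > >> invented for illustration, not real arrow or arrow2 code):
> > >>
> > >> // a `legacy` facade inside arrow2, mirroring the old public API
> > >> pub mod legacy {
> > >>     use std::sync::Arc;
> > >>
> > >>     pub struct ArrayData;   // stand-in for the old arrow-rs type
> > >>     pub trait Array {}      // stand-in for arrow2's Array trait
> > >>
> > >>     /// old-style entry point, implemented on top of the safe API
> > >>     pub fn make_array(_data: ArrayData) -> Arc<dyn Array> {
> > >>         // validate the raw buffers once at this boundary, then
> > >>         // hand off to arrow2's safe constructors
> > >>         todo!()
> > >>     }
> > >> }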
> > >>
> > >
> >
