I played around with it. For my use case I really like the new way of writing CSVs; it's much more obvious. I love the `read_stream_metadata` function as well.
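
For reference, this is roughly what I expect the per-file read to look like with the new API. The module path, the `StreamReader` name and the iterator's item type are guesses on my part (only `read_stream_metadata` is taken from the new API itself), so treat this as a sketch rather than the actual arrow2 interface:

// Sketch only: assumes arrow2 exposes read_stream_metadata and a
// StreamReader under io::ipc::read, and that the reader iterates
// over record batches; double-check against the actual crate.
use std::fs::File;
use std::io::BufReader;

use arrow2::io::ipc::read::{read_stream_metadata, StreamReader};

fn count_rows(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open(path)?);
    // Read the stream header once, then pull record batches lazily.
    let metadata = read_stream_metadata(&mut reader)?;
    let stream = StreamReader::new(reader, metadata);

    let mut rows = 0;
    for batch in stream {
        rows += batch?.num_rows();
    }
    Ok(rows)
}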
I'm seeing a very slight speed improvement (~8ms) on my end. My use case reads a bunch of files in a directory and spits out a CSV, so the bottleneck is parsing lots of files rather than any single one; each file on its own is pretty quick.

old:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 159ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 160ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 158ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 160ms

new:
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 157ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 154ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 152ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 153ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 155ms
/home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 153ms

I'm going to chunk the dirs to speed up the reads and throw it into a par iter (rough sketch below the quoted thread).

On Fri, 28 May 2021 at 09:09, Josh Taylor <joshuatayl...@gmail.com> wrote:

> Hi!
>
> I've been using arrow/arrow-rs for a while now; my use case is to parse
> Arrow streaming files and convert them into CSV.
>
> Rust has been an absolutely fantastic tool for this; the performance is
> outstanding and I have had no issues using it for my use case.
>
> I would be happy to test out the branch and let you know what the
> performance is like, as I was going to improve the current implementation
> that I have for the CSV writer, as it takes a while for bigger datasets
> (multi-GB).
> Josh
>
> On Thu, 27 May 2021 at 22:49, Jed Brown <j...@jedbrown.org> wrote:
>
>> Andy Grove <andygrov...@gmail.com> writes:
>> >
>> > Looking at this purely from the DataFusion/Ballista point of view, what I
>> > would be interested in would be having a branch of DF that uses arrow2 and
>> > once that branch has all tests passing and can run queries with performance
>> > that is at least as good as the original arrow crate, then cut over.
>> >
>> > However, for developers using the arrow APIs directly, I don't see an easy
>> > path. We either try and gradually PR the changes in (which seems really
>> > hard given that there are significant changes to APIs and internal data
>> > structures) or we port some portion of the existing tests over to arrow2
>> > and then make that the official crate once all tests pass.
>>
>> How feasible would it be to make a legacy module in arrow2 that would
>> enable (some large subset of) existing arrow users to try arrow2 after
>> adjusting their use statements? (That is, implement the public-facing
>> legacy interfaces in terms of arrow2's new, safe interface.) This would
>> make it easier to test with DataFusion/Ballista and external users of the
>> current arrow crate, then cut over and let those packages update
>> incrementally from legacy to modern arrow2.
>>
>> I think it would be okay to tolerate some performance degradation when
>> working through these legacy interfaces, so long as there was confidence
>> that modernizing the callers would recover the performance (as tests have
>> been showing).
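
Here is a rough sketch of the chunk-the-dirs-into-a-par-iter idea mentioned above. rayon's `par_iter` is real, but `convert_file` is just a hypothetical stand-in for the existing per-file Arrow-stream-to-CSV routine:

// Collect the directory entries up front, then let rayon fan the files
// out across its thread pool. Each file is independent, so this should
// scale until disk I/O becomes the limit.
use std::fs;
use std::path::{Path, PathBuf};

use rayon::prelude::*;

// Hypothetical stand-in for the existing read-stream-and-write-CSV step.
fn convert_file(path: &Path) -> Result<(), Box<dyn std::error::Error>> {
    println!("would convert {}", path.display());
    Ok(())
}

fn convert_all(dir: &str) -> std::io::Result<()> {
    let mut paths: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.is_file())
        .collect();
    paths.sort();

    // One rayon task per file; errors are reported but don't stop the rest.
    paths.par_iter().for_each(|path| {
        if let Err(err) = convert_file(path) {
            eprintln!("{}: {}", path.display(), err);
        }
    });
    Ok(())
}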
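
On Jed's legacy-module idea, the shape I'd picture is a thin wrapper layer along these lines: old-style types whose methods keep the legacy signatures but delegate to the new implementation underneath. Every name below is a placeholder, not the real arrow or arrow2 API:

// Purely illustrative compatibility shim; `new_api::Int32Array` stands in
// for an arrow2 type and `legacy::Int32Array` for the old public interface.
mod new_api {
    pub struct Int32Array {
        values: Vec<i32>,
    }
    impl Int32Array {
        pub fn from_values(values: Vec<i32>) -> Self {
            Self { values }
        }
        pub fn len(&self) -> usize {
            self.values.len()
        }
        pub fn value(&self, i: usize) -> i32 {
            self.values[i]
        }
    }
}

pub mod legacy {
    use super::new_api;

    // Legacy-facing array type, implemented on top of the new one so
    // existing callers only have to adjust their use statements.
    pub struct Int32Array(new_api::Int32Array);

    impl Int32Array {
        pub fn from(values: Vec<i32>) -> Self {
            Self(new_api::Int32Array::from_values(values))
        }
        pub fn len(&self) -> usize {
            self.0.len()
        }
        pub fn value(&self, i: usize) -> i32 {
            self.0.value(i)
        }
    }
}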