I have just created a wiki page to organize work and follow-on projects, as this is likely to be a pretty large project that spans many months of development:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070249

You can add JIRAs to the list by applying the "csv" label

- Wes

On Sun, Aug 19, 2018 at 4:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Antoine and Wes,
>
> really excited to see this happen. CSVs and co. are the file formats you
> never get rid of, so it is really important to have an Arrow reader.
> Concerning the custom implementation, I can further back this: while
> working on the parquet_arrow reader, I spent quite some time building
> custom, optimized paths that produce Arrow columns directly instead of
> using a more Parquet-native intermediate. For example, the methods
> suffixed with *spaced in parquet-cpp brought a 2-4x improvement in read
> performance compared to the more general implementations.
>
> Uwe
>
> On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
>> hi all,
>>
>> Early in the project I created the issue
>>
>> https://issues.apache.org/jira/browse/ARROW-25
>>
>> about creating a high-performance CSV file reader that returns Arrow
>> record batches. Many data systems have invested significant energy in
>> solving this problem, so why would we build Yet Another CSV Reader in
>> Apache Arrow? I originally wrote pandas.read_csv, for example.
>>
>> Well, there are in fact some really good reasons.
>>
>> 1) There have been a number of advances in designs for CSV readers that
>> leverage multiple cores for better performance, for example
>>
>> * https://github.com/wiseio/paratext
>> * the data.table::fread function in R
>>
>> and others. Many existing CSV readers can and should be rearchitected
>> to take advantage of these designs.
>>
>> 2) The hot paths in CSV parsing tend to be highly particular to the
>> target data structures. Using intermediate data structures hurts
>> performance in a meaningful way. Also, the orientation (columnar vs.
>> non-columnar) impacts the general design of the computational hot
>> paths.
>>
>> Other computational choices, such as how to handle erroneous values or
>> nulls, or whether to dictionary-encode string columns (such as using
>> Arrow's dictionary encoding), have an impact on the design as well.
>>
>> Thus, the highest-performance CSV reader must be specialized to the
>> Arrow columnar layout in its hot paths.
>>
>> 3) Many applications spend a lot of their time converting text files
>> into tables, so solving the problem well pays long-term dividends.
>>
>> 4) As a development platform, solving the problem well in Apache Arrow
>> will enable many downstream consumers to profit from performance and
>> IO gains, and having this critical piece of shared infrastructure in a
>> community project will drive contributions back upstream into Arrow.
>> For example, we could use this easily in Python, R, and Ruby.
>>
>> 5) By building inside Arrow we can use common interfaces for IO
>> and concurrency: file system APIs, memory management (taking
>> advantage of our jemalloc infrastructure [1]), on-the-fly
>> decompression, asynchronous / buffering input streams, thread
>> management, and others.
>>
>> There are probably other reasons, but these are the main ones I think
>> about.
>>
>> I spoke briefly about the project with Antoine, and he has started
>> putting together the start of a reader in the C++ codebase:
>>
>> https://github.com/pitrou/arrow/tree/csv_reader
>>
>> I'm excited for this project to get off the ground, as it will have a
>> lot of user-visible impact and pay dividends for many years. It would
>> be great for those who have worked on fast CSV parsing to share their
>> experiences and get involved to help make good design choices and take
>> advantage of lessons learned in other projects.
>>
>> - Wes
>>
>> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
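The chunk-parallel, columnar-direct design referred to in points 1 and 2 above can be sketched as follows. This is a minimal illustration in plain Python, not the Arrow implementation; every function name here is invented for the example. The key ideas it shows are: split the input only at newline boundaries so no row straddles two chunks, parse each chunk in a worker directly into per-column storage (no intermediate row objects), then concatenate the per-chunk columns in order.

```python
from concurrent.futures import ThreadPoolExecutor

def find_chunk_boundaries(data: bytes, n_chunks: int):
    """Compute (start, end) spans that always break just after a newline,
    so no CSV row is split across two chunks."""
    approx = max(1, len(data) // n_chunks)
    boundaries = [0]
    pos = approx
    while pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            break
        boundaries.append(nl + 1)
        pos = nl + 1 + approx
    boundaries.append(len(data))
    return list(zip(boundaries[:-1], boundaries[1:]))

def parse_chunk(data: bytes, start: int, end: int, n_cols: int):
    """Parse one chunk directly into per-column lists (the columnar hot
    path: no row-oriented intermediate is retained)."""
    cols = [[] for _ in range(n_cols)]
    for line in data[start:end].split(b"\n"):
        if not line:
            continue
        fields = line.split(b",")
        for i in range(n_cols):
            # An empty field becomes a null, mirroring the null-handling
            # design choice discussed in the thread.
            cols[i].append(fields[i].decode() if i < len(fields) and fields[i] else None)
    return cols

def read_csv_columnar(data: bytes, n_cols: int, n_chunks: int = 4):
    spans = find_chunk_boundaries(data, n_chunks)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda s: parse_chunk(data, s[0], s[1], n_cols), spans))
    # Concatenate each column's per-chunk pieces, preserving row order.
    return [sum((p[i] for p in parts), []) for i in range(n_cols)]

data = b"1,a\n2,b\n3,\n4,a\n"
print(read_csv_columnar(data, 2))
# -> [['1', '2', '3', '4'], ['a', 'b', None, 'a']]
```

A real reader additionally has to handle quoting, escapes, and type inference, which is exactly where the specialization to the target layout that the thread describes pays off.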
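The dictionary-encoding choice mentioned in point 2 can be illustrated with a toy sketch (again plain Python, not Arrow's implementation): a string column is stored as a small dictionary of unique values plus an array of integer indices into it, which is the shape of data an Arrow DictionaryArray holds.

```python
def dictionary_encode(values):
    """Toy dictionary encoding: return (dictionary, indices) where
    dictionary holds unique values in order of first appearance and
    indices maps each input value to its position in the dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

print(dictionary_encode(["apple", "banana", "apple", "apple"]))
# -> (['apple', 'banana'], [0, 1, 0, 0])
```

For low-cardinality string columns this representation saves memory and speeds up later comparisons, which is why deciding whether to emit it affects the parser's hot-path design rather than being a post-processing step.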