Hi all,

Early in the project I created the issue https://issues.apache.org/jira/browse/ARROW-25 about creating a high-performance CSV file reader that returns Arrow record batches. Many data systems have invested significant energy in solving this problem, so why would we build Yet Another CSV Reader in Apache Arrow? (I originally wrote pandas.read_csv, for example.) Well, there are in fact some really good reasons:

1) There have been a number of advances in CSV reader designs that leverage multiple cores for better performance, for example:

* https://github.com/wiseio/paratext
* the data.table::fread function in R

and others. Many existing CSV readers can and should be rearchitected to take advantage of these designs.

2) The hot paths in CSV parsing tend to be highly particular to the target data structures, and going through intermediate data structures hurts performance in a meaningful way. The orientation of the output (columnar vs. non-columnar) also shapes the general design of the computational hot paths, and other computational choices, such as how to handle erroneous values or nulls, or whether to dictionary-encode string columns (e.g. using Arrow's dictionary encoding), have an impact on the design as well. Thus, the highest-performance CSV reader must be specialized to the Arrow columnar layout in its hot paths.

3) Many applications spend a lot of their time converting text files into tables, so solving the problem well pays long-term dividends.

4) As a development platform, solving the problem well in Apache Arrow will enable many downstream consumers to profit from performance and IO gains, and having this critical piece of shared infrastructure in a community project will drive contributions back upstream into Arrow. For example, we could easily use this in Python, R, and Ruby.
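To illustrate why a choice like dictionary encoding reaches into the parser's hot path: a dictionary-encoded column stores each distinct string once plus a compact array of integer indices, so the encoding decision changes what the inner loop produces per cell. Here is a minimal pure-Python sketch of the idea; it is a conceptual illustration only, not Arrow's actual API or layout:

```python
def dictionary_encode(values):
    """Return (indices, dictionary) for a sequence of strings.

    `dictionary` holds each distinct value once, in order of first
    appearance; `indices` holds one integer per input cell.
    """
    dictionary = []   # distinct values
    lookup = {}       # value -> index into `dictionary`
    indices = []
    for v in values:
        idx = lookup.get(v)
        if idx is None:
            idx = len(dictionary)
            lookup[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return indices, dictionary

# A low-cardinality CSV column encodes to small integers:
csv_column = ["US", "CA", "US", "US", "MX", "CA"]
indices, dictionary = dictionary_encode(csv_column)
print(indices)     # [0, 1, 0, 0, 2, 1]
print(dictionary)  # ['US', 'CA', 'MX']
```

Doing this while parsing, rather than materializing every string and encoding afterwards, is exactly the kind of specialization that argues for building the reader directly against the Arrow columnar layout.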
5) By building inside Arrow we can utilize common interfaces for IO and concurrency: file system APIs, memory management (taking advantage of our jemalloc infrastructure [1]), on-the-fly decompression, asynchronous / buffering input streams, thread management, and others.

There are probably some other reasons, but these are the main ones I think about.

I spoke briefly about the project with Antoine, and he has started putting together the beginnings of a reader in the C++ codebase: https://github.com/pitrou/arrow/tree/csv_reader

I'm excited for this project to get off the ground, as it will have a lot of user-visible impact and pay dividends for many years. It would be great for those who have worked on fast CSV parsing to share their experiences, get involved to help make good design choices, and take advantage of lessons learned in other projects.

- Wes

[1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/