hi all,

Early in the project I created the issue

https://issues.apache.org/jira/browse/ARROW-25

about creating a high performance CSV file reader that returns Arrow
record batches. Many data systems have invested significant energies
in solving this problem, so why would we build Yet Another CSV Reader
in Apache Arrow? I originally wrote pandas.read_csv, for example.

Well, there are in fact some really good reasons.

1) There have been a number of advances in CSV reader designs that
leverage multiple cores for better performance, for example:

* https://github.com/wiseio/paratext
* the data.table::fread function in R

and others. Many existing CSV readers can and should be rearchitected
to take advantage of these designs.
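
To make the idea concrete, here is a rough, hypothetical Python sketch
of that chunked design (all names and the naive row parser are made up
for illustration; real readers do this in C/C++ on raw buffers): split
the input into byte ranges, snap each range to the next newline so no
row straddles two chunks, then parse the chunks on a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_boundaries(data: bytes, n_chunks: int):
    """Split data into ~equal byte ranges aligned to newline boundaries."""
    size = len(data)
    bounds = [0]
    for i in range(1, n_chunks):
        pos = (size * i) // n_chunks
        nl = data.find(b"\n", pos)
        # advance to just past the next newline so no row is split
        bounds.append(size if nl == -1 else nl + 1)
    bounds.append(size)
    # drop empty/degenerate ranges (can happen with very long lines)
    return [(bounds[i], bounds[i + 1])
            for i in range(n_chunks) if bounds[i] < bounds[i + 1]]

def parse_chunk(chunk: bytes):
    """Naive stand-in for the real hot-path parser."""
    return [line.split(b",") for line in chunk.splitlines() if line]

def parallel_parse(data: bytes, n_chunks: int = 4):
    ranges = chunk_boundaries(data, n_chunks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(parse_chunk, (data[a:b] for a, b in ranges))
    # map() preserves chunk order, so rows come back in file order
    return [row for part in parts for row in part]
```

The interesting part is the boundary-snapping, which is what lets each
worker parse independently without coordinating on row starts.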

2) The hot paths in CSV parsing tend to be highly particular to the
target data structures. Utilizing intermediate data structures hurts
performance in a meaningful way. Also, the orientation (columnar vs.
non-columnar) impacts the general design of the computational hot
paths.

Other computational choices, such as how to handle erroneous values or
nulls, or whether to dictionary-encode string columns (such as with
Arrow's dictionary encoding), have an impact on the design as well.

Thus, the highest performance CSV reader must be specialized to the
Arrow columnar layout in its hot paths.
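
As a toy illustration of the dictionary-encoding point, here is a
minimal pure-Python sketch: a string column becomes a small dictionary
of unique values plus an array of integer indices into it. (Arrow
proper tracks nulls with a validity bitmap; the -1 sentinel here is
just a stand-in, and parsing straight into this layout is what avoids
materializing one string object per cell.)

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a list of strings; None -> -1."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v is None:
            indices.append(-1)  # null slot (Arrow uses a validity bitmap)
            continue
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices
```

For low-cardinality columns (country codes, categories) this both
shrinks memory and speeds up later grouping and comparison.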

3) Many applications spend a lot of their time converting text files
into tables, so solving the problem well pays long-term dividends.

4) As a development platform, solving the problem well in Apache Arrow
will enable many downstream consumers to profit from performance and
IO gains, and having this critical piece of shared infrastructure in a
community project will drive contributions back upstream into Arrow.
For example, we could use this easily in Python, R, and Ruby.

5) By building inside Arrow we can utilize common interfaces for IO
and concurrency: file system APIs, memory management (and taking
advantage of our jemalloc infrastructure [1]), on-the-fly
decompression, asynchronous / buffering input streams, thread
management, and others.
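
Purely to illustrate the layering (Arrow's actual C++ interfaces look
different), here is a sketch using Python's standard library of how
buffering and on-the-fly decompression compose, so the parser only
ever sees a plain stream of bytes:

```python
import gzip
import io

def open_csv_stream(path: str, compressed: bool = False) -> io.BufferedReader:
    """Open a (possibly gzipped) file as a buffered byte stream."""
    raw = open(path, "rb")
    if compressed:
        raw = gzip.GzipFile(fileobj=raw)  # on-the-fly decompression
    return io.BufferedReader(raw)         # buffered reads for the parser
```

The point is that the parser is written once against the byte-stream
interface, and compression, buffering, and (in Arrow) memory
management and file systems are swapped in underneath it.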

There are probably other reasons, but these are the main ones I think about.

I spoke briefly about the project with Antoine, and he has started
putting together a reader in the C++ codebase:

https://github.com/pitrou/arrow/tree/csv_reader

I'm excited for this project to get off the ground as it will have a
lot of user-visible impact and pay dividends for many years. It would
be great for those who have worked on fast CSV parsing to share their
experiences and get involved to help make good design choices and take
advantage of lessons learned in other projects.

- Wes

[1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
