I have just created a wiki page to organize work and follow-on
projects, as this is likely to be a pretty large project that spans
many months of development:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070249

You can add JIRAs to the list by applying the "csv" label.

- Wes

On Sun, Aug 19, 2018 at 4:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Antoine and Wes,
>
> really excited to see this happen. CSVs and co. are the file formats you never
> get rid of, so it is really important to have an Arrow reader. Concerning the
> custom implementation, I can back this up: while building the parquet::arrow
> reader, I spent quite some time building custom, optimized paths that produce
> Arrow columns directly instead of going through a more Parquet-native
> intermediate. For example, the methods suffixed with *Spaced in parquet-cpp
> brought a 2-4x improvement in read performance compared to the more
> general implementations.
>
> Uwe
>
> On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
>> hi all,
>>
>> Early in the project I created the issue
>>
>> https://issues.apache.org/jira/browse/ARROW-25
>>
>> about creating a high performance CSV file reader that returns Arrow
>> record batches. Many data systems have invested significant energies
>> in solving this problem, so why would we build Yet Another CSV Reader
>> in Apache Arrow? I originally wrote pandas.read_csv, for example.
>>
>> Well, there are in fact some really good reasons.
>>
>> 1) There have been a number of advances in CSV reader designs that
>> leverage multiple cores for better performance, for example:
>>
>> * https://github.com/wiseio/paratext
>> * the data.table::fread function in R
>>
>> and others. Many existing CSV readers can and should be rearchitected
>> to take advantage of these designs.
>>
>> 2) The hot paths in CSV parsing tend to be highly particular to the
>> target data structures. Utilizing intermediate data structures hurts
>> performance in a meaningful way. Also, the orientation (columnar vs.
>> non-columnar) impacts the general design of the computational hot
>> paths.
>>
>> Other computational choices, such as how to handle erroneous values or
>> nulls, or whether to dictionary-encode string columns (e.g. using
>> Arrow's dictionary encoding), have an impact on the design as well.
>>
>> Thus, the highest performance CSV reader must be specialized to the
>> Arrow columnar layout in its hot paths.
>>
>> 3) Many applications spend a lot of their time converting text files
>> into tables, so solving this problem well pays long-term dividends.
>>
>> 4) As a development platform, solving the problem well in Apache Arrow
>> will enable many downstream consumers to profit from performance and
>> IO gains, and having this critical piece of shared infrastructure in a
>> community project will drive contributions back upstream into Arrow.
>> For example, we could use this easily in Python, R, and Ruby.
>>
>> 5) By building inside Arrow we can utilize common interfaces for IO
>> and concurrency: file system APIs, memory management (and taking
>> advantage of our jemalloc infrastructure [1]), on-the-fly
>> decompression, asynchronous / buffering input streams, thread
>> management, and others.
>>
>> There are probably some other reasons, but these are the main ones I think
>> about.
>>
>> I spoke briefly about the project with Antoine and he has started
>> putting together the start of a reader in the C++ codebase:
>>
>> https://github.com/pitrou/arrow/tree/csv_reader
>>
>> I'm excited for this project to get off the ground as it will have a
>> lot of user-visible impact and pay dividends for many years. It would
>> be great for those who have worked on fast CSV parsing to share their
>> experiences and get involved to help make good design choices and take
>> advantage of lessons learned in other projects.
>>
>> - Wes
>>
>> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
