Re: Building a fast Arrow-native delimited file reader (e.g. for CSVs)

Antoine Pitrou Sun, 26 Aug 2018 13:53:58 -0700


Which range of types do we want to support ultimately?
Right now I have support for booleans, integers, floats, and strings.
I expect we'll need to optimize number parsing.
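
A minimal sketch of what choosing among these four types for a column of raw text fields could look like. This is illustrative Python, not the code in the csv_reader branch; the function name `infer_column_type` and the accepted boolean spellings are assumptions made for the example. It relies on Python's built-in `int()` and `float()`, which are more permissive than a CSV reader would probably want (they accept surrounding whitespace, `int()` accepts underscores such as "1_000", and `float()` accepts "nan" and "inf").

```python
def infer_column_type(values):
    """Return the narrowest of 'bool', 'int', 'float', 'string' fitting all values.

    Hypothetical helper for illustration only; nulls and empty fields
    are not handled specially.
    """
    def is_bool(text):
        # Assumed spellings; a real reader would make these configurable.
        return text.lower() in ("true", "false")

    def is_int(text):
        try:
            int(text)
            return True
        except ValueError:
            return False

    def is_float(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    if not values:
        # Nothing to infer from; fall back to the most general type.
        return "string"

    # Try the narrowest type first and widen until every value fits.
    for name, check in (("bool", is_bool), ("int", is_int), ("float", is_float)):
        if all(check(v) for v in values):
            return name
    return "string"
```

Going through general-purpose conversions with exception handling for every field is the kind of cost that a dedicated number parser would avoid.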


Regards

Antoine.


On 26/08/2018 at 22:47, Wes McKinney wrote:
> I have just created a wiki page to organize work and follow-on
> projects, as this is likely to be a pretty large project that spans
> many months of development:
> 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070249
> 
> You can add JIRAs to the list by applying the "csv" label
> 
> - Wes
> 
> On Sun, Aug 19, 2018 at 4:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> Hello Antoine and Wes,
>>
>> really excited to see this happen. CSVs and co are the file formats you 
>> never get rid of, so it is really important to have an Arrow reader. 
>> Concerning the custom implementation, I can further back this: while
>> working on the parquet_arrow reader, I spent quite some time building
>> custom, optimized paths that produce Arrow columns directly instead of
>> using a more parquet-native intermediate. For example, the methods
>> postfixed with *spaced in parquet-cpp brought a 2-4x improvement in read
>> performance compared to the more general implementations.
>>
>> Uwe
>>
>> On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
>>> hi all,
>>>
>>> Early in the project I created the issue
>>>
>>> https://issues.apache.org/jira/browse/ARROW-25
>>>
>>> about creating a high performance CSV file reader that returns Arrow
>>> record batches. Many data systems have invested significant energies
>>> in solving this problem, so why would we build Yet Another CSV Reader
>>> in Apache Arrow? I originally wrote pandas.read_csv, for example.
>>>
>>> Well, there are in fact some really good reasons.
>>>
>>> 1) There have been a number of advances in designs for CSV readers to
>>> leverage multiple cores for better performance, for example
>>>
>>> * https://github.com/wiseio/paratext
>>> * the datatable::fread function in R
>>>
>>> and others. Many existing CSV readers can and should be rearchitected
>>> to take advantage of these designs.
>>>
>>> 2) The hot paths in CSV parsing tend to be highly particular to the
>>> target data structures. Utilizing intermediate data structures hurts
>>> performance in a meaningful way. Also, the orientation (columnar vs.
>>> non-columnar) impacts the general design of the computational hot
>>> paths
>>>
>>> Other computational choices, such as how to handle erroneous values or
>>> nulls, or whether to dictionary-encode string columns (for instance with
>>> Arrow's dictionary encoding), have an impact on design as well.
>>>
>>> Thus, the highest performance CSV reader must be specialized to the
>>> Arrow columnar layout in its hot paths.
>>>
>>> 3) Many applications spend a lot of their time converting text files
>>> into tables, so solving the problem well pays long-term dividends.
>>>
>>> 4) As a development platform, solving the problem well in Apache Arrow
>>> will enable many downstream consumers to profit from performance and
>>> IO gains, and having this critical piece of shared infrastructure in a
>>> community project will drive contributions back upstream into Arrow.
>>> For example, we could use this easily in Python, R, and Ruby.
>>>
>>> 5) By building inside Arrow we can utilize common interfaces for IO
>>> and concurrency: file system APIs, memory management (and taking
>>> advantage of our jemalloc infrastructure [1]), on-the-fly
>>> decompression, asynchronous / buffering input streams, thread
>>> management, and others.
>>>
>>> There are probably some other reasons, but these are the main ones I
>>> think about.
>>>
>>> I spoke briefly about the project with Antoine and he has started
>>> putting together the start of a reader in the C++ codebase:
>>>
>>> https://github.com/pitrou/arrow/tree/csv_reader
>>>
>>> I'm excited for this project to get off the ground as it will have a
>>> lot of user-visible impact and pay dividends for many years. It would
>>> be great for those who have worked on fast CSV parsing to share their
>>> experiences and get involved to help make good design choices and take
>>> advantage of lessons learned in other projects
>>>
>>> - Wes
>>>
>>> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
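
Points 1) and 2) of the original message can be illustrated with a small sketch: split the input on newline boundaries, parse the chunks concurrently, append fields directly to per-column lists rather than building rows first, and optionally dictionary-encode a string column. This is not the design of paratext, fread, or the csv_reader branch. All function names here are hypothetical, and the sketch assumes comma-delimited, well-formed UTF-8 input with no header, no quoting, and no newlines inside fields. Python threads also share the interpreter lock, so this shows only the shape of the approach, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_on_newlines(data, n_chunks):
    """Cut data into pieces that each end on a newline (or at end of data)."""
    chunks, start = [], 0
    step = max(1, len(data) // n_chunks)
    while start < len(data):
        end = start + step
        if end >= len(data):
            end = len(data)
        else:
            # Extend the chunk to the next newline so no row is split.
            newline = data.find(b"\n", end)
            end = len(data) if newline == -1 else newline + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def parse_chunk(chunk, n_cols):
    """Parse one chunk straight into per-column lists (no row objects)."""
    columns = [[] for _ in range(n_cols)]
    for line in chunk.decode("utf-8").splitlines():
        if not line:
            continue
        # Assumes every row has exactly n_cols unquoted fields.
        for column, field in zip(columns, line.split(",")):
            column.append(field)
    return columns


def read_columns(data, n_cols, n_workers=4):
    """Parse chunks concurrently, then concatenate the pieces in order."""
    chunks = split_on_newlines(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(lambda c: parse_chunk(c, n_cols), chunks))
    columns = [[] for _ in range(n_cols)]
    for part in parts:
        for column, piece in zip(columns, part):
            column.extend(piece)
    return columns


def dictionary_encode(values):
    """Return (dictionary, indices): unique values plus one index per value."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices
```

For example, `read_columns(b"a,1\nb,2\na,3\nc,4\n", 2)` yields two column lists, and `dictionary_encode` on the first of them produces the unique strings together with integer indices into them. Handling quoted fields that contain newlines is what makes chunk splitting hard in practice, and this sketch deliberately leaves it out.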
