Let's start making a list and creating JIRAs. In terms of automatic
type inference, those are the main ones. But we will also need to be
able to explicitly indicate that a particular column should be
converted to one of the following (a rough sketch follows the list):

* Dictionary-encoded string (NB we should think about the problem of
evolving dictionaries)
* Date / time / timestamp (with some particular unit like ms or ns)
* Decimal
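
Here is roughly what such per-column overrides could look like in C++.
To be clear, this ConvertOptions struct is hypothetical (we would still
need to design the actual options API); the type factories used below
follow Arrow's C++ conventions:

    #include <map>
    #include <memory>
    #include <string>

    #include <arrow/type.h>  // type factories: dictionary(), timestamp(), ...

    // Hypothetical reader options: explicit type overrides keyed by
    // column name. A column listed here skips automatic inference.
    struct ConvertOptions {
      std::map<std::string, std::shared_ptr<arrow::DataType>> column_types;
    };

    ConvertOptions MakeExampleOptions() {
      ConvertOptions options;
      // Dictionary-encoded string column (int32 indices into utf8 values).
      options.column_types["category"] =
          arrow::dictionary(arrow::int32(), arrow::utf8());
      // Timestamp column with an explicit unit (milliseconds here).
      options.column_types["created_at"] =
          arrow::timestamp(arrow::TimeUnit::MILLI);
      // Decimal column with fixed precision and scale.
      options.column_types["price"] = arrow::decimal128(10, 2);
      return options;
    }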

This is follow-up work, so no need to boil the ocean with an initial patch.

We have plenty of other CSV readers we can look at to evaluate performance.

- Wes

On Sun, Aug 26, 2018 at 4:53 PM, Antoine Pitrou <anto...@python.org> wrote:
>
> Which range of types do we want to support ultimately?
> Right now I have support for booleans, integers, floats, and strings.
> I expect we'll need to optimize number parsing.
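>
> For instance (just a sketch of the general idea, not committed code), a
> hand-rolled integer parser avoids the locale and errno overhead of
> strtoul and rejects overflow explicitly:
>
>     #include <cstddef>
>     #include <cstdint>
>
>     // Parse an unsigned decimal integer from a raw CSV field.
>     // Returns false on empty input, a non-digit byte, or overflow.
>     bool ParseUInt64(const char* s, size_t len, uint64_t* out) {
>       if (len == 0) return false;
>       uint64_t value = 0;
>       for (size_t i = 0; i < len; ++i) {
>         unsigned digit = static_cast<unsigned char>(s[i]) - '0';
>         if (digit > 9) return false;
>         // Check for overflow before accumulating the next digit.
>         if (value > (UINT64_MAX - digit) / 10) return false;
>         value = value * 10 + digit;
>       }
>       *out = value;
>       return true;
>     }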
>
> Regards
>
> Antoine.
>
>
> Le 26/08/2018 à 22:47, Wes McKinney a écrit :
>> I have just created a wiki page to organize work and follow-on
>> projects, as this is likely to be a pretty large project that spans
>> many months of development:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070249
>>
>> You can add JIRAs to the list by applying the "csv" label.
>>
>> - Wes
>>
>> On Sun, Aug 19, 2018 at 4:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>> Hello Antoine and Wes,
>>>
>>> really excited to see this happen. CSVs and co. are the file formats you
>>> never get rid of, so it is really important to have an Arrow reader.
>>> Concerning the custom implementation, I can back this up: while working
>>> on the parquet_arrow reader, I spent quite some time building custom,
>>> optimized paths that produce Arrow columns directly instead of going
>>> through a more Parquet-native intermediate. For example, the methods
>>> suffixed with *spaced in parquet-cpp brought a 2-4x improvement in read
>>> performance compared to the more general implementations.
>>>
>>> Uwe
>>>
>>> On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
>>>> hi all,
>>>>
>>>> Early in the project I created the issue
>>>>
>>>> https://issues.apache.org/jira/browse/ARROW-25
>>>>
>>>> about creating a high performance CSV file reader that returns Arrow
>>>> record batches. Many data systems have invested significant energies
>>>> in solving this problem, so why would we build Yet Another CSV Reader
>>>> in Apache Arrow? I originally wrote pandas.read_csv, for example.
>>>>
>>>> Well, there are in fact some really good reasons.
>>>>
>>>> 1) There have been a number of advances in designs for CSV readers
>>>> that leverage multiple cores for better performance, for example:
>>>>
>>>> * https://github.com/wiseio/paratext
>>>> * the data.table::fread function in R
>>>>
>>>> and others. Many existing CSV readers can and should be rearchitected
>>>> to take advantage of these designs.
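>>>>
>>>> To illustrate the core chunking idea (a toy sketch, not tied to any
>>>> of those projects, and deliberately ignoring newlines inside quoted
>>>> fields): split the input at row boundaries near N equal offsets,
>>>> then hand each chunk to a worker thread.
>>>>
>>>>     #include <cstddef>
>>>>     #include <string>
>>>>     #include <thread>
>>>>     #include <vector>
>>>>
>>>>     // Count rows per chunk in parallel. find('\n', pos) snaps each
>>>>     // approximate split point forward to the next row boundary, so
>>>>     // the half-open chunk ranges never split a row.
>>>>     std::vector<size_t> CountRowsParallel(const std::string& data,
>>>>                                           int num_threads) {
>>>>       std::vector<size_t> bounds{0};
>>>>       for (int i = 1; i < num_threads; ++i) {
>>>>         size_t pos = data.find('\n', data.size() * i / num_threads);
>>>>         bounds.push_back(pos == std::string::npos ? data.size() : pos + 1);
>>>>       }
>>>>       bounds.push_back(data.size());
>>>>
>>>>       std::vector<size_t> counts(num_threads, 0);
>>>>       std::vector<std::thread> workers;
>>>>       for (int i = 0; i < num_threads; ++i) {
>>>>         workers.emplace_back([&, i] {
>>>>           for (size_t p = bounds[i]; p < bounds[i + 1]; ++p) {
>>>>             if (data[p] == '\n') ++counts[i];
>>>>           }
>>>>         });
>>>>       }
>>>>       for (auto& t : workers) t.join();
>>>>       return counts;
>>>>     }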
>>>>
>>>> 2) The hot paths in CSV parsing tend to be highly particular to the
>>>> target data structures. Utilizing intermediate data structures hurts
>>>> performance in a meaningful way. Also, the orientation (columnar vs.
>>>> non-columnar) impacts the general design of the computational hot
>>>> paths.
>>>>
>>>> Other computational choices, such as how to handle erroneous values or
>>>> nulls, or whether to dictionary-encode string columns (such as using
>>>> Arrow's dictionary encoding), have an impact on design as well.
>>>>
>>>> Thus, the highest-performance CSV reader must be specialized to the
>>>> Arrow columnar layout in its hot paths.
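>>>>
>>>> Concretely (a minimal sketch of what "specialized" means here): the
>>>> tokenizer hands raw field bytes straight to an Arrow builder, with
>>>> no intermediate row objects or per-field string copies.
>>>>
>>>>     #include <charconv>
>>>>     #include <cstddef>
>>>>     #include <cstdint>
>>>>
>>>>     #include <arrow/builder.h>
>>>>     #include <arrow/status.h>
>>>>
>>>>     // Append one raw CSV field to an int64 column: empty -> null,
>>>>     // otherwise parse in place and append directly to the builder.
>>>>     arrow::Status AppendInt64Field(const char* data, size_t size,
>>>>                                    arrow::Int64Builder* builder) {
>>>>       if (size == 0) {
>>>>         return builder->AppendNull();
>>>>       }
>>>>       int64_t value;
>>>>       auto res = std::from_chars(data, data + size, value);
>>>>       if (res.ec != std::errc() || res.ptr != data + size) {
>>>>         return arrow::Status::Invalid("invalid int64 value");
>>>>       }
>>>>       return builder->Append(value);
>>>>     }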
>>>>
>>>> 3) Many applications spend a lot of their time converting text files
>>>> into tables, so solving the problem well pays long-term dividends.
>>>>
>>>> 4) As a development platform, solving the problem well in Apache Arrow
>>>> will enable many downstream consumers to profit from performance and
>>>> IO gains, and having this critical piece of shared infrastructure in a
>>>> community project will drive contributions back upstream into Arrow.
>>>> For example, we could use this easily in Python, R, and Ruby.
>>>>
>>>> 5) By building inside Arrow we can utilize common interfaces for IO
>>>> and concurrency: file system APIs, memory management (taking
>>>> advantage of our jemalloc infrastructure [1]), on-the-fly
>>>> decompression, asynchronous / buffering input streams, thread
>>>> management, and others.
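>>>>
>>>> For example (a minimal sketch; signatures follow the current C++
>>>> API, so treat the details as approximate), reading a block of a CSV
>>>> file through the shared IO layer and memory pool:
>>>>
>>>>     #include <memory>
>>>>     #include <string>
>>>>
>>>>     #include <arrow/buffer.h>
>>>>     #include <arrow/io/file.h>
>>>>     #include <arrow/status.h>
>>>>
>>>>     // Open a file through Arrow's IO interfaces and read the first
>>>>     // 1 MiB; the buffer is allocated from the common memory pool.
>>>>     arrow::Status ReadHead(const std::string& path) {
>>>>       std::shared_ptr<arrow::io::ReadableFile> file;
>>>>       ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &file));
>>>>       std::shared_ptr<arrow::Buffer> block;
>>>>       ARROW_RETURN_NOT_OK(file->Read(1 << 20, &block));
>>>>       return file->Close();
>>>>     }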
>>>>
>>>> There are probably some other reasons, but these are the main ones I
>>>> think about.
>>>>
>>>> I spoke briefly about the project with Antoine and he has started
>>>> putting together a reader in the C++ codebase:
>>>>
>>>> https://github.com/pitrou/arrow/tree/csv_reader
>>>>
>>>> I'm excited for this project to get off the ground, as it will have a
>>>> lot of user-visible impact and pay dividends for many years. It would
>>>> be great for those who have worked on fast CSV parsing to share their
>>>> experiences and get involved to help make good design choices and take
>>>> advantage of lessons learned in other projects.
>>>>
>>>> - Wes
>>>>
>>>> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
