Which range of types do we ultimately want to support? Right now I have support for booleans, integers, floats, and strings. I expect we'll need to optimize number parsing.
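To make the type question concrete, here is a minimal Python sketch of per-column inference over those four types. It is an illustration only, not the code in the branch: `infer_column_type`, the candidate order, and the rule that empty fields count as nulls are all assumptions made for the example.

```python
def _is_bool(s):
    # Assumption: only the literals "true"/"false" (any case) count as booleans.
    return s.lower() in ("true", "false")


def _is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False


def _is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


# Candidates are tried from narrowest to widest; "string" is the fallback.
CANDIDATES = [("bool", _is_bool), ("int", _is_int), ("float", _is_float)]


def infer_column_type(values):
    """Return the narrowest type name that accepts every non-empty value."""
    non_null = [v for v in values if v != ""]  # empty fields are treated as nulls
    for name, accepts in CANDIDATES:
        if non_null and all(accepts(v) for v in non_null):
            return name
    return "string"


print(infer_column_type(["1", "2", ""]))
print(infer_column_type(["1", "2.5"]))
```

Python's `int()` and `float()` accept more spellings than a strict CSV reader probably should (surrounding whitespace, `1_000`, `nan`), so a real implementation would need its own number parser, which is also where the optimization work would go.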
Regards

Antoine.

On 26/08/2018 at 22:47, Wes McKinney wrote:
> I have just created a wiki page to organize work and follow-on
> projects, as this is likely to be a pretty large project that spans
> many months of development:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070249
>
> You can add JIRAs to the list by applying the "csv" label
>
> - Wes
>
> On Sun, Aug 19, 2018 at 4:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> Hello Antoine and Wes,
>>
>> really excited to see this happen. CSVs and co are the file formats you
>> never get rid of, so it is really important to have an Arrow reader.
>> Concerning the custom implementation, I can further back this: while
>> working on the parquet_arrow reader, I spent quite some time building
>> custom, optimized paths that produce Arrow columns directly instead of
>> using a more parquet-native intermediate. For example, the methods
>> postfixed with *spaced in parquet-cpp brought a 2-4x improvement in read
>> performance compared to the more general implementations.
>>
>> Uwe
>>
>> On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
>>> hi all,
>>>
>>> Early in the project I created the issue
>>>
>>> https://issues.apache.org/jira/browse/ARROW-25
>>>
>>> about creating a high performance CSV file reader that returns Arrow
>>> record batches. Many data systems have invested significant energies
>>> in solving this problem, so why would we build Yet Another CSV Reader
>>> in Apache Arrow? I originally wrote pandas.read_csv, for example.
>>>
>>> Well, there are in fact some really good reasons.
>>>
>>> 1) There have been a number of advances in designs for CSV readers to
>>> leverage multiple cores for better performance, for example
>>>
>>> * https://github.com/wiseio/paratext
>>> * the data.table::fread function in R
>>>
>>> and others. Many existing CSV readers can and should be rearchitected
>>> to take advantage of these designs.
>>>
>>> 2) The hot paths in CSV parsing tend to be highly particular to the
>>> target data structures. Utilizing intermediate data structures hurts
>>> performance in a meaningful way. Also, the orientation (columnar vs.
>>> non-columnar) impacts the general design of the computational hot
>>> paths.
>>>
>>> Other computational choices, such as how to handle erroneous values or
>>> nulls, or whether to dictionary-encode string columns (such as using
>>> Arrow's dictionary encoding), have an impact on design as well.
>>>
>>> Thus, the highest performance CSV reader must be specialized to the
>>> Arrow columnar layout in its hot paths.
>>>
>>> 3) Many applications spend a lot of their time converting text files
>>> into tables. So solving the problem well pays long term dividends.
>>>
>>> 4) As a development platform, solving the problem well in Apache Arrow
>>> will enable many downstream consumers to profit from performance and
>>> IO gains, and having this critical piece of shared infrastructure in a
>>> community project will drive contributions back upstream into Arrow.
>>> For example, we could use this easily in Python, R, and Ruby.
>>>
>>> 5) By building inside Arrow we can utilize common interfaces for IO
>>> and concurrency: file system APIs, memory management (and taking
>>> advantage of our jemalloc infrastructure [1]), on-the-fly
>>> decompression, asynchronous / buffering input streams, thread
>>> management, and others.
>>>
>>> There are probably some other reasons, but these are the main ones I
>>> think about.
>>>
>>> I spoke briefly about the project with Antoine and he has started
>>> putting together the start of a reader in the C++ codebase:
>>>
>>> https://github.com/pitrou/arrow/tree/csv_reader
>>>
>>> I'm excited for this project to get off the ground as it will have a
>>> lot of user-visible impact and pay dividends for many years. It would
>>> be great for those who have worked on fast CSV parsing to share their
>>> experiences and get involved to help make good design choices and take
>>> advantage of lessons learned in other projects.
>>>
>>> - Wes
>>>
>>> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/
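
Points 1 and 2 in the quoted message (using multiple cores, and parsing straight into a columnar target rather than through an intermediate structure) can be sketched roughly as follows. This is a simplified Python illustration, not the design of paratext, fread, or the Arrow reader; the function names are hypothetical, and it assumes UTF-8 input, no header row, no quoted fields containing commas or newlines, and the same number of fields on every row.

```python
from concurrent.futures import ThreadPoolExecutor


def split_at_newlines(data: bytes, n_chunks: int):
    """Cut `data` into at most `n_chunks` pieces, each ending on a row boundary."""
    chunks, start = [], 0
    target = max(1, -(-len(data) // n_chunks))  # ceiling division
    while start < len(data):
        # Look for the first newline at or after the target cut position.
        end = data.find(b"\n", min(start + target, len(data)) - 1)
        end = len(data) if end == -1 else end + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def parse_chunk(chunk: bytes, n_cols: int):
    """Append each field directly to its column; no list of rows is retained."""
    columns = [[] for _ in range(n_cols)]
    for line in chunk.decode("utf-8").splitlines():
        if not line:
            continue
        for column, field in zip(columns, line.split(",")):
            column.append(field)
    return columns


def read_columns(data: bytes, n_cols: int, n_workers: int = 4):
    """Parse chunks concurrently, then concatenate column pieces in chunk order."""
    chunks = split_at_newlines(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in input order, so row order is preserved.
        parts = list(pool.map(lambda c: parse_chunk(c, n_cols), chunks))
    return [[v for part in parts for v in part[i]] for i in range(n_cols)]


print(read_columns(b"1,a\n2,b\n3,c\n", n_cols=2))
```

In CPython the threads above will not speed up pure-Python parsing because of the global interpreter lock; the sketch only shows the shape of the approach (split on row boundaries, parse each chunk into columns, concatenate). Handling quoted newlines is the hard part of the chunking step and is deliberately left out here.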