Ok, interesting. Thanks Wes, that does make it clear.
For other readers, this GitHub issue is related: https://github.com/apache/arrow/issues/2189#issuecomment-402874836

On 7/6/18, 10:25 AM, "Wes McKinney" <wesmck...@gmail.com> wrote:

>hi Alex,
>
>One of the goals of Apache Arrow is to define an open standard for
>in-memory columnar data (which may be called "tables" or "data frames"
>in some domains). Among other things, the Arrow columnar format is
>optimized for memory efficiency and analytical processing performance
>on very large (even larger-than-RAM) data sets.
>
>The way to think about it is that pandas has its own in-memory
>representation for columnar data, but it is "proprietary" to pandas.
>To make use of pandas's analytical facilities, you must convert data
>to pandas's memory representation. As an example, pandas represents
>strings as NumPy arrays of Python string objects, which is very
>wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>inside pandas, but this would require a lot of work to port algorithms
>to run against Arrow: https://github.com/xhochy/fletcher
>
>We are working to develop the standard data frame type operations as
>reusable libraries within this project, and these will run natively
>against the Arrow columnar format. This is a big project; we would
>love to have you involved with the effort. One of the reasons I have
>spent so much of my time the last few years on this project is that I
>believe it is the best path to build a faster, more efficient
>pandas-like library for data scientists.
>
>best,
>Wes
>
>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
>> Hello all.
>>
>> I'm confused about the current level of integration between pandas and
>> pyarrow. Am I correct in understanding that currently I'll need to convert
>> pyarrow Tables to pandas DataFrames in order to use most of the pandas
>> features? By "pandas features" I mean every day slicing and dicing of data:
>> merge, filtering, melt, spread, etc.
>>
>> I have a dataframe which starts out from small files (< 1GB) and quickly
>> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm
>> interested in whether arrow can provide a better, optimized dataframe.
>>
>> Thanks.