hi Alex,

One of the goals of Apache Arrow is to define an open standard for in-memory columnar data (which may be called "tables" or "data frames" in some domains). Among other things, the Arrow columnar format is optimized for memory efficiency and analytical processing performance on very large (even larger-than-RAM) data sets.
The way to think about it is that pandas has its own in-memory representation for columnar data, but it is "proprietary" to pandas: to make use of pandas's analytical facilities, you must first convert your data into pandas's memory representation. As an example, pandas represents strings as NumPy arrays of Python string objects, which is very wasteful. Uwe Korn recently demonstrated an approach to using Arrow-backed columns inside pandas (https://github.com/xhochy/fletcher), but porting pandas's algorithms to run against Arrow memory would require a lot of work.

We are working to develop the standard data frame operations as reusable libraries within this project, and these will run natively against the Arrow columnar format. This is a big project; we would love to have you involved in the effort. One of the reasons I have spent so much of my time on this project over the last few years is that I believe it is the best path to a faster, more efficient pandas-like library for data scientists.

best,
Wes

On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
> Hello all.
>
> I'm confused about the current level of integration between pandas and
> pyarrow. Am I correct in understanding that currently I'll need to convert
> pyarrow Tables to pandas DataFrames in order to use most of the pandas
> features? By "pandas features" I mean everyday slicing and dicing of data:
> merge, filtering, melt, spread, etc.
>
> I have a dataframe which starts out from small files (< 1GB) and quickly
> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm
> interested in whether Arrow can provide a better, optimized dataframe.
>
> Thanks.