In case it's interesting, I gave a talk a little over 3 years ago about this theme ("we all have data frames, but they're all different inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly. I mentioned the desire for an "Apache-licensed, community standard C/C++ data frame that we can all use".
On Fri, Jul 6, 2018 at 1:53 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
> Ok, interesting. Thanks Wes, that does make it clear.
>
> For other readers, this GitHub issue is related:
> https://github.com/apache/arrow/issues/2189#issuecomment-402874836
>
> On 7/6/18, 10:25 AM, "Wes McKinney" <wesmck...@gmail.com> wrote:
>
>> hi Alex,
>>
>> One of the goals of Apache Arrow is to define an open standard for
>> in-memory columnar data (which may be called "tables" or "data frames"
>> in some domains). Among other things, the Arrow columnar format is
>> optimized for memory efficiency and analytical processing performance
>> on very large (even larger-than-RAM) data sets.
>>
>> The way to think about it is that pandas has its own in-memory
>> representation for columnar data, but it is "proprietary" to pandas.
>> To make use of pandas's analytical facilities, you must convert data
>> to pandas's memory representation. As an example, pandas represents
>> strings as NumPy arrays of Python string objects, which is very
>> wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>> inside pandas, but this would require a lot of work to port algorithms
>> to run against Arrow: https://github.com/xhochy/fletcher
>>
>> We are working to develop the standard data frame operations as
>> reusable libraries within this project, and these will run natively
>> against the Arrow columnar format. This is a big project; we would
>> love to have you involved with the effort. One of the reasons I have
>> spent so much of my time on this project over the last few years is
>> that I believe it is the best path to build a faster, more efficient
>> pandas-like library for data scientists.
>>
>> best,
>> Wes
>>
>> On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
>>> Hello all.
>>>
>>> I'm confused about the current level of integration between pandas
>>> and pyarrow. Am I correct in understanding that currently I'll need
>>> to convert pyarrow Tables to pandas DataFrames in order to use most
>>> of the pandas features? By "pandas features" I mean everyday slicing
>>> and dicing of data: merge, filtering, melt, spread, etc.
>>>
>>> I have a data set which starts out as small files (< 1 GB) and
>>> quickly explodes into dozens of gigabytes of memory in a pandas
>>> DataFrame. I'm interested in whether Arrow can provide a better,
>>> optimized data frame.
>>>
>>> Thanks.