In case it's interesting, I gave a talk a little over 3 years ago
about this theme ("we all have data frames, but they're all different
inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly.
I mentioned the desire for an "Apache-licensed, community standard
C/C++ data frame that we can all use".

On Fri, Jul 6, 2018 at 1:53 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
> Ok, interesting. Thanks Wes, that does make it clear.
>
>
> For other readers, this GitHub issue is related:
> https://github.com/apache/arrow/issues/2189#issuecomment-402874836
>
>
>
> On 7/6/18, 10:25 AM, "Wes McKinney" <wesmck...@gmail.com> wrote:
>
>>hi Alex,
>>
>>One of the goals of Apache Arrow is to define an open standard for
>>in-memory columnar data (which may be called "tables" or "data frames"
>>in some domains). Among other things, the Arrow columnar format is
>>optimized for memory efficiency and analytical processing performance
>>on very large (even larger-than-RAM) data sets.
>>
>>The way to think about it is that pandas has its own in-memory
>>representation for columnar data, but it is "proprietary" to pandas.
>>To make use of pandas's analytical facilities, you must convert data
>>to pandas's memory representation. As an example, pandas represents
>>strings as NumPy arrays of Python string objects, which is very
>>wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>>inside pandas, but this would require a lot of work to port algorithms
>>to run against Arrow: https://github.com/xhochy/fletcher
>>
>>We are working to develop the standard data frame operations as
>>reusable libraries within this project, and these will run natively
>>against the Arrow columnar format. This is a big project; we would
>>love to have you involved with the effort. One of the reasons I have
>>spent so much of my time over the last few years on this project is
>>that I believe it is the best path to building a faster, more
>>efficient pandas-like library for data scientists.
>>
>>best,
>>Wes
>>
>>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <bucha...@ohsu.edu> wrote:
>>> Hello all.
>>>
>>> I'm confused about the current level of integration between pandas and
>>> pyarrow. Am I correct in understanding that currently I'll need to convert
>>> pyarrow Tables to pandas DataFrames in order to use most pandas features?
>>> By "pandas features" I mean everyday slicing and dicing of data: merge,
>>> filtering, melt, spread, etc.
>>>
>>> I have a dataset that starts out as small files (< 1 GB) and quickly
>>> explodes into dozens of gigabytes of memory as a pandas DataFrame. I'm
>>> interested in whether Arrow can provide a better, optimized dataframe.
>>>
>>> Thanks.
>>>
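
For readers following along, here is a minimal sketch (not from the
thread) of the conversion boundary discussed above: pyarrow holds the
columnar data, but the everyday operations Alex mentions (merge,
filtering, melt) still run on a pandas DataFrame after calling
to_pandas(). The file name and the "key", "label", and "value" columns
are made up for illustration.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a (hypothetical) Parquet file into Arrow columnar memory.
table = pq.read_table("events.parquet")        # pyarrow.Table

# To use pandas's analytics, convert into pandas's own representation...
df = table.to_pandas()

# ...run the usual pandas operations on the DataFrame...
lookup = pd.DataFrame({"key": [1, 2, 3], "label": ["a", "b", "c"]})
summary = (
    df[df["value"] > 0]          # filter
    .merge(lookup, on="key")     # join
    .groupby("label", as_index=False)["value"]
    .sum()
)

# ...and convert back to Arrow to share the result or write it out.
result = pa.Table.from_pandas(summary)
pq.write_table(result, "summary.parquet")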

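And a rough illustration (again, not from the thread; the sample data
is invented) of Wes's point about string storage: pandas at the time
stored strings as a NumPy array of Python str objects, one heap object
per cell, while Arrow packs the same strings into a contiguous values
buffer plus offsets.

import pandas as pd
import pyarrow as pa

words = ["apache", "arrow", "columnar", "memory"] * 250_000  # 1M short strings

s = pd.Series(words)                      # object dtype: one Python object per cell
arr = pa.array(words, type=pa.string())   # Arrow string array: values buffer + offsets

print("pandas object column:", s.memory_usage(deep=True), "bytes")
print("Arrow string array:  ", arr.nbytes, "bytes")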