Re: pandas and Apache Arrow in context

Wes McKinney Mon, 22 Feb 2016 14:06:07 -0800

hi Stuart,

Currently pandas and NumPy only support flat, non-nested data. Nested
data means columns whose value types include arrays, structs, maps, and
unions. Supporting these types in Arrow enables you to analyze JSON-like
data natively in-memory without pre-flattening or normalization.
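
As a rough illustration of the difference (a minimal pure-Python sketch,
not Arrow's or pandas' actual API; the records, field names, and helper
functions are made up for this example), here is what pre-flattening looks
like next to a nested columnar encoding that keeps a list column as offsets
plus one flat values array:

```python
# Hypothetical JSON-like records; "user" and "tags" are illustrative names.
records = [
    {"user": "a", "tags": ["x", "y"]},
    {"user": "b", "tags": []},
    {"user": "c", "tags": ["z"]},
]


def flatten(records):
    """Pre-flatten to one row per (user, tag), as a flat table requires."""
    rows = []
    for rec in records:
        # A record with no tags still needs a row, so pad with None.
        for tag in rec["tags"] or [None]:
            rows.append((rec["user"], tag))
    return rows


def to_list_column(lists):
    """Encode a column of variable-length lists as offsets plus flat values."""
    offsets, values = [0], []
    for item in lists:
        values.extend(item)
        offsets.append(len(values))
    return offsets, values


def get_list(offsets, values, i):
    """Recover the i-th list from the columnar encoding by slicing."""
    return values[offsets[i]:offsets[i + 1]]


flat_rows = flatten(records)
offsets, values = to_list_column([rec["tags"] for rec in records])
```

The flattened form repeats the user value once per tag and needs a
placeholder for the empty list, while the offsets form keeps one entry per
record and stores the tag values contiguously.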


There's also an open question about how to handle nested data results
from SQL engines like Spark SQL, Drill, and Impala, since there are
currently no native / C-level data structures (outside of pure Python
data structures) in wide use to hold the data arriving via RPC. Arrow
serves to fill this need.

- Wes

On Mon, Feb 22, 2016 at 1:51 PM, Stuart Axelbrooke
<stu...@axelbrooke.com> wrote:
> Hey Wes,
>
> Very exciting to see things moving along on the Python front.  As you state
> in your post, fast, ubiquitous columnar data will be a great foundation,
> especially for more modern data processing and ETL tools.  Though I am a
> bit curious what you mean by nested columnar data...
>
> Thanks,
> Stuart
>
> On Mon, Feb 22, 2016 at 10:25 AM Wes McKinney <w...@cloudera.com> wrote:
>
>> hi all,
>>
>> I did a little bit of analysis of the costs of serialization bottlenecks in
>> data access for Python pandas users and how (at a high level, no perf
>> numbers yet!) Apache Arrow will help:
>>
>> http://wesmckinney.com/blog/pandas-and-apache-arrow/
>>
>> Feedback and comments welcome.
>>
>> cheers,
>> Wes
>>
