hi Stuart,

Currently pandas and NumPy only support flat, non-nested data. Nested data means columns whose value types are arrays, structs, maps, or unions. Supporting these types natively lets you analyze JSON-like data in memory without pre-flattening or normalizing it.
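
To make "nested" a bit more concrete, here is a rough sketch in plain Python (standard library only, not the Arrow API) of how an array-valued column can be stored as two flat buffers: a list of offsets plus a single contiguous list of values. This is the general idea behind columnar list types; the function names and the sample records are made up for illustration.

```python
# Illustrative sketch only: hypothetical helper names, not any library's API.

def encode_list_column(rows):
    """Encode a column of variable-length lists as (offsets, values).

    Row i occupies values[offsets[i]:offsets[i + 1]], so the nested
    column is stored as two flat sequences instead of a list of lists.
    """
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Recover row i of the nested column from the flat buffers."""
    return values[offsets[i]:offsets[i + 1]]


# JSON-like records with an array-valued field (made-up sample data).
records = [
    {"user": "a", "tags": [1, 2, 3]},
    {"user": "b", "tags": []},
    {"user": "c", "tags": [4]},
]

offsets, values = encode_list_column([r["tags"] for r in records])
print(offsets)                      # [0, 3, 3, 4]
print(values)                       # [1, 2, 3, 4]
print(get_row(offsets, values, 2))  # [4]
```

The point is that the records never have to be flattened into one row per tag; the nesting is carried by the offsets while the values stay contiguous.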

There's also an open question about how to handle nested data results from SQL engines like Spark SQL, Drill, and Impala, since there are currently no native / C-level data structures (outside of pure Python data structures) in wide use to hold the data arriving via RPC. Arrow serves to fill this need.

- Wes

On Mon, Feb 22, 2016 at 1:51 PM, Stuart Axelbrooke <stu...@axelbrooke.com> wrote:
> Hey Wes,
>
> Very exciting to see things moving along on the Python front. As you state
> in your post, fast, ubiquitous columnar data will be a great foundation,
> especially for more modern data processing and ETL tools. Though I am a
> bit curious what you mean by nested columnar data...
>
> Thanks,
> Stuart
>
> On Mon, Feb 22, 2016 at 10:25 AM Wes McKinney <w...@cloudera.com> wrote:
>
>> hi all,
>>
>> I did a little bit of analysis of the costs of serialization bottlenecks in
>> data access for Python pandas users and how (at a high level, no perf
>> numbers yet!) Apache Arrow will help:
>>
>> http://wesmckinney.com/blog/pandas-and-apache-arrow/
>>
>> Feedback and comments welcome.
>>
>> cheers,
>> Wes
>>