Hi,

this sounds really promising. I'm curious how JS handles StructArrays, but
in theory it should work.

Best regards,
Adam Lippai

On Fri, Oct 30, 2020 at 3:07 PM Benjamin Kietzman <bengil...@gmail.com>
wrote:

> Hi Adam,
>
> Arrow does not support nesting tables inside other tables. However, a
> record batch is interchangeable with a struct array, so you could achieve
> something similar by converting from a RecordBatch with columns `...c` to
> a StructArray with child arrays `...c`. In C++ we have
> /RecordBatch::{To,From}StructArray/ for this purpose. Only
> from_struct_array is exposed in Python, but to_struct_array would be a
> simple change to make.
>
> Grouping could then be emulated by sorting the StructArray and wrapping it
> in a ListArray so that each list item contains the rows of a group. (This
> is similar to Impala's interpretation of list and map columns as
> persistent joins/groupings:
> https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_queries
> )
>
> Would that be sufficient for your use case?
>
> On Thu, Oct 29, 2020 at 5:19 PM Adam Lippai <a...@rigo.sk> wrote:
>
> > This is what I want to extend for multiple tables:
> >
> >
> > https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
> > I would need to come up with a custom binary wrapper for multiple
> > serialized pyarrow tables, and since Arrow supports hierarchical data to
> > some level, I was looking for built-in support of nested tables.
> > I understand this might not be available at the API level.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai <a...@rigo.sk> wrote:
> >
> > > If I have a DataFrame with columns Date, Category, Value and group by
> > > Category, I'll have multiple DataFrames with Date, Value columns.
> > > The result of the groupby is a DataFrameGroupBy, which can't be
> > > serialized. This is why I tried to assemble a nested DataFrame instead
> > > (like the one in the SO link previously), but that doesn't work either.
> > >
> > > As Apache Arrow JS doesn't support groupby (processing the original DF
> > > on the client side), I was thinking of pushing the groupby operation to
> > > the server side (pyarrow), doing the groupby in pandas before
> > > serializing and sending it to the client.
> > > I was wondering whether this (nested Arrow tables) is a supported
> > > feature or not (by calling chained table.toArray() or a similar
> > > solution).
> > > Currently I process it in pure JS; it's not that ugly, but not really
> > > idiomatic either. The lack of a Categorical data type and processing it
> > > row by row certainly has its performance price.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > >> Can you give a more specific example of what kind of hierarchical data
> > >> you want to serialize? (eg the output of a groupby operation in pandas
> > >> typically is still a dataframe that can be converted to pyarrow and
> > >> serialized).
> > >>
> > >> In general, for hierarchical data we have the nested data types (eg
> > >> struct type when you nest "multiple columns in a single column").
> > >>
> > >> Joris
> > >>
> > >>
> > >> On Thu, 29 Oct 2020 at 15:29, Adam Lippai <a...@rigo.sk> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > is there a way to serialize (IPC) hierarchical tabular data (e.g. the
> > >> > output of pandas groupby) in Python?
> > >> > I've tried to call pa.ipc.serialize_pandas() on this example, but it
> > >> > throws an error:
> > >> >
> > >> > https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
> > >> >
> > >> > Best regards,
> > >> > Adam Lippai
> > >>
> > >
> >
>
