Hi, this sounds really promising. I'm curious how JS handles StructArrays, but in theory it should work.

Best regards,
Adam Lippai

On Fri, Oct 30, 2020 at 3:07 PM Benjamin Kietzman <bengil...@gmail.com> wrote:

> Hi Adam,
>
> Arrow does not support nesting tables inside other tables. However, a
> record batch is interchangeable with a struct array, so you could achieve
> something similar by converting from a RecordBatch with columns `...c` to
> a StructArray with child arrays `...c`. In C++ we have
> /RecordBatch::{To,From}StructArray/ for this purpose. Only
> from_struct_array is exposed in Python, but to_struct_array would be a
> simple change to make.
>
> Grouping could then be emulated by sorting the StructArray and wrapping it
> in a ListArray so that each list item contains the rows of a group. (This
> is similar to Impala's interpretation of list and map columns as
> persistent joins/groupings:
> https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_queries
> )
>
> Would that be sufficient for your use case?
>
> On Thu, Oct 29, 2020 at 5:19 PM Adam Lippai <a...@rigo.sk> wrote:
>
> > This is what I want to extend for multiple tables:
> >
> > https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
> >
> > I would need to come up with a custom binary wrapper for multiple
> > serialized pyarrow tables, and since Arrow supports hierarchical data to
> > some level, I was looking for built-in support for nested tables.
> > I understand this might not be available at the API level.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai <a...@rigo.sk> wrote:
> >
> > > If I have a DataFrame with columns Date, Category, Value and group by
> > > Category, I'll have multiple DataFrames with Date, Value columns.
> > > The result of the groupby is a DataFrameGroupBy, which can't be
> > > serialized.
> > > This is why I tried to assemble a nested DataFrame instead (like the
> > > one in the SO link previously), but that doesn't work either.
> > >
> > > As Apache Arrow JS doesn't support groupby (processing the original DF
> > > on the client side), I was thinking of pushing the groupby operation
> > > to the server side (pyarrow), doing the groupby in pandas before
> > > serializing and sending it to the client.
> > > I was wondering whether this (nested Arrow tables) is a supported
> > > feature or not (by calling chained table.toArray() or a similar
> > > solution).
> > > Currently I process it in pure JS; it's not that ugly, but not really
> > > idiomatic either. The lack of a Categorical data type and processing
> > > it row by row certainly has its performance price.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > > > Can you give a more specific example of what kind of hierarchical
> > > > data you want to serialize? (e.g. the output of a groupby operation
> > > > in pandas typically is still a dataframe that can be converted to
> > > > pyarrow and serialized).
> > > >
> > > > In general, for hierarchical data we have the nested data types
> > > > (e.g. struct type when you nest "multiple columns in a single
> > > > column").
> > > >
> > > > Joris
> > > >
> > > >
> > > > On Thu, 29 Oct 2020 at 15:29, Adam Lippai <a...@rigo.sk> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > is there a way to serialize (IPC) hierarchical tabular data (e.g.
> > > > > the output of a pandas groupby) in Python?
> > > > > I've tried to call pa.ipc.serialize_pandas() on this example, but
> > > > > it throws an error:
> > > > > https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
> > > > >
> > > > > Best regards,
> > > > > Adam Lippai
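
For illustration only, the sort-then-wrap grouping suggested in the thread can be sketched in plain Python, without any Arrow API. The function name `group_as_list_of_structs` and the sample data are hypothetical; the `offsets` list plays the role of a ListArray's offsets and the `children` dict plays the role of the StructArray's child arrays. This is a sketch of the layout under those assumptions, not how pyarrow implements it.

```python
from itertools import groupby


def group_as_list_of_structs(columns, key):
    """Emulate grouping as a list<struct> layout using plain Python lists.

    `columns` maps column name -> list of values (all the same length).
    Returns (keys, offsets, children): group i covers child rows
    offsets[i]:offsets[i + 1], mirroring a list array over a struct array.
    """
    n = len(columns[key])
    # Stable sort of row indices by the key, so each group's rows are contiguous.
    order = sorted(range(n), key=lambda i: columns[key][i])
    # Child "struct" columns: every non-key column, reordered by the sort.
    children = {
        name: [values[i] for i in order]
        for name, values in columns.items()
        if name != key
    }
    # One offset per group boundary; the first offset is always 0.
    keys, offsets = [], [0]
    for k, run in groupby(order, key=lambda i: columns[key][i]):
        keys.append(k)
        offsets.append(offsets[-1] + sum(1 for _ in run))
    return keys, offsets, children


# Hypothetical sample data matching the Date / Category / Value example.
columns = {
    "Date": ["2020-10-01", "2020-10-01", "2020-10-02", "2020-10-02"],
    "Category": ["b", "a", "a", "b"],
    "Value": [1.0, 2.0, 3.0, 4.0],
}
keys, offsets, children = group_as_list_of_structs(columns, "Category")

# Rows of each group are the slices between consecutive offsets.
for i, k in enumerate(keys):
    lo, hi = offsets[i], offsets[i + 1]
    print(k, children["Date"][lo:hi], children["Value"][lo:hi])
```

On the client side, each group is then just a slice of the child columns, so no per-row regrouping is needed after deserialization.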