Hey Weston,

Do you have any public code examples I could take a look at? This does sound very related to what I am doing.

One particular question I have related to grouping is how you define row grouping. Column grouping is fairly simple, I think: you can just define a Struct that tells you how columns of data are grouped. But how would you go about grouping rows of data? For example:

User Table
First Name | Last Name | Country | State | City | Occupation
// some data

I have thought of basically two ways to do this.

One way is to send some metadata array, i.e. groupBy, that denotes how the data should be grouped, with a simple algorithm, maybe something like [country, state, city]. But then you would need to store some mapping from a given rowIndex to its child rows based on that algorithm. And I think this would require all the data to be available to do the grouping.

The other way is defining the structure of the data, maybe something like (this could be entirely wrong, I am new to Arrow, sorry)

list<struct<country, list<struct<state, list<struct<city, list<struct<firstName, lastName, occupation>>>>>>>>

but basically the idea would be that if you were to retrieve the data for a given index of, let’s say, a state, it would return all the cities and vectors of data related to that given state.

I also don’t know if this is a limitation of my understanding of Arrow or of the ArrowJs SDK library; this might be something very easy that I am just not seeing.

-Michael

From: Weston Pace <weston.p...@gmail.com>
Date: Friday, February 26, 2021 at 9:34 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Cc: Michael Lavina <michael.lav...@factset.com>
Subject: Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

I used Arrow for this purpose in the past. I don't have much to add but just a few thoughts off the top of my head...

* The line between data and metadata can be blurry - For most measurements we were able to store the "expected distribution" as metadata (e.g. this measurement should have an expected value of 10 +/- 3) and that could be used for drawing limit lines.
For some measurements, however, the common practice in place was to store the upper/lower limit as separate columns because they often changed depending on the various independent variables. In that case the same "concept" (limit) might be stored in data or metadata.

* Distinction between "data" and a "chart" - For us, we introduced a separate representation called the "chart" between the data and the rendering layer. So, using that limit line example from before, if we wanted to plot a histogram of some column then we would create a bar chart from the column. This bar chart itself was also an array of numbers but, since these arrays were much smaller (one per bin, hard limit to bin count in the thousands based on # of pixels in display), and the structure was much more deeply nested, we ended up just using JSON for charts. The "limit" metadata belonged to the data and it was translated into a vertical line element as part of the chart.

* Processing layer - For us it was too expensive to send the data across the Internet for display. So the conversion from data -> chart happened in the datacenter, close to the actual data. The JS UI was simply responsible for chart -> pixels (well, SVG). It sounds like you plan on doing the processing in JS. This can work, I'm just tossing out alternatives to think about. You can even have a hybrid model where some initial filtering happens in the datacenter and then chart calculation / rendering happens in JS.

* Expressions for group/split - Arrow expressions / compute are starting to become available (and more work is being done on in-arrow query engines). These can be very helpful for things like grouping or splitting. For example, if you want to plot two line charts, one for model X and one for model Y, then you can define your split using expressions. Unfortunately, these are pretty big features and I don't think they are in the JS library.
However, the existing C++/Rust work could serve as examples for how you might want to tackle this. You will need a fair amount of compute to go from data to chart (histograms, averages, standard deviations, etc.). In my case I used pandas pretty extensively for this since the Arrow compute features didn't exist yet. There are some JS libraries for this (e.g. d3) so you can probably investigate that avenue as well.

On Fri, Feb 26, 2021 at 12:05 PM Paul Taylor <ptay...@apache.org> wrote:
>
> Hi Michael,
>
> The answer to your question about metadata will likely be
> application-specific.
>
> For small amounts of metadata (e.g. communicating a bounding box of
> included geometry), there isn't much room for optimization, so a string
> could be fine.
>
> For larger amounts of metadata (or other constraints, like if the metadata
> needs to be constantly modified independent of the data), custom encodings
> or a second service and/or arrow table of the metadata could be the way to
> go.
>
> The metadata keys/values are UTF-8 strings, so nothing should prevent you
> from stuffing a base64-encoded protobuf in there.
>
> As for whether the library is maintained -- yes it is, but lately I've only
> had time to work on bug fixes or features required to maintain parity with
> the spec and other libs.
>
> I will be using Arrow JS in my work again soon, and that could justify more
> "quality of life" improvements again, but without other maintainers jumping
> in to contribute or needing it for my work, those things don't get done.
>
> I'd be happy to do a call with you or your team to give a short overview
> and introduction to the JS lib. You can also email me directly or in the
> #arrow-js channel on the-asf.slack.com with any questions.
>
> Best,
> Paul
>
> On Fri, Feb 26, 2021 at 1:47 PM Michael Lavina <michael.lav...@factset.com>
> wrote:
>
> > Hey Neal,
> >
> > Thanks for the response and I am glad I am using this correctly.
> > I have never really used email servers so hopefully this works.
> >
> > That’s exactly what I was thinking of doing: creating a standard
> > metadata schema built on top of Apache Arrow with some predefined user
> > types.
> >
> > I guess I was just wondering if I was trying to use a screwdriver as a
> > hammer. It can work because we are using the metadata and that could be
> > anything, but maybe, like you said, we should be creating a separate
> > standard entirely for defining the schema to render tables instead of
> > defining it within Arrow.
> >
> > Does it defeat the value of Arrow if we are sending the data using buffers
> > and streams plus a giant string of stringified metadata, when I could
> > maybe define the metadata in protobuf binary separately?
> >
> > In addition, I was curious whether, with all these visualization tools,
> > someone has already developed a standard metadata for Arrow to help with
> > rendering. Stuff like how to denote grouping of data, relationships
> > between columns and hidden information.
> >
> > -Michael
> >
> > From: Neal Richardson <neal.p.richard...@gmail.com>
> > Date: Friday, February 26, 2021 at 1:38 PM
> > To: dev <dev@arrow.apache.org>
> > Subject: Re: [JS] Exploring usage of apache arrow at my company for
> > complex table rendering
> >
> > The Arrow IPC specification allows for custom metadata in both the Schema
> > and the individual Fields:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#schema-message
> >
> > Might that work for you?
> > Another alternative would be to track your
> > metadata in a separate object outside of the Arrow data.
> >
> > Neal
> >
> > On Fri, Feb 26, 2021 at 5:02 AM Michael Lavina <michael.lav...@factset.com>
> > wrote:
> >
> > > Hello Everyone,
> > >
> > > Some background. My name is Michael and I work at FactSet, which if you
> > > use Arrow you may have heard of because one of our architects did a talk
> > > on using Arrow and Dremio.
> > >
> > > https://hello.dremio.com/eliminate-data-transfer-bottlenecks-with-apache-arrow-flight.html?utm_medium=social-free&utm_source=linkedin&utm_term=na&utm_content=na&utm_campaign=eliminate-data-transfer-bottlenecks-with-apache-arrow-flight
> > >
> > > His team has decided to use Arrow as a tabular data interchange format.
> > > Other teams are doing other things. We are working on standardizing our
> > > tabular data interchange format at our company.
> > >
> > > We have our own open-sourced columnar based schema defined in protobuf.
> > >
> > > https://github.com/factset/stachschema
> > >
> > > We looked into Apache Arrow a few years ago, but decided not to use it as
> > > it was not mature enough at the time and we had two specific requirements:
> > >
> > > 1) We needed this data not just for analytics but for rendering as well,
> > > and rendering requires a lot more complicated information, such as
> > > understanding the type of data and the relationships between data,
> > > i.e. grouping.
> > >
> > > 2) We need SDKs that support TypeScript/JavaScript, both browser and Node,
> > > and that support both creating and consuming Arrow.
> > >
> > > Now that Apache Arrow is more mature and stabilized, i.e. the schema and
> > > SDKs are post 1.x, we are looking into it again.
> > >
> > >    1. We are thinking of defining specific metadata, in a similar way to
> > >    what we do for STACH, that lets us define some rendering specifics,
> > >    e.g. adding a metadata entry to a Field Schema called isHidden to
> > >    denote whether we should render the data column or not.
> > >    2. It seems like there is a well developed JavaScript SDK that we can
> > >    use. I am still reading the source code and the Observable articles to
> > >    truly understand how it works.
> > >       1. I read in one of the issues that the JS library might be out of
> > >       sync, so do people know how actively that repo is maintained?
> > >       2. If there needs to be work done, I think we would be able to help
> > >       if we had some help getting started with understanding that repo.
> > >
> > > If possible we would be interested to continue to chat about the above
> > > ideas, get more information about whether Apache Arrow is right for the
> > > job, and hear if there is already discussion of other people using Arrow
> > > for rendering in addition to analytics.
> > >
> > > To clarify what I mean by existing render technologies: I know stuff like
> > > Falcon and Perspective exist, but those seem to be for basic table
> > > rendering for simple tables. I mean to create a superset of Arrow by
> > > defining metadata that allows for complex nested headers and nested rows.
> > > Something like the image below. Then you can imagine even more data
> > > attached, such as describing the data and relationships to other data on
> > > the page. You can imagine that in the dataset there is some `personId`
> > > that is set to not be rendered. This personId can then be used to gather
> > > more information in another API call if you wanted to render a tooltip
> > > with maybe some bio information. In short, rendered tables require a lot
> > > more information than just the data. Does it make sense to build this
> > > upon Arrow?
> > >
> > > -Thanks
> > >
> > > Michael
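
For reference, the first row-grouping idea raised at the top of this thread (a groupBy metadata array such as [country, state, city], plus a mapping from each group to its rows) could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses plain TypeScript arrays rather than the Arrow JS API, and the names Row, GroupNode, buildGroups and the sample users are hypothetical, not something defined by Arrow or by anyone in the thread. As noted in the thread, it needs all rows to be available before grouping.

```typescript
// Hypothetical row shape matching the "User Table" example in the thread.
type Row = Record<string, string>;

// One group in the tree: the key value at this level, the indices of all
// rows that fall under it, and child groups for the next groupBy column.
interface GroupNode {
  key: string;
  rowIndices: number[];
  children: GroupNode[];
}

// Build a tree of groups from a groupBy list such as
// ["country", "state", "city"]. All rows must be present up front.
function buildGroups(
  rows: Row[],
  groupBy: string[],
  indices: number[] = rows.map((_, i) => i),
  depth = 0
): GroupNode[] {
  if (depth >= groupBy.length) return [];
  const column = groupBy[depth] ?? "";
  // Map keeps insertion order, so groups appear in first-seen order.
  const buckets = new Map<string, number[]>();
  for (const i of indices) {
    const value = rows[i]?.[column] ?? "";
    const bucket = buckets.get(value);
    if (bucket) bucket.push(i);
    else buckets.set(value, [i]);
  }
  const result: GroupNode[] = [];
  buckets.forEach((rowIndices, key) => {
    result.push({
      key,
      rowIndices,
      children: buildGroups(rows, groupBy, rowIndices, depth + 1),
    });
  });
  return result;
}

// Made-up sample data, for illustration only.
const users: Row[] = [
  { firstName: "Ada", lastName: "Lee", country: "US", state: "NY", city: "Albany", occupation: "Engineer" },
  { firstName: "Bo", lastName: "Kim", country: "US", state: "NY", city: "Buffalo", occupation: "Nurse" },
  { firstName: "Cy", lastName: "Roy", country: "US", state: "CA", city: "Fresno", occupation: "Chef" },
  { firstName: "Di", lastName: "Fox", country: "Canada", state: "ON", city: "Toronto", occupation: "Pilot" },
  { firstName: "Ed", lastName: "Ng", country: "US", state: "NY", city: "Albany", occupation: "Teacher" },
];

const tree = buildGroups(users, ["country", "state", "city"]);
```

A renderer could then expand a node (say, the "NY" state group) and use its rowIndices to pull the matching rows out of the underlying columns, without restructuring the data into nested list<struct<...>> types.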