Speaking of Arrow's JS implementation, there's one small (+2 −2) JS pull request in the queue that could use a review.
ARROW-11706: [JS] Better BigInt compatibility check
https://issues.apache.org/jira/browse/ARROW-11706
https://github.com/apache/arrow/pull/9110

Have a great weekend, folks!
--diana

On Fri, Feb 26, 2021 at 7:34 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> I used Arrow for this purpose in the past. I don't have much to add, but here are a few thoughts off the top of my head...
>
> * The line between data and metadata can be blurry - For most measurements we were able to store the "expected distribution" as metadata (e.g. this measurement should have an expected value of 10 +/- 3) and that could be used for drawing limit lines. For some measurements, however, the common practice in place was to store the upper/lower limit as separate columns because they often changed depending on the various independent variables. In that case the same "concept" (limit) might be stored in data or metadata.
>
> * Distinction between "data" and a "chart" - For us, we introduced a separate representation called the "chart" between the data and the rendering layer. So, using that limit line example from before, if we wanted to plot a histogram of some column then we would create a bar chart from the column. This bar chart itself was also an array of numbers but, since these arrays were much smaller (one per bin, hard limit to bin count in the thousands based on # of pixels in display), and the structure was much more deeply nested, we ended up just using JSON for charts. The "limit" metadata belonged to the data and it was translated into a vertical line element as part of the chart.
>
> * Processing layer - For us it was too expensive to send the data across the Internet for display. So the conversion from data -> chart happened within the datacenter close to the actual data. The JS UI was simply responsible for chart -> pixels (well, SVG). It sounds like you plan on doing the processing in JS. This can work, I'm just tossing out alternatives to think about. You can even have a hybrid model where some initial filtering happens in the datacenter and then chart calculation / rendering happens in JS.
>
> * Expressions for group/split - Arrow expressions / compute are starting to become available (and more work is being done on in-arrow query engines). These can be very helpful for things like grouping or splitting. For example, if you want to plot two line charts, one for model X and one for model Y, then you can define your split using expressions. Unfortunately, these are pretty big features and I don't think they are in the JS library. However, the existing C++/Rust work could serve as examples for how you might want to tackle this. You will need a fair amount of compute to go from data to chart (histograms, averages, standard deviations, etc.). In my case I used pandas pretty extensively for this since the Arrow compute features didn't exist yet. There are some JS libraries for this (e.g. d3) so you can probably investigate that avenue as well.
>
> On Fri, Feb 26, 2021 at 12:05 PM Paul Taylor <ptay...@apache.org> wrote:
> >
> > Hi Michael,
> >
> > The answer to your question about metadata will likely be application-specific.
> >
> > For small amounts of metadata (e.g. communicating a bounding box of included geometry), there isn't much room for optimization, so a string could be fine.
> >
> > For larger amounts of metadata (or other constraints, like if the metadata needs to be constantly modified independent of the data), custom encodings or a second service and/or Arrow table of the metadata could be the way to go.
> >
> > The metadata keys/values are UTF-8 strings, so nothing should prevent you from stuffing a base64-encoded protobuf in there.
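To illustrate the point above that metadata keys and values are plain strings, here is a minimal TypeScript sketch of the "small metadata as a string" case: a bounding box serialized to JSON and stored under a string key. The plain `Record<string, string>` is only a stand-in for Arrow's string key/value custom metadata, and `BoundingBox`, `BBOX_KEY`, `writeBoundingBox`, and `readBoundingBox` are hypothetical names invented for this example, not part of any Arrow API.

```typescript
// Hypothetical shape for a small piece of metadata (not an Arrow type).
interface BoundingBox {
  minX: number;
  minY: number;
  maxX: number;
  maxY: number;
}

// Stand-in for string-to-string custom metadata; the key name is made up.
type Metadata = Record<string, string>;
const BBOX_KEY = "example.boundingBox";

// Values must be strings, so the structured value is serialized first.
function writeBoundingBox(metadata: Metadata, box: BoundingBox): void {
  metadata[BBOX_KEY] = JSON.stringify(box);
}

// Parse the string back; returns undefined when the key is absent.
function readBoundingBox(metadata: Metadata): BoundingBox | undefined {
  const raw = metadata[BBOX_KEY];
  return raw === undefined ? undefined : (JSON.parse(raw) as BoundingBox);
}

const schemaMetadata: Metadata = {};
writeBoundingBox(schemaMetadata, { minX: 0, minY: -5, maxX: 10, maxY: 5 });
const restoredBox = readBoundingBox(schemaMetadata);
```

A binary payload such as the base64-encoded protobuf mentioned above would follow the same pattern, with the encode/decode step swapped for base64 instead of JSON.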
> >
> > As for whether the library is maintained -- yes it is, but lately I've only had time to work on bug fixes or features required to maintain parity with the spec and other libs.
> >
> > I will be using Arrow JS in my work again soon, and that could justify more "quality of life" improvements again, but without other maintainers jumping in to contribute or needing it for my work, those things don't get done.
> >
> > I'd be happy to do a call with you or your team to give a short overview and introduction to the JS lib. You can also email me directly or reach me in the #arrow-js channel on the-asf.slack.com with any questions.
> >
> > Best,
> > Paul
> >
> > On Fri, Feb 26, 2021 at 1:47 PM Michael Lavina <michael.lav...@factset.com> wrote:
> >
> > > Hey Neal,
> > >
> > > Thanks for the response and I am glad I am using this correctly. I have never really used email servers so hopefully this works.
> > >
> > > That’s exactly what I was thinking of doing: creating a standard metadata schema to build on top of Apache Arrow with some predefined user types.
> > >
> > > I guess I was just wondering if I was trying to use a screwdriver as a hammer. It can work because we are using the metadata and that could be anything, but maybe, like you said, we should be creating a separate standard entirely for defining the schema to render tables instead of defining it within Arrow.
> > >
> > > Does it defeat the value of Arrow if we are sending the data using buffers and streams alongside a giant string of stringified metadata, when I could maybe define the metadata in protobuf binary separately?
> > >
> > > In addition, I was curious: with all these visualization tools, has someone already developed a standard metadata for Arrow to help with rendering? Stuff like how to denote grouping of data, relationships between columns, and hidden information.
> > >
> > > -Michael
> > >
> > > From: Neal Richardson <neal.p.richard...@gmail.com>
> > > Date: Friday, February 26, 2021 at 1:38 PM
> > > To: dev <dev@arrow.apache.org>
> > > Subject: Re: [JS] Exploring usage of apache arrow at my company for complex table rendering
> > >
> > > The Arrow IPC specification allows for custom metadata in both the Schema and the individual Fields:
> > > https://arrow.apache.org/docs/format/Columnar.html#schema-message
> > >
> > > Might that work for you? Another alternative would be to track your metadata in a separate object outside of the Arrow data.
> > >
> > > Neal
> > >
> > > On Fri, Feb 26, 2021 at 5:02 AM Michael Lavina <michael.lav...@factset.com> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > Some background. My name is Michael and I work at FactSet, which, if you use Arrow, you may have heard of because one of our architects did a talk on using Arrow and Dremio.
> > > >
> > > > https://hello.dremio.com/eliminate-data-transfer-bottlenecks-with-apache-arrow-flight.html?utm_medium=social-free&utm_source=linkedin&utm_term=na&utm_content=na&utm_campaign=eliminate-data-transfer-bottlenecks-with-apache-arrow-flight
> > > >
> > > > His team has decided to use Arrow as a tabular data interchange format. Other teams are doing other things. We are working on standardizing our tabular data interchange format at our company.
> > > >
> > > > We have our own open-sourced columnar based schema defined in protobuf.
> > > > https://github.com/factset/stachschema
> > > >
> > > > We looked into Apache Arrow a few years ago, but decided not to use it as it was not mature enough at the time and we had two specific requirements:
> > > >
> > > > 1) We needed this data not just for analytics but for rendering as well, and rendering requires a lot more complicated information, such as understanding the type of data and the relationships between data, i.e. grouping.
> > > >
> > > > 2) We need SDKs that support TypeScript/JavaScript, both browser and Node, and that support both creating and consuming Arrow.
> > > >
> > > > Now that Apache Arrow is more mature and stabilized, i.e. the schema and SDKs are post 1.x, we are looking into it again.
> > > >
> > > > 1. We are thinking of defining specific metadata, in a similar way to what we do for STACH, that lets us define some rendering-specific information, e.g. adding a metadata entry called isHidden to a Field Schema to denote whether we should render the data column or not.
> > > > 2. It seems like there is a well developed JavaScript SDK that we can use. I am still reading the source code and the Observable articles to truly understand how it works.
> > > >    1. I read that one of the issues is that the JS library might be out of sync, so do people know how actively that repo is maintained?
> > > >    2. If there needs to be work done, I think we would be able to help if we had some help getting started with understanding that repo.
> > > >
> > > > If possible, we would be interested in continuing to chat about the above ideas, getting more information about whether Apache Arrow is right for the job, and hearing whether there is already discussion of other people using Arrow for rendering in addition to analytics.
> > > >
> > > > To clarify what I mean by existing render technologies: I know stuff like Falcon and Perspective exist, but those seem to be for basic table rendering for simple tables. I mean to create a superset of Arrow by defining metadata that allows for complex nested headers and nested rows. Something like the image below. Then you can imagine even more data attached, such as describing the data and relationships to other data on the page. You can imagine that in the dataset there is some `personId` that is set to not be rendered. This personId can then be used to gather more information in another API call if you wanted to render a tooltip with maybe some bio information. In short, rendered tables require a lot more information than just the data. Does it make sense to build this upon Arrow?
> > > >
> > > > -Thanks
> > > >
> > > > Michael
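
The `isHidden` idea from the original message can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: `ColumnField` is a hypothetical stand-in for a field description carrying string-to-string metadata (it is not the Arrow JS `Field` class), and the `"true"`/`"false"` string convention for the flag is an assumption, since metadata values are strings.

```typescript
// Hypothetical column description: a name plus string-to-string metadata.
interface ColumnField {
  name: string;
  metadata: Record<string, string>;
}

// Metadata values are strings, so the flag is compared against "true".
// A missing key is treated as "not hidden".
function isHidden(field: ColumnField): boolean {
  return field.metadata["isHidden"] === "true";
}

// Names of the columns a renderer would actually draw.
function visibleColumns(fields: ColumnField[]): string[] {
  return fields.filter((f) => !isHidden(f)).map((f) => f.name);
}

// Example fields; personId stays in the data but is flagged as hidden.
const exampleFields: ColumnField[] = [
  { name: "personId", metadata: { isHidden: "true" } },
  { name: "name", metadata: {} },
  { name: "revenue", metadata: { isHidden: "false" } },
];

// personId is left out of the rendered columns but remains available,
// e.g. for a follow-up API call that fills a tooltip.
const rendered = visibleColumns(exampleFields);
```

Keeping the hidden column in the dataset rather than dropping it is the point of the flag: the renderer skips it, while other code can still read its values.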