Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked about in the past maybe having a memory layout specification for row-based data as well as column based data. There was also a recent attempt at least in C++ to try to build utilities to do these pivots but it was decided that it didn't add much utility (it was added a comprehensive example).
Thanks, Micah On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote: > In the context of this OTEP > <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md > > > (OpenTelemetry > Enhancement Proposal) I developed an integration layer on top of Apache > Arrow (Go an Rust) to *facilitate the translation of row-oriented data > stream into an arrow-based columnar representation*. In this particular > case the goal was to translate all OpenTelemetry entities (metrics, logs, > or traces) into Apache Arrow records. These entities can be quite complex > and their corresponding Arrow schema must be defined on the fly. IMO, this > approach is not specific to my specific needs but could be used in many > other contexts where there is a need to simplify the integration between a > row-oriented source of data and Apache Arrow. The trade-off is to have to > perform the additional step of conversion to the intermediate > representation, but this transformation does not require to understand the > arcana of the Arrow format and allows to potentially benefit from > functionalities such as the encoding of the dictionary "for free", the > automatic generation of Arrow schemas, the batching, the multi-column > sorting, etc. > > > I know that JSON can be used as a kind of intermediate representation in > the context of Arrow with some language specific implementation. Current > JSON integrations are insufficient to cover the most complex scenarios and > are not standardized; e.g. support for most of the Arrow data type, various > optimizations (string|binary dictionaries, multi-column sorting), batching, > integration with Arrow IPC, compression ratio optimization, ... The object > of this proposal is to progressively cover these gaps. > > I am looking to see if the community would be interested in such a > contribution. Above are some additional details on the current > implementation. All feedback is welcome. > > 10K ft overview of the current implementation: > > 1. Developers convert their row oriented stream into records based on > the Arrow Intermediate Representation (AIR). At this stage the > translation > can be quite mechanical but if needed developers can decide for example > to > translate a map into a struct if that makes sense for them. The current > implementation support the following arrow data types: bool, all uints, > all > ints, all floats, string, binary, list of any supported types, and > struct > of any supported types. Additional Arrow types could be added > progressively. > 2. The row oriented record (i.e. AIR record) is then added to a > RecordRepository. This repository will first compute a schema signature > and > will route the record to a RecordBatcher based on this signature. > 3. The RecordBatcher is responsible for collecting all the compatible > AIR records and, upon request, the "batcher" is able to build an Arrow > Record representing a batch of compatible inputs. In the current > implementation, the batcher is able to convert string columns to > dictionary > based on a configuration. Another configuration allows to evaluate which > columns should be sorted to optimize the compression ratio. The same > optimization process could be applied to binary columns. > 4. Steps 1 through 3 can be repeated on the same RecordRepository > instance to build new sets of arrow record batches. Subsequent > iterations > will be slightly faster due to different techniques used (e.g. object > reuse, dictionary reuse and sorting, ...) > > > The current Go implementation > <https://github.com/lquerel/otel-arrow-adapter> (WIP) is currently part of > this repo (see pkg/air package). If the community is interested, I could do > a PR in the Arrow Go and Rust sub-projects. >