Are there more details on what exactly an "Arrow Intermediate
Representation (AIR)" is?  We've talked about in the past maybe having a
memory layout specification for row-based data as well as column based
data.  There was also a recent attempt at least in C++ to try to build
utilities to do these pivots but it was decided that it didn't add much
utility (it was added a comprehensive example).

Thanks,
Micah

On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com>
wrote:

> In the context of this OTEP
> <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> >
> (OpenTelemetry
> Enhancement Proposal) I developed an integration layer on top of Apache
> Arrow (Go an Rust) to *facilitate the translation of row-oriented data
> stream into an arrow-based columnar representation*. In this particular
> case the goal was to translate all OpenTelemetry entities (metrics, logs,
> or traces) into Apache Arrow records. These entities can be quite complex
> and their corresponding Arrow schema must be defined on the fly. IMO, this
> approach is not specific to my specific needs but could be used in many
> other contexts where there is a need to simplify the integration between a
> row-oriented source of data and Apache Arrow. The trade-off is to have to
> perform the additional step of conversion to the intermediate
> representation, but this transformation does not require to understand the
> arcana of the Arrow format and allows to potentially benefit from
> functionalities such as the encoding of the dictionary "for free", the
> automatic generation of Arrow schemas, the batching, the multi-column
> sorting, etc.
>
>
> I know that JSON can be used as a kind of intermediate representation in
> the context of Arrow with some language specific implementation. Current
> JSON integrations are insufficient to cover the most complex scenarios and
> are not standardized; e.g. support for most of the Arrow data type, various
> optimizations (string|binary dictionaries, multi-column sorting), batching,
> integration with Arrow IPC, compression ratio optimization, ... The object
> of this proposal is to progressively cover these gaps.
>
> I am looking to see if the community would be interested in such a
> contribution. Above are some additional details on the current
> implementation. All feedback is welcome.
>
> 10K ft overview of the current implementation:
>
>    1. Developers convert their row oriented stream into records based on
>    the Arrow Intermediate Representation (AIR). At this stage the
> translation
>    can be quite mechanical but if needed developers can decide for example
> to
>    translate a map into a struct if that makes sense for them. The current
>    implementation support the following arrow data types: bool, all uints,
> all
>    ints, all floats, string, binary, list of any supported types, and
> struct
>    of any supported types. Additional Arrow types could be added
> progressively.
>    2. The row oriented record (i.e. AIR record) is then added to a
>    RecordRepository. This repository will first compute a schema signature
> and
>    will route the record to a RecordBatcher based on this signature.
>    3. The RecordBatcher is responsible for collecting all the compatible
>    AIR records and, upon request, the "batcher" is able to build an Arrow
>    Record representing a batch of compatible inputs. In the current
>    implementation, the batcher is able to convert string columns to
> dictionary
>    based on a configuration. Another configuration allows to evaluate which
>    columns should be sorted to optimize the compression ratio. The same
>    optimization process could be applied to binary columns.
>    4. Steps 1 through 3 can be repeated on the same RecordRepository
>    instance to build new sets of arrow record batches. Subsequent
> iterations
>    will be slightly faster due to different techniques used (e.g. object
>    reuse, dictionary reuse and sorting, ...)
>
>
> The current Go implementation
> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is currently part of
> this repo (see pkg/air package). If the community is interested, I could do
> a PR in the Arrow Go and Rust sub-projects.
>

Reply via email to