Laurent Querel created ARROW-17220:
--------------------------------------

             Summary: [Proposal] Arrow Intermediate Representation
                 Key: ARROW-17220
                 URL: https://issues.apache.org/jira/browse/ARROW-17220
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Integration
            Reporter: Laurent Querel


In the context of this 
[OTEP|https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md] 
(OpenTelemetry Enhancement Proposal) I developed an integration layer on top 
of Apache Arrow (Go and Rust) to facilitate the translation of row-oriented 
data streams into an Arrow-based columnar representation. In this particular 
case the goal was to translate all OpenTelemetry entities (metrics, logs, or 
traces) into Apache Arrow records. These entities can be quite complex and 
their corresponding Arrow schemas must be defined on the fly. IMO, this 
approach is not specific to my needs and could be used in many other contexts 
where there is a need to simplify the integration between a row-oriented data 
source and Apache Arrow.

I know that JSON can be used as a kind of intermediate representation in the 
context of Arrow, but I think the current JSON integrations are insufficient 
to cover the most complex scenarios, e.g. support for most of the Arrow data 
types, various optimizations (string/binary dictionaries, multi-column 
sorting), batching, integration with Arrow IPC, compression ratio 
optimization, ... The goal of this proposal is to progressively cover these 
gaps.

I am looking to see if the community would be interested in such a 
contribution. Below are some additional details on the current implementation. 
All feedback is welcome.

10K ft overview of the current implementation:
 # Developers convert their row-oriented stream into records based on the 
Arrow Intermediate Representation (AIR). At this stage the translation can be 
quite mechanical, but if needed developers can decide, for example, to 
translate a map into a struct if that makes sense for them (a sketch of such a 
record follows this list). The current implementation supports the following 
Arrow data types: bool, all uints, all ints, all floats, string, binary, lists 
of any supported type, and structs of any supported types. Additional Arrow 
types could be added progressively.
 # The row-oriented record (i.e. AIR record) is then added to a 
RecordRepository. This repository first computes a schema signature and then 
routes the record to a RecordBatcher based on this signature.
 # The RecordBatcher is responsible for collecting all the compatible AIR 
records and, upon request, building an Arrow Record representing a batch of 
compatible inputs (a sketch of this routing and batching follows this list). 
In the current implementation, the batcher is able to convert string columns 
to dictionaries based on a configuration. Another configuration allows 
evaluating which columns should be sorted to optimize the compression ratio. 
The same optimization process could be applied to binary columns.
 # Steps 1 through 3 can be repeated on the same RecordRepository instance to 
build new sets of Arrow record batches. Subsequent iterations will be slightly 
faster due to the different techniques used (e.g. object reuse, dictionary 
reuse and sorting, ...).
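
To make step 1 concrete, here is a minimal Go sketch of what an AIR record 
could look like. The actual AIR API is not shown in this issue, so all type 
and field names below (Value, Record, Struct, etc.) are hypothetical 
illustrations of the value model described above (scalars, lists, structs), 
not the real implementation:

{code:go}
package main

import "fmt"

// Hypothetical AIR value model: a record is a set of named fields, each
// holding a scalar, list, or struct value. Names are illustrative only.
type Value interface{ isValue() }

type Bool bool
type I64 int64
type F64 float64
type Str string
type Bin []byte
type List []Value
type Struct map[string]Value

func (Bool) isValue()   {}
func (I64) isValue()    {}
func (F64) isValue()    {}
func (Str) isValue()    {}
func (Bin) isValue()    {}
func (List) isValue()   {}
func (Struct) isValue() {}

// Record is a single row-oriented entity expressed in AIR terms.
type Record struct {
	Fields Struct
}

func main() {
	// A row-oriented log entry translated mechanically into an AIR record.
	// A map could instead be modeled as a struct if that fits the data.
	rec := Record{Fields: Struct{
		"timestamp": I64(1658600000),
		"severity":  Str("INFO"),
		"body":      Str("request served"),
		"attributes": Struct{
			"http.method": Str("GET"),
			"http.status": I64(200),
		},
		"events": List{Str("enqueued"), Str("flushed")},
	}}
	fmt.Printf("%+v\n", rec)
}
{code}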
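And here is a sketch of the routing logic from steps 2 and 3: the repository 
computes a schema signature per incoming record and routes it to a batcher 
dedicated to that signature. Again, RecordRepository, RecordBatcher, and the 
signature function below are simplified, hypothetical stand-ins for the 
mechanism described above (the real batcher would also apply dictionary 
conversion and column sorting, and Build would return an actual Arrow Record):

{code:go}
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Simplified AIR record: field name -> value (nested types omitted).
type Record map[string]any

// signature derives a canonical schema identifier from field names and
// dynamic types; a real implementation would walk nested types too.
func signature(r Record) string {
	parts := make([]string, 0, len(r))
	for name, v := range r {
		parts = append(parts, fmt.Sprintf("%s:%T", name, v))
	}
	sort.Strings(parts) // canonical order: field order must not matter
	return strings.Join(parts, ",")
}

// RecordBatcher collects compatible AIR records.
type RecordBatcher struct {
	records []Record
}

func (b *RecordBatcher) Add(r Record) { b.records = append(b.records, r) }

// Build would produce an Arrow Record batch here; this sketch only
// reports how many compatible records were collected.
func (b *RecordBatcher) Build() string {
	return fmt.Sprintf("arrow record batch with %d rows", len(b.records))
}

// RecordRepository routes records to batchers by schema signature.
type RecordRepository struct {
	batchers map[string]*RecordBatcher
}

func NewRecordRepository() *RecordRepository {
	return &RecordRepository{batchers: map[string]*RecordBatcher{}}
}

func (repo *RecordRepository) AddRecord(r Record) {
	sig := signature(r)
	b, ok := repo.batchers[sig]
	if !ok {
		b = &RecordBatcher{}
		repo.batchers[sig] = b
	}
	b.Add(r)
}

func main() {
	repo := NewRecordRepository()
	repo.AddRecord(Record{"name": "cpu", "value": 0.42})
	repo.AddRecord(Record{"name": "mem", "value": 0.80})      // same signature
	repo.AddRecord(Record{"msg": "hi", "level": int64(1)})    // new signature
	for sig, b := range repo.batchers {
		fmt.Printf("%s -> %s\n", sig, b.Build())
	}
}
{code}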

 


