Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

Laurent Quérel Wed, 27 Jul 2022 20:38:44 -0700

Let me clarify the proposal a bit before replying to the various previous
feedbacks.

It seems to me that the process of converting a row-oriented data source
(row = set of fields or something more hierarchical) into an Arrow record
repeatedly raises the same challenges. A developer who must perform this
kind of transformation is confronted with the following questions and
problems:

- Understanding the Arrow API which can be challenging for complex cases of
rows representing complex objects (list of struct, struct of struct, ...).

- Decide which Arrow schema(s) will correspond to your data source. In some
complex cases it can be advantageous to translate the same row-oriented
data source into several Arrow schemas (e.g. OpenTelementry data sources).

- Decide on the encoding of the columns to make the most of the
column-oriented format and thus increase the compression rate (e.g. define
the columns that should be represent as dictionaries).

By experience, I can attest that this process is usually iterative. For
non-trivial data sources, arriving at the arrow representation that offers
the best compression ratio and is still perfectly usable and queryable is a
long and tedious process.

I see two approaches to ease this process and consequently increase the
adoption of Apache Arrow:

- Definition of a canonical in-memory row format specification that every
row-oriented data source provider can progressively adopt to get an
automatic translation into the Arrow format.

- Definition of an integration library allowing to map any row-oriented
source into a generic row-oriented source understood by the converter. It
is not about defining a unique in-memory format but more about defining a
standard API to represent row-oriented data.

In my opinion these two approaches are complementary. The first option is a
long-term approach targeting directly the data providers, which will
require to agree on this generic row-oriented format and whose adoption
will be more or less long. The second approach does not directly require
the collaboration of data source providers but allows an "integrator" to
perform this transformation painlessly with potentially several
representation trials to achieve the best results in his context.

The current proposal is an implementation of the second approach, i.e. an
API that maps a row-oriented source XYZ into an intermediate row-oriented
representation understood mechanically by the translator. This translator
also adds a series of optimizations to make the most of the Arrow format.

You can find multiple examples of a such transformation in the following
examples:

   -

https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
   this example converts OTEL trace entities into their corresponding Arrow
   IR. At the end of this conversion the method returns a collection of Arrow
   Records.
   - A more complex example can be found here

https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go.
   In this example a stream of OTEL univariate row-oriented metrics are
   translate into multivariate row-oriented metrics and then automatically
   translated into Apache Records.

In these two examples, the creation of dictionaries and multi-column
sorting is automatically done by the framework and the developer doesn’t
have to worry about the definition of Arrow schemas.

Now let's get to the answers.

@David Lee, I don't think Parquet and from_pylist() solve this problem
particularly well. Parquet is a column-oriented data file format and
doesn't really help to perform this transformation. The Python method is
relatively limited and language specific.

@Micah Kornfield, as described before my goal is not to define a memory
layout specification but more to define an API and a translation mechanism
able to take this intermediate representation (list of generic objects
representing the entities to translate) and to convert it into one or more
Arrow records.

@Wes McKinney, If I interpret your answer correctly, I think you are
describing the option 1 mentioned above. Like you I think it is an
interesting approach although complementary to the one I propose.

Looking forward to your feedback.

On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:

> We had an e-mail thread about this in 2018
>
> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
>
> I still think having a canonical in-memory row format (and libraries
> to transform to and from Arrow columnar format) is a good idea — but
> there is the risk of ending up in the tar pit of reinventing Avro.
>
>
> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Are there more details on what exactly an "Arrow Intermediate
> > Representation (AIR)" is?  We've talked about in the past maybe having a
> > memory layout specification for row-based data as well as column based
> > data.  There was also a recent attempt at least in C++ to try to build
> > utilities to do these pivots but it was decided that it didn't add much
> > utility (it was added a comprehensive example).
> >
> > Thanks,
> > Micah
> >
> > On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com
> >
> > wrote:
> >
> > > In the context of this OTEP
> > > <
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> > > >
> > > (OpenTelemetry
> > > Enhancement Proposal) I developed an integration layer on top of Apache
> > > Arrow (Go an Rust) to *facilitate the translation of row-oriented data
> > > stream into an arrow-based columnar representation*. In this particular
> > > case the goal was to translate all OpenTelemetry entities (metrics,
> logs,
> > > or traces) into Apache Arrow records. These entities can be quite
> complex
> > > and their corresponding Arrow schema must be defined on the fly. IMO,
> this
> > > approach is not specific to my specific needs but could be used in many
> > > other contexts where there is a need to simplify the integration
> between a
> > > row-oriented source of data and Apache Arrow. The trade-off is to have
> to
> > > perform the additional step of conversion to the intermediate
> > > representation, but this transformation does not require to understand
> the
> > > arcana of the Arrow format and allows to potentially benefit from
> > > functionalities such as the encoding of the dictionary "for free", the
> > > automatic generation of Arrow schemas, the batching, the multi-column
> > > sorting, etc.
> > >
> > >
> > > I know that JSON can be used as a kind of intermediate representation
> in
> > > the context of Arrow with some language specific implementation.
> Current
> > > JSON integrations are insufficient to cover the most complex scenarios
> and
> > > are not standardized; e.g. support for most of the Arrow data type,
> various
> > > optimizations (string|binary dictionaries, multi-column sorting),
> batching,
> > > integration with Arrow IPC, compression ratio optimization, ... The
> object
> > > of this proposal is to progressively cover these gaps.
> > >
> > > I am looking to see if the community would be interested in such a
> > > contribution. Above are some additional details on the current
> > > implementation. All feedback is welcome.
> > >
> > > 10K ft overview of the current implementation:
> > >
> > >    1. Developers convert their row oriented stream into records based
> on
> > >    the Arrow Intermediate Representation (AIR). At this stage the
> > > translation
> > >    can be quite mechanical but if needed developers can decide for
> example
> > > to
> > >    translate a map into a struct if that makes sense for them. The
> current
> > >    implementation support the following arrow data types: bool, all
> uints,
> > > all
> > >    ints, all floats, string, binary, list of any supported types, and
> > > struct
> > >    of any supported types. Additional Arrow types could be added
> > > progressively.
> > >    2. The row oriented record (i.e. AIR record) is then added to a
> > >    RecordRepository. This repository will first compute a schema
> signature
> > > and
> > >    will route the record to a RecordBatcher based on this signature.
> > >    3. The RecordBatcher is responsible for collecting all the
> compatible
> > >    AIR records and, upon request, the "batcher" is able to build an
> Arrow
> > >    Record representing a batch of compatible inputs. In the current
> > >    implementation, the batcher is able to convert string columns to
> > > dictionary
> > >    based on a configuration. Another configuration allows to evaluate
> which
> > >    columns should be sorted to optimize the compression ratio. The same
> > >    optimization process could be applied to binary columns.
> > >    4. Steps 1 through 3 can be repeated on the same RecordRepository
> > >    instance to build new sets of arrow record batches. Subsequent
> > > iterations
> > >    will be slightly faster due to different techniques used (e.g.
> object
> > >    reuse, dictionary reuse and sorting, ...)
> > >
> > >
> > > The current Go implementation
> > > <https://github.com/lquerel/otel-arrow-adapter> (WIP) is currently
> part of
> > > this repo (see pkg/air package). If the community is interested, I
> could do
> > > a PR in the Arrow Go and Rust sub-projects.
> > >
>

-- 
Laurent Quérel

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

Reply via email to