Hi Gavin,

I was not aware of this initiative but indeed, these two proposals have much in common. The implementation I am working on is available here: https://github.com/lquerel/otel-arrow-adapter (directory pkg/air). I would be happy to get your feedback and to identify with you the possible gaps to cover your specific use case.

Best,
Laurent
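For readers who want to see what both proposals abstract away, here is a minimal sketch of the hand-written builder code a developer writes today to pivot row-oriented data into an Arrow record. It uses the Arrow Go library directly; the github.com/apache/arrow/go/v9 module path and the two-field metric schema are assumptions for this example only and are not the pkg/air API.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// The Arrow schema a row-oriented source has to be mapped to by hand today
	// (illustrative two-field metric schema, not the pkg/air schema).
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "name", Type: arrow.BinaryTypes.String},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)

	// A toy row-oriented input: one struct per row.
	rows := []struct {
		Name  string
		Value float64
	}{
		{"cpu.usage", 0.42},
		{"mem.used", 0.77},
	}

	// Manual row-to-column pivot through the builder API.
	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()
	for _, r := range rows {
		b.Field(0).(*array.StringBuilder).Append(r.Name)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
	}

	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec.NumRows(), "rows,", rec.NumCols(), "columns")
}
```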
On Thu, Jul 28, 2022 at 5:43 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
> This is essentially the same idea as the proposal here, I think -- a row/map-based representation & conversion functions for ease of use:
> [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity. · Issue #12618 · apache/arrow <https://github.com/apache/arrow/issues/12618>
> Definitely a worthwhile pursuit IMO.
> On Thu, Jul 28, 2022 at 4:46 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > Hi everyone,
> > I just wanted to chime in that we already do have a form of row-oriented storage inside of `arrow/compute/row/row_internal.h`. It is used to store rows inside of GroupBy and Join within Acero. We also have utilities for converting to/from columnar storage (and AVX2 implementations of these conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be useful to standardize this row-oriented format?
> > As far as I understand, fixed-width rows would be trivially convertible into this representation (just a pointer to your array of structs), while variable-width rows would need a little bit of massaging (though not too much) to be put into this representation.
> > Sasha Krassovsky
> > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
> > > Thank you Micah for a very clear summary of the intent behind this proposal. Indeed, I think that clarifying from the beginning that this approach aims at facilitating experimentation more than at the performance of the transformation phase would have helped to better convey my objective.
> > > Regarding your question, I don't think there is a specific technical reason for such an integration in the core library. I was just thinking that it would make this infrastructure easier for users to find and that this topic was general enough to find its place in the standard library.
> > > Best,
> > > Laurent
> > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >> Hi Laurent,
> > >> I'm retitling this thread to include the specific languages you seem to be targeting in the subject line, to hopefully get more eyes from maintainers in those languages.
> > >> Thanks for clarifying the goals. If I can restate my understanding, the intended use case here is to provide easy (from the developer's point of view) adaptation of row-based formats to Arrow. The means of achieving this is creating an API for a row-based structure, and having utility classes that can manipulate the interface to build up batches (there is no serialization or in-memory spec associated with this API). People wishing to integrate a specific row-based format can extend that API at whatever level makes sense for the format.
> > >> I think this would be useful infrastructure as long as it was made clear that in many cases this wouldn't be the most efficient way to convert to Arrow from other formats.
> > >> I don't work much with either the Rust or Go implementation, so I can't speak to whether there is maintainer support for incorporating the changes directly in Arrow.
> > >> Are there any technical reasons for preferring to have this included directly in Arrow vs. a separate library?
> > >> Cheers,
> > >> Micah
> > >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>> Far be it from me to think that I know more than Jorge or Wes on this subject. Sorry if my post gives that impression, that is clearly not my intention. I'm just trying to defend the idea that when designing this kind of transformation, it might be interesting to have a library to test several mappings and evaluate them before doing a more direct implementation if the performance is not there.
> > >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
> > >>>> He was trying to nicely say he knows way more than you, and your ideas will result in a low-performance scheme no one will use in production AI/machine learning.
> > >>>> Sent from my iPhone
> > >>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
> > >>>>> I think Jorge's opinion is that of an expert, and him being humble is just him being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.
> > >>>>> Sent from my iPhone
> > >>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>> Hi Jorge,
> > >>>>>> I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and using "Arrow", which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.
> > >>>>>> For example, assume that the initial row-oriented data source is a stream of nested assemblies of structures, lists, and maps. The mapping of such a stream to Protobuf, JSON, YAML, ... is straightforward because on both sides the logical representation is exactly the same, the schema is sometimes optional, the interest of building batches is optional, ... In the case of "Arrow" things are different - the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of the combination of several trade-offs (normalized vs. denormalized representation, mapping influencing the compression rate, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.
> > >>>>>> IMHO, having a library that allows you to easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, understanding how lists of structs are represented, ...) and to evaluate the results according to your specific goals has value (especially if your criteria are compression ratio and queryability). Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and "Arrow". However, this effort will come after a simple experimentation phase, avoiding repeated changes to a converter which, in my opinion, is not so simple to implement with the current Arrow API.
> > >>>>>> If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard "Arrow" libraries. This will have the advantage of evaluating whether such an approach is able to raise interest among Arrow users.
> > >>>>>> Best,
> > >>>>>> Laurent
> > >>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >>>>>>> Hi Laurent,
> > >>>>>>> I agree that there is a common pattern in converting row-based formats to Arrow.
> > >>>>>>> Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar-based) format, since it requires in-depth knowledge about the 2 formats (the source format and the target format).
> > >>>>>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
> > >>>>>>> The developer would have the same problem - just shifted around - they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory format".
> > >>>>>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
> > >>>>>>> Imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).
> > >>>>>>> A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower the last time I played with this problem in JSON [1]).
> > >>>>>>> A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize a language-agnostic spec for such a representation (so languages agree on its representation), and thus essentially declare a new (row-based) format.
> > >>>>>>> (None of this precludes efforts to invent an in-memory row format for analytics workloads.)
> > >>>>>>> @Wes McKinney <wesmck...@gmail.com>
> > >>>>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
> > >>>>>>> Afaik Avro has O(1) random access to neither its rows nor its columns - records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
> > >>>>>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
> > >>>>>>> Best,
> > >>>>>>> Jorge
> > >>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>>>> Let me clarify the proposal a bit before replying to the various previous feedbacks.
> > >>>>>>>> It seems to me that the process of converting a row-oriented data source (row = set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must perform this kind of transformation is confronted with the following questions and problems:
> > >>>>>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
> > >>>>>>>> - Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g.
> > >>>>>>>> OpenTelemetry data sources).
> > >>>>>>>> - Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression rate (e.g. defining the columns that should be represented as dictionaries).
> > >>>>>>>> From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio and is still perfectly usable and queryable is a long and tedious process.
> > >>>>>>>> I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:
> > >>>>>>>> - Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
> > >>>>>>>> - Definition of an integration library allowing any row-oriented source to be mapped onto a generic row-oriented source understood by the converter. It is not about defining a unique in-memory format but more about defining a standard API to represent row-oriented data.
> > >>>>>>>> In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it will require agreement on this generic row-oriented format and its adoption will take more or less time. The second approach does not directly require the collaboration of data source providers but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.
> > >>>>>>>> The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
> > >>>>>>>> You can find multiple examples of such a transformation here:
> > >>>>>>>> - https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go - this example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
> > >>>>>>>> - A more complex example can be found here: https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
> > >>>>>>>> In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically converted into Apache Arrow Records.
> > >>>>>>>> In these two examples, the creation of dictionaries and the multi-column sorting are automatically done by the framework, and the developer doesn't have to worry about the definition of Arrow schemas.
> > >>>>>>>> Now let's get to the answers.
> > >>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language-specific.
> > >>>>>>>> @Micah Kornfield, as described before, my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
> > >>>>>>>> @Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.
> > >>>>>>>> Looking forward to your feedback.
> > >>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>>>>>>> We had an e-mail thread about this in 2018: https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
> > >>>>>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
> > >>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example).
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Micah
> > >>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>>>>>>> In the context of this OTEP <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md> (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of conversion to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format, and it allows you to potentially benefit from functionalities such as dictionary encoding "for free", the automatic generation of Arrow schemas, batching, multi-column sorting, etc.
> > >>>>>>>>>>> I know that JSON can be used as a kind of intermediate representation in the context of Arrow, with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string|binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The object of this proposal is to progressively cover these gaps.
> > >>>>>>>>>>> I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.
> > >>>>>>>>>>> 10K ft overview of the current implementation:
> > >>>>>>>>>>> 1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR).
> > >>>>>>>>>>> At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported types, and struct of any supported types. Additional Arrow types could be added progressively.
> > >>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository will first compute a schema signature and will route the record to a RecordBatcher based on this signature.
> > >>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" is able to build an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher is able to convert string columns to dictionaries based on a configuration. Another configuration allows evaluating which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
> > >>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...).
> > >>>>>>>>>>> The current Go implementation <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this repo (see the pkg/air package). If the community is interested, I could do a PR in the Arrow Go and Rust sub-projects.
> > >>>>>>>> --
> > >>>>>>>> Laurent Quérel
> > >>>>>> --
> > >>>>>> Laurent Quérel
> > >>> --
> > >>> Laurent Quérel
> > > --
> > > Laurent Quérel
--
Laurent Quérel
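As a companion to step 2 of the 10K ft overview quoted above, here is a minimal, self-contained sketch of schema-signature routing: rows whose field names and types match are grouped together so that one Arrow record per schema can later be emitted. The Row type, signature function, and repository map are illustrative assumptions for this sketch only, not the actual pkg/air types.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Row is a toy stand-in for a record in the intermediate representation:
// field name -> value.
type Row map[string]interface{}

// signature derives a schema signature from a row by sorting the field
// names together with the Go type of their values, so rows with the same
// shape share the same signature regardless of field order.
func signature(r Row) string {
	parts := make([]string, 0, len(r))
	for name, v := range r {
		parts = append(parts, fmt.Sprintf("%s:%T", name, v))
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	// A toy repository: signature -> rows waiting to be batched.
	repo := map[string][]Row{}

	rows := []Row{
		{"name": "cpu.usage", "value": 0.42},
		{"name": "mem.used", "value": 0.77},
		{"name": "event", "attrs": []string{"a", "b"}},
	}

	// Route each row to the group matching its schema signature.
	for _, r := range rows {
		sig := signature(r)
		repo[sig] = append(repo[sig], r)
	}

	for sig, batch := range repo {
		fmt.Printf("%d row(s) for signature %q\n", len(batch), sig)
	}
}
```

Grouping by signature is what lets a batcher turn each group into a single columnar batch with a stable, automatically derived schema.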