Please note this message and the previous one from the author violate our Code of Conduct [1]. Specifically "Do not insult or put down other participants." Please try to be professional in communications and focus on the technical issues at hand.
[1] https://www.apache.org/foundation/policies/conduct.html

On Thu, Jul 28, 2022 at 12:16 PM Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:

He was trying to nicely say he knows way more than you, and your ideas will result in a low-performance scheme no one will use in production AI/machine learning.

Sent from my iPhone

On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:

I think Jorge's opinion is that of an expert, and him being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.

Sent from my iPhone

On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:

Hi Jorge,

I don't think the level of in-depth knowledge needed is the same between using a row-oriented internal representation and Arrow, which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.

For example, assume the initial row-oriented data source is a stream of nested assemblies of structs, lists, and maps. Mapping such a stream to Protobuf, JSON, YAML, etc. is straightforward because the logical representation is exactly the same on both sides, the schema is sometimes optional, and batching is optional. In the case of Arrow, things are different: the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of several trade-offs (normalized vs. denormalized representation, mappings that influence the compression ratio, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.

IMHO, having a library that lets you easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, how lists of structs are represented, ...) and evaluate the results against your specific goals has value, especially if your criteria are compression ratio and queryability. Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and Arrow. However, this effort will come after a simple experimentation phase, avoiding repeated changes to the implementation of a converter which, in my opinion, is not so simple to implement with the current Arrow API.

If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard Arrow libraries. This will have the advantage of showing whether such an approach raises interest among Arrow users.
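To illustrate the difference in directness, here is a minimal sketch, assuming the Go Arrow API (module path github.com/apache/arrow/go/v10 here; adjust to your version) and a made-up Event row type: the JSON mapping is mechanical, whereas producing an Arrow record already requires an explicit schema, typed per-column builders, and a batch.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
	"github.com/apache/arrow/go/v10/arrow/array"
	"github.com/apache/arrow/go/v10/arrow/memory"
)

// Event is a made-up row type standing in for one element of a row-oriented stream.
type Event struct {
	Name  string  `json:"name"`
	Value float64 `json:"value"`
}

func main() {
	rows := []Event{{"cpu.usage", 0.42}, {"mem.usage", 0.87}}

	// Row-oriented target: the mapping is mechanical, no schema or batching needed.
	j, _ := json.Marshal(rows)
	fmt.Println(string(j))

	// Arrow target: an explicit schema, per-column builders, and a batch are required.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "name", Type: arrow.BinaryTypes.String},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)
	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer b.Release()
	for _, r := range rows {
		b.Field(0).(*array.StringBuilder).Append(r.Name)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
	}
	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec)
}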
Best,
Laurent

On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

Hi Laurent,

I agree that there is a common pattern in converting row-based formats to Arrow.

Imho the difficult part is not mapping the storage format to Arrow specifically; it is mapping the storage format to any in-memory (row- or columnar-based) format, since that requires in-depth knowledge of both formats (the source format and the target format).

> Understanding the Arrow API which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).

The developer would have the same problem, just shifted around: they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory representation".

> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.

Imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).

A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower last time I played with this problem in JSON [1]).

A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize an agnostic spec for that representation (so languages agree on it), and thus essentially declare a new (row-based) format.

(None of this precludes efforts to invent an in-memory row format for analytics workloads.)

@Wes McKinney <wesmck...@gmail.com>

> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.

Afaik Avro does not have O(1) random access to either its rows or its columns: records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
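To make the allocation point concrete, here is a hypothetical generic row representation in Go (purely illustrative; not the proposed AIR and not an existing Arrow API): every row, nested struct, list, and boxed value is its own heap allocation that the converter must build and then traverse before a single Arrow buffer is filled.

package rowrepr

// Hypothetical generic, self-describing row values; illustrative only.
type Value interface{ isValue() }

type String struct{ V string }                // boxing into Value allocates; the bytes are separate again
type Float64 struct{ V float64 }              // even scalars get boxed when stored as Value
type Struct struct{ Fields map[string]Value } // one map allocation per nested struct
type List struct{ Items []Value }             // one slice allocation per list

func (String) isValue()  {}
func (Float64) isValue() {}
func (Struct) isValue()  {}
func (List) isValue()    {}

// One Row per input record: the storage -> row in-memory -> Arrow path has to
// allocate and walk all of this before any columnar buffer is written.
type Row struct{ Root Struct }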
[1] https://github.com/jorgecarleitao/arrow2/pull/1024

Best,
Jorge

On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:

Let me clarify the proposal a bit before replying to the previous feedback.

It seems to me that the process of converting a row-oriented data source (row = set of fields, or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must perform this kind of transformation is confronted with the following questions and problems:

- Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
- Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g. OpenTelemetry data sources).
- Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression ratio (e.g. defining which columns should be represented as dictionaries).

From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio while remaining perfectly usable and queryable is a long and tedious process.

I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:

- Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
- Definition of an integration library that maps any row-oriented source into a generic row-oriented source understood by the converter. This is not about defining a unique in-memory format but about defining a standard API to represent row-oriented data.

In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it requires agreement on this generic row-oriented format, and its adoption will take time. The second approach does not directly require the collaboration of data source providers, but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.

The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
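As a rough sketch of what the integrator-facing side of this second approach could look like (the type and method names below are hypothetical and simplified; the actual pkg/air API may differ), the developer only describes each row with generic fields, and a repository groups records by schema signature before the translator builds the Arrow batches:

package main

import "fmt"

// Hypothetical, minimal stand-ins for an integrator-facing row API; not the real pkg/air types.
type Field struct {
	Name  string
	Value interface{}
}

type Record struct{ Fields []Field }

func (r *Record) AppendString(name, v string) {
	r.Fields = append(r.Fields, Field{Name: name, Value: v})
}

func (r *Record) AppendF64(name string, v float64) {
	r.Fields = append(r.Fields, Field{Name: name, Value: v})
}

// RecordRepository groups records by a schema signature (here: the field names).
// In the real framework, each group would then be batched into an Arrow Record,
// with dictionary encoding and multi-column sorting applied by the translator (not shown).
type RecordRepository struct{ groups map[string][]*Record }

func NewRecordRepository() *RecordRepository {
	return &RecordRepository{groups: map[string][]*Record{}}
}

func (rr *RecordRepository) Add(r *Record) {
	sig := ""
	for _, f := range r.Fields {
		sig += f.Name + ";" // schema signature: concatenation of field names
	}
	rr.groups[sig] = append(rr.groups[sig], r)
}

func main() {
	repo := NewRecordRepository()

	// The integrator describes rows generically; schema inference and batching are the translator's job.
	rec := &Record{}
	rec.AppendString("name", "GET /checkout")
	rec.AppendF64("duration_ms", 12.7)
	repo.Add(rec)

	fmt.Println(len(repo.groups)) // one batch group so far
}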
You can find multiple examples of such a transformation here:

- https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go : this example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
- A more complex example can be found here: https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go. In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically translated into Apache Arrow Records.

In these two examples, the creation of dictionaries and the multi-column sorting are done automatically by the framework, and the developer doesn't have to worry about the definition of Arrow schemas.

Now let's get to the answers.

@David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language specific.

@Micah Kornfield, as described before, my goal is not to define a memory layout specification but rather to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and convert it into one or more Arrow records.

@Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.

Looking forward to your feedback.

On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:

We had an e-mail thread about this in 2018:

https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm

I still think having a canonical in-memory row format (and libraries to transform to and from the Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.

On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example instead).
Thanks,
Micah

On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:

In the context of this OTEP <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md> (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex, and their corresponding Arrow schema must be defined on the fly. IMO, this approach is not specific to my needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of conversion to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format and makes it possible to benefit from functionality such as dictionary encoding "for free", the automatic generation of Arrow schemas, batching, multi-column sorting, etc.

I know that JSON can be used as a kind of intermediate representation in the context of Arrow with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The goal of this proposal is to progressively cover these gaps.

I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.

10K ft overview of the current implementation:

1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported type, and struct of any supported types. Additional Arrow types could be added progressively.
2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository first computes a schema signature and routes the record to a RecordBatcher based on this signature.
3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" builds an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher can convert string columns to dictionaries based on a configuration (see the schema sketch at the end of this thread). Another configuration option evaluates which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...).

The current Go implementation (WIP) is part of this repo: https://github.com/lquerel/otel-arrow-adapter (see the pkg/air package). If the community is interested, I could open a PR in the Arrow Go and Rust sub-projects.

--
Laurent Quérel
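As a closing illustration of step 3 above, assuming arrow-go v10 and made-up column names (the actual schemas produced by pkg/air may differ), a batch of compatible AIR records could surface as an Arrow record whose low-cardinality string columns are dictionary-encoded:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
)

func main() {
	// Hypothetical schema the batcher could emit for a trace-like entity:
	// the low-cardinality string column is dictionary-encoded, the rest stay plain.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "trace_id", Type: arrow.BinaryTypes.Binary},
		{Name: "name", Type: &arrow.DictionaryType{
			IndexType: arrow.PrimitiveTypes.Uint16,
			ValueType: arrow.BinaryTypes.String,
		}},
		{Name: "start_time_unix_nano", Type: arrow.PrimitiveTypes.Int64},
	}, nil)
	fmt.Println(schema)
}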