He was trying to nicely say that he knows way more than you, and that your ideas will result in a low-performance scheme no one will use in production AI/machine learning.
Sent from my iPhone

> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
>
> I think Jorge's opinion is that of an expert, and him being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.
>
> Sent from my iPhone
>
>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
>>
>> Hi Jorge,
>>
>> I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and "Arrow", which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.
>>
>> For example, assume that the initial row-oriented data source is a stream of nested assemblies of structs, lists, and maps. The mapping of such a stream to Protobuf, JSON, YAML, ... is straightforward because on both sides the logical representation is exactly the same, the schema is sometimes optional, batching is optional, ... In the case of "Arrow" things are different - the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of a combination of several trade-offs (normalized vs. denormalized representation, mapping choices that influence the compression ratio, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.
>>
>> IMHO, having a library that allows you to easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, understanding how lists of structs are represented, ...) and to evaluate the results against your specific goals has value (especially if your criteria are compression ratio and queryability). Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and "Arrow". However, this effort comes after a simple experimentation phase, avoiding repeated changes to a converter implementation which, in my opinion, is not so simple to build with the current Arrow API.
>>
>> If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard "Arrow" libraries. This will have the advantage of showing whether such an approach is able to raise interest among Arrow users.
>>
>> Best,
>>
>> Laurent
>>
>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>>>
>>> Hi Laurent,
>>>
>>> I agree that there is a common pattern in converting row-based formats to Arrow.
>>>
>>> Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar-based) format, since it requires in-depth knowledge about the two formats (the source format and the target format).
>>>
>>>> - Understanding the Arrow API which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
>>>
>>> the developer would have the same problem - just shifted around - they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory representation".
>>>
>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
>>>
>>> imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).
>>>
>>> A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower the last time I played with this problem in JSON [1]).
>>>
>>> A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize an agnostic spec for such a representation (so languages agree on its representation), and thus essentially declare a new (row-based) format.
>>>
>>> (none of this precludes efforts to invent an in-memory row format for analytics workloads)
>>>
>>> @Wes McKinney <wesmck...@gmail.com>
>>>
>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
>>>
>>> afaik Avro does not have O(1) random access to either its rows or its columns - records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
>>>
>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
>>>
>>> Best,
>>> Jorge
>>>
>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:
>>>
>>>> Let me clarify the proposal a bit before replying to the previous feedback.
>>>>
>>>> It seems to me that the process of converting a row-oriented data source (a row being a set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges.
>>>> A developer who must perform this kind of transformation is confronted with the following questions and problems:
>>>>
>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...) - see the builder sketch below for an illustration.
>>>>
>>>> - Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g. OpenTelemetry data sources).
>>>>
>>>> - Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression ratio (e.g. defining the columns that should be represented as dictionaries).
>>>>
>>>> From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio while remaining perfectly usable and queryable is a long and tedious process.
>>>>
>>>> I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:
>>>>
>>>> - Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
>>>>
>>>> - Definition of an integration library allowing any row-oriented source to be mapped into a generic row-oriented source understood by the converter. It is not about defining a unique in-memory format but more about defining a standard API to represent row-oriented data.
>>>>
>>>> In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it will require agreement on this generic row-oriented format, and its adoption will take more or less time. The second approach does not directly require the collaboration of data source providers but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.
>>>>
>>>> The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
>>>>
>>>> You can find multiple examples of such a transformation here:
>>>>
>>>> - https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
>>>>   This example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
>>>> - A more complex example can be found here:
>>>>   https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
>>>>   In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically translated into Arrow Records.
>>>>
>>>> In these two examples, the creation of dictionaries and multi-column sorting is automatically done by the framework and the developer doesn't have to worry about the definition of Arrow schemas.
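
For contrast, here is a minimal sketch of roughly the hand-written builder code that the Arrow Go API requires today for a single list-of-struct column (assuming a v10-era API; the exact import path and version may differ):

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
	"github.com/apache/arrow/go/v10/arrow/array"
	"github.com/apache/arrow/go/v10/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// One column of type list<struct<name: utf8, value: float64>>.
	attrType := arrow.StructOf(
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
		arrow.Field{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	)

	listBldr := array.NewListBuilder(pool, attrType)
	defer listBldr.Release()

	// Nested builders must be fetched and type-asserted by hand.
	structBldr := listBldr.ValueBuilder().(*array.StructBuilder)
	nameBldr := structBldr.FieldBuilder(0).(*array.StringBuilder)
	valueBldr := structBldr.FieldBuilder(1).(*array.Float64Builder)

	// Append one row whose list contains two structs.
	listBldr.Append(true)
	structBldr.Append(true)
	nameBldr.Append("cpu")
	valueBldr.Append(0.42)
	structBldr.Append(true)
	nameBldr.Append("mem")
	valueBldr.Append(0.13)

	col := listBldr.NewListArray()
	defer col.Release()
	fmt.Println(col)
}
```

Multiplied by every column and every nesting level, this is the per-source boilerplate the proposed integration layer aims to absorb.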
>>>>
>>>> Now let's get to the answers.
>>>>
>>>> @David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language specific.
>>>>
>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
>>>>
>>>> @Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.
>>>>
>>>> Looking forward to your feedback.
>>>>
>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> We had an e-mail thread about this in 2018:
>>>>>
>>>>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
>>>>>
>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
>>>>>
>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
>>>>>>
>>>>>>> In the context of this OTEP (OpenTelemetry Enhancement Proposal)
>>>>>>> <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
>>>>>>> I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my particular needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of converting to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format and potentially provides functionality such as dictionary encoding "for free", automatic generation of Arrow schemas, batching, multi-column sorting, etc.
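
As a purely illustrative sketch of what such an intermediate representation could look like in Go (all type and field names below are assumptions for illustration, not the actual API), a row-oriented entity might be expressed as nested generic values from which the converter derives the Arrow schema:

```go
// Package sketch gives an illustrative (hypothetical) shape for a generic
// row-oriented intermediate representation of the kind described above.
// Names are assumptions for illustration, not the actual pkg/air API.
package sketch

// Value is any supported intermediate value: scalars, lists, and structs.
type Value interface{ isValue() }

type String string
type I64 int64
type F64 float64
type Bool bool

// List and Struct allow arbitrary nesting (list of struct, struct of struct, ...).
type List []Value
type Struct map[string]Value

func (String) isValue() {}
func (I64) isValue()    {}
func (F64) isValue()    {}
func (Bool) isValue()   {}
func (List) isValue()   {}
func (Struct) isValue() {}

// Record is one row-oriented entity. No Arrow schema is declared up front;
// the converter derives the schema (and a schema signature) from the values.
type Record = Struct

// ExampleSpan shows one OTEL-like span expressed in this representation.
var ExampleSpan = Record{
	"trace_id":             String("4bf92f3577b34da6"),
	"start_time_unix_nano": I64(1659000000000000000),
	"attributes": List{
		Struct{"key": String("http.method"), "value": String("GET")},
		Struct{"key": String("http.status_code"), "value": I64(200)},
	},
}
```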
>>>>>>>
>>>>>>> I know that JSON can be used as a kind of intermediate representation in the context of Arrow, with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The object of this proposal is to progressively cover these gaps.
>>>>>>>
>>>>>>> I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.
>>>>>>>
>>>>>>> 10K ft overview of the current implementation:
>>>>>>>
>>>>>>> 1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported type, and struct of any supported types. Additional Arrow types could be added progressively.
>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository will first compute a schema signature and will route the record to a RecordBatcher based on this signature.
>>>>>>> 3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" is able to build an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher is able to convert string columns to dictionaries based on a configuration. Another configuration option allows evaluating which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
>>>>>>> 4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...). A rough end-to-end sketch of this flow is given at the end of this thread.
>>>>>>>
>>>>>>> The current Go implementation <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this repo (see the pkg/air package). If the community is interested, I could do a PR in the Arrow Go and Rust sub-projects.
>>>>>
>>>>
>>>> --
>>>> Laurent Quérel
>>>
>>
>> --
>> Laurent Quérel
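
To tie the 10K ft overview together, here is a rough Go sketch of the flow described in steps 1 through 4. The interfaces and names are hypothetical and only mirror the description above; the actual pkg/air API may differ:

```go
// Illustrative sketch of the repository/batcher flow from steps 1-4 above.
// All interfaces and names are hypothetical; the actual pkg/air API may differ.
package pipeline

import "github.com/apache/arrow/go/v10/arrow" // v10-era import path; adjust to the Arrow Go version in use

// Record stands in for the intermediate row-oriented record (see the earlier sketch).
type Record map[string]any

// BatcherConfig captures the optimizations mentioned in step 3.
type BatcherConfig struct {
	DictionaryColumns []string // string/binary columns to dictionary-encode
	SortColumns       []string // columns sorted to improve the compression ratio
}

// RecordRepository routes intermediate records to a per-schema-signature
// RecordBatcher (step 2) and asks the batchers to emit Arrow Records (step 3).
type RecordRepository interface {
	AddRecord(r Record)                    // computes a schema signature and routes the record
	BuildRecords() ([]arrow.Record, error) // one Arrow Record per distinct schema signature
}

// NewRepository is a hypothetical constructor shape taking the batcher configuration.
type NewRepository func(cfg BatcherConfig) RecordRepository

// ConvertStream shows the intended flow: add each row-oriented entity to the
// repository (steps 1-2) and periodically build Arrow Records (step 3). The
// same repository can be reused for subsequent batches (step 4).
func ConvertStream(repo RecordRepository, rows []Record) ([]arrow.Record, error) {
	for _, row := range rows {
		repo.AddRecord(row)
	}
	return repo.BuildRecords()
}
```

The design point the sketch tries to capture is the separation of concerns described in the thread: callers only produce intermediate records, while schema signatures, batching, dictionary encoding, and column sorting stay behind the repository/batcher boundary.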