Please note this message and the previous one from the author violate our Code of Conduct [1]. Specifically "Do not insult or put down other participants." Please try to be professional in communications and focus on the technical issues at hand.
[1] https://www.apache.org/foundation/policies/conduct.html

On Thu, Jul 28, 2022 at 12:16 PM Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:

He was trying to nicely say he knows way more than you, and your ideas will result in a low-performance scheme no one will use in production AI/machine learning.

Sent from my iPhone

On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:

I think Jorge's opinion is that of an expert, and him being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.

Sent from my iPhone

On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:

Hi Jorge,

I don't think the level of in-depth knowledge needed is the same between using a row-oriented internal representation and Arrow, which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.

For example, assume the initial row-oriented data source is a stream of nested assemblies of structs, lists, and maps. Mapping such a stream to Protobuf, JSON, YAML, etc. is straightforward because the logical representation is exactly the same on both sides, the schema is sometimes optional, and batching is optional. In the case of Arrow, things are different: the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of several trade-offs (normalized vs. denormalized representation, mappings that influence the compression ratio, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.

IMHO, having a library that lets you easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, how lists of structs are represented, ...) and evaluate the results against your specific goals has value, especially if your criteria are compression ratio and queryability. Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and Arrow. However, this effort will come after a simple experimentation phase, avoiding repeated changes to the implementation of a converter which, in my opinion, is not so simple to implement with the current Arrow API.

If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard Arrow libraries. This will have the advantage of showing whether such an approach raises interest among Arrow users.
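To illustrate the difference in directness, here is a minimal sketch, assuming the Go Arrow API (module path github.com/apache/arrow/go/v10 here; adjust to your version) and a made-up Event row type: the JSON mapping is mechanical, whereas producing an Arrow record already requires an explicit schema, typed per-column builders, and a batch.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
	"github.com/apache/arrow/go/v10/arrow/array"
	"github.com/apache/arrow/go/v10/arrow/memory"
)

// Event is a made-up row type standing in for one element of a row-oriented stream.
type Event struct {
	Name  string  `json:"name"`
	Value float64 `json:"value"`
}

func main() {
	rows := []Event{{"cpu.usage", 0.42}, {"mem.usage", 0.87}}

	// Row-oriented target: the mapping is mechanical, no schema or batching needed.
	j, _ := json.Marshal(rows)
	fmt.Println(string(j))

	// Arrow target: an explicit schema, per-column builders, and a batch are required.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "name", Type: arrow.BinaryTypes.String},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)
	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer b.Release()
	for _, r := range rows {
		b.Field(0).(*array.StringBuilder).Append(r.Name)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
	}
	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec)
}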
Best,
Laurent

On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

Hi Laurent,

I agree that there is a common pattern in converting row-based formats to Arrow.

Imho the difficult part is not mapping the storage format to Arrow specifically; it is mapping the storage format to any in-memory (row- or columnar-based) format, since that requires in-depth knowledge of both formats (the source format and the target format).

> Understanding the Arrow API which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).

The developer would have the same problem, just shifted around: they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory representation".

> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.

Imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).

A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower last time I played with this problem in JSON [1]).

A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize an agnostic spec for that representation (so languages agree on it), and thus essentially declare a new (row-based) format.

(None of this precludes efforts to invent an in-memory row format for analytics workloads.)

@Wes McKinney <wesmck...@gmail.com>

> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.

Afaik Avro does not have O(1) random access to either its rows or its columns: records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
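To make the allocation point concrete, here is a hypothetical generic row representation in Go (purely illustrative; not the proposed AIR and not an existing Arrow API): every row, nested struct, list, and boxed value is its own heap allocation that the converter must build and then traverse before a single Arrow buffer is filled.

package rowrepr

// Hypothetical generic, self-describing row values; illustrative only.
type Value interface{ isValue() }

type String struct{ V string }                // boxing into Value allocates; the bytes are separate again
type Float64 struct{ V float64 }              // even scalars get boxed when stored as Value
type Struct struct{ Fields map[string]Value } // one map allocation per nested struct
type List struct{ Items []Value }             // one slice allocation per list

func (String) isValue()  {}
func (Float64) isValue() {}
func (Struct) isValue()  {}
func (List) isValue()    {}

// One Row per input record: the storage -> row in-memory -> Arrow path has to
// allocate and walk all of this before any columnar buffer is written.
type Row struct{ Root Struct }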
[1] https://github.com/jorgecarleitao/arrow2/pull/1024

Best,
Jorge

On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:

Let me clarify the proposal a bit before replying to the previous feedback.

It seems to me that the process of converting a row-oriented data source (row = set of fields, or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must perform this kind of transformation is confronted with the following questions and problems:

- Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
- Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g. OpenTelemetry data sources).
- Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression ratio (e.g. defining which columns should be represented as dictionaries).

From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio while remaining perfectly usable and queryable is a long and tedious process.

I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:

- Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
- Definition of an integration library that maps any row-oriented source into a generic row-oriented source understood by the converter. This is not about defining a unique in-memory format but about defining a standard API to represent row-oriented data.

In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it requires agreement on this generic row-oriented format, and its adoption will take time. The second approach does not directly require the collaboration of data source providers, but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.

The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
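As a rough sketch of what the integrator-facing side of this second approach could look like (the type and method names below are hypothetical and simplified; the actual pkg/air API may differ), the developer only describes each row with generic fields, and a repository groups records by schema signature before the translator builds the Arrow batches:

package main

import "fmt"

// Hypothetical, minimal stand-ins for an integrator-facing row API; not the real pkg/air types.
type Field struct {
	Name  string
	Value interface{}
}

type Record struct{ Fields []Field }

func (r *Record) AppendString(name, v string) {
	r.Fields = append(r.Fields, Field{Name: name, Value: v})
}

func (r *Record) AppendF64(name string, v float64) {
	r.Fields = append(r.Fields, Field{Name: name, Value: v})
}

// RecordRepository groups records by a schema signature (here: the field names).
// In the real framework, each group would then be batched into an Arrow Record,
// with dictionary encoding and multi-column sorting applied by the translator (not shown).
type RecordRepository struct{ groups map[string][]*Record }

func NewRecordRepository() *RecordRepository {
	return &RecordRepository{groups: map[string][]*Record{}}
}

func (rr *RecordRepository) Add(r *Record) {
	sig := ""
	for _, f := range r.Fields {
		sig += f.Name + ";" // schema signature: concatenation of field names
	}
	rr.groups[sig] = append(rr.groups[sig], r)
}

func main() {
	repo := NewRecordRepository()

	// The integrator describes rows generically; schema inference and batching are the translator's job.
	rec := &Record{}
	rec.AppendString("name", "GET /checkout")
	rec.AppendF64("duration_ms", 12.7)
	repo.Add(rec)

	fmt.Println(len(repo.groups)) // one batch group so far
}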
You can find multiple examples of such a transformation here:

- https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go : this example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
- A more complex example can be found here: https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go. In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically translated into Apache Arrow Records.

In these two examples, the creation of dictionaries and the multi-column sorting are done automatically by the framework, and the developer doesn't have to worry about the definition of Arrow schemas.

Now let's get to the answers.

@David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language specific.

@Micah Kornfield, as described before, my goal is not to define a memory layout specification but rather to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and convert it into one or more Arrow records.

@Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.

Looking forward to your feedback.

On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:

We had an e-mail thread about this in 2018:

https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm

I still think having a canonical in-memory row format (and libraries to transform to and from the Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.

On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example instead).
Thanks,
Micah

On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:

In the context of this OTEP <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md> (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex, and their corresponding Arrow schema must be defined on the fly. IMO, this approach is not specific to my needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of conversion to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format and makes it possible to benefit from functionality such as dictionary encoding "for free", the automatic generation of Arrow schemas, batching, multi-column sorting, etc.

I know that JSON can be used as a kind of intermediate representation in the context of Arrow with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The goal of this proposal is to progressively cover these gaps.

I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.

10K ft overview of the current implementation:

1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported type, and struct of any supported types. Additional Arrow types could be added progressively.
2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository first computes a schema signature and routes the record to a RecordBatcher based on this signature.
3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" builds an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher can convert string columns to dictionaries based on a configuration (see the schema sketch at the end of this thread). Another configuration option evaluates which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...).

The current Go implementation (WIP) is part of this repo: https://github.com/lquerel/otel-arrow-adapter (see the pkg/air package). If the community is interested, I could open a PR in the Arrow Go and Rust sub-projects.

--
Laurent Quérel
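As a closing illustration of step 3 above, assuming arrow-go v10 and made-up column names (the actual schemas produced by pkg/air may differ), a batch of compatible AIR records could surface as an Arrow record whose low-cardinality string columns are dictionary-encoded:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
)

func main() {
	// Hypothetical schema the batcher could emit for a trace-like entity:
	// the low-cardinality string column is dictionary-encoded, the rest stay plain.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "trace_id", Type: arrow.BinaryTypes.Binary},
		{Name: "name", Type: &arrow.DictionaryType{
			IndexType: arrow.PrimitiveTypes.Uint16,
			ValueType: arrow.BinaryTypes.String,
		}},
		{Name: "start_time_unix_nano", Type: arrow.PrimitiveTypes.Int64},
	}, nil)
	fmt.Println(schema)
}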