He was trying to nicely say that he knows way more than you, and that your ideas will result in a low-performance scheme no one will use in production AI/machine learning.
Sent from my iPhone

> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
>
> I think Jorge's opinion is that of an expert, and him being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.
>
> Sent from my iPhone
>
>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
>>
>> Hi Jorge,
>>
>> I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and "Arrow", which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.
>>
>> For example, assume that the initial row-oriented data source is a stream of nested assemblies of structs, lists, and maps. The mapping of such a stream to Protobuf, JSON, YAML, ... is straightforward because on both sides the logical representation is exactly the same, the schema is sometimes optional, batching is optional, ... In the case of "Arrow" things are different - the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of a combination of several trade-offs (normalized vs. denormalized representation, mapping choices that influence the compression ratio, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.
>>
>> IMHO, having a library that allows you to easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, understanding how lists of structs are represented, ...) and to evaluate the results against your specific goals has value (especially if your criteria are compression ratio and queryability). Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and "Arrow". However, this effort comes after a simple experimentation phase, avoiding repeated changes to a converter implementation which, in my opinion, is not so simple to build with the current Arrow API.
>>
>> If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard "Arrow" libraries. This will have the advantage of showing whether such an approach is able to raise interest among Arrow users.
>>
>> Best,
>>
>> Laurent
>>
>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>>>
>>> Hi Laurent,
>>>
>>> I agree that there is a common pattern in converting row-based formats to Arrow.
>>>
>>> Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar-based) format, since it requires in-depth knowledge about the two formats (the source format and the target format).
>>>
>>>> - Understanding the Arrow API which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
>>>
>>> the developer would have the same problem - just shifted around - they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory representation".
>>>
>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
>>>
>>> imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).
>>>
>>> A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower the last time I played with this problem in JSON [1]).
>>>
>>> A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize an agnostic spec for such a representation (so languages agree on its representation), and thus essentially declare a new (row-based) format.
>>>
>>> (none of this precludes efforts to invent an in-memory row format for analytics workloads)
>>>
>>> @Wes McKinney <wesmck...@gmail.com>
>>>
>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
>>>
>>> afaik Avro does not have O(1) random access to either its rows or its columns - records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
>>>
>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
>>>
>>> Best,
>>> Jorge
>>>
>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:
>>>
>>>> Let me clarify the proposal a bit before replying to the previous feedback.
>>>>
>>>> It seems to me that the process of converting a row-oriented data source (a row being a set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges.
>>>> A developer who must perform this kind of transformation is confronted with the following questions and problems:
>>>>
>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...) - see the builder sketch below for an illustration.
>>>>
>>>> - Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g. OpenTelemetry data sources).
>>>>
>>>> - Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression ratio (e.g. defining the columns that should be represented as dictionaries).
>>>>
>>>> From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio while remaining perfectly usable and queryable is a long and tedious process.
>>>>
>>>> I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:
>>>>
>>>> - Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
>>>>
>>>> - Definition of an integration library allowing any row-oriented source to be mapped into a generic row-oriented source understood by the converter. It is not about defining a unique in-memory format but more about defining a standard API to represent row-oriented data.
>>>>
>>>> In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it will require agreement on this generic row-oriented format, and its adoption will take more or less time. The second approach does not directly require the collaboration of data source providers but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.
>>>>
>>>> The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
>>>>
>>>> You can find multiple examples of such a transformation here:
>>>>
>>>> - https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
>>>>   This example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
>>>> - A more complex example can be found here:
>>>>   https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
>>>>   In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically translated into Arrow Records.
>>>>
>>>> In these two examples, the creation of dictionaries and multi-column sorting is automatically done by the framework and the developer doesn't have to worry about the definition of Arrow schemas.
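
For contrast, here is a minimal sketch of roughly the hand-written builder code that the Arrow Go API requires today for a single list-of-struct column (assuming a v10-era API; the exact import path and version may differ):

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v10/arrow"
	"github.com/apache/arrow/go/v10/arrow/array"
	"github.com/apache/arrow/go/v10/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// One column of type list<struct<name: utf8, value: float64>>.
	attrType := arrow.StructOf(
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
		arrow.Field{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	)

	listBldr := array.NewListBuilder(pool, attrType)
	defer listBldr.Release()

	// Nested builders must be fetched and type-asserted by hand.
	structBldr := listBldr.ValueBuilder().(*array.StructBuilder)
	nameBldr := structBldr.FieldBuilder(0).(*array.StringBuilder)
	valueBldr := structBldr.FieldBuilder(1).(*array.Float64Builder)

	// Append one row whose list contains two structs.
	listBldr.Append(true)
	structBldr.Append(true)
	nameBldr.Append("cpu")
	valueBldr.Append(0.42)
	structBldr.Append(true)
	nameBldr.Append("mem")
	valueBldr.Append(0.13)

	col := listBldr.NewListArray()
	defer col.Release()
	fmt.Println(col)
}
```

Multiplied by every column and every nesting level, this is the per-source boilerplate the proposed integration layer aims to absorb.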
>>>>
>>>> Now let's get to the answers.
>>>>
>>>> @David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language specific.
>>>>
>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
>>>>
>>>> @Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.
>>>>
>>>> Looking forward to your feedback.
>>>>
>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> We had an e-mail thread about this in 2018:
>>>>>
>>>>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
>>>>>
>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
>>>>>
>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
>>>>>>
>>>>>>> In the context of this OTEP (OpenTelemetry Enhancement Proposal)
>>>>>>> <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
>>>>>>> I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my particular needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of converting to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format and potentially provides functionality such as dictionary encoding "for free", automatic generation of Arrow schemas, batching, multi-column sorting, etc.
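
As a purely illustrative sketch of what such an intermediate representation could look like in Go (all type and field names below are assumptions for illustration, not the actual API), a row-oriented entity might be expressed as nested generic values from which the converter derives the Arrow schema:

```go
// Package sketch gives an illustrative (hypothetical) shape for a generic
// row-oriented intermediate representation of the kind described above.
// Names are assumptions for illustration, not the actual pkg/air API.
package sketch

// Value is any supported intermediate value: scalars, lists, and structs.
type Value interface{ isValue() }

type String string
type I64 int64
type F64 float64
type Bool bool

// List and Struct allow arbitrary nesting (list of struct, struct of struct, ...).
type List []Value
type Struct map[string]Value

func (String) isValue() {}
func (I64) isValue()    {}
func (F64) isValue()    {}
func (Bool) isValue()   {}
func (List) isValue()   {}
func (Struct) isValue() {}

// Record is one row-oriented entity. No Arrow schema is declared up front;
// the converter derives the schema (and a schema signature) from the values.
type Record = Struct

// ExampleSpan shows one OTEL-like span expressed in this representation.
var ExampleSpan = Record{
	"trace_id":             String("4bf92f3577b34da6"),
	"start_time_unix_nano": I64(1659000000000000000),
	"attributes": List{
		Struct{"key": String("http.method"), "value": String("GET")},
		Struct{"key": String("http.status_code"), "value": I64(200)},
	},
}
```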
>>>>>>>
>>>>>>> I know that JSON can be used as a kind of intermediate representation in the context of Arrow, with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The object of this proposal is to progressively cover these gaps.
>>>>>>>
>>>>>>> I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.
>>>>>>>
>>>>>>> 10K ft overview of the current implementation:
>>>>>>>
>>>>>>> 1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported type, and struct of any supported types. Additional Arrow types could be added progressively.
>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository will first compute a schema signature and will route the record to a RecordBatcher based on this signature.
>>>>>>> 3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" is able to build an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher is able to convert string columns to dictionaries based on a configuration. Another configuration option allows evaluating which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
>>>>>>> 4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...). A rough end-to-end sketch of this flow is given at the end of this thread.
>>>>>>>
>>>>>>> The current Go implementation <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this repo (see the pkg/air package). If the community is interested, I could do a PR in the Arrow Go and Rust sub-projects.
>>>>>
>>>>
>>>> --
>>>> Laurent Quérel
>>>
>>
>> --
>> Laurent Quérel
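
To tie the 10K ft overview together, here is a rough Go sketch of the flow described in steps 1 through 4. The interfaces and names are hypothetical and only mirror the description above; the actual pkg/air API may differ:

```go
// Illustrative sketch of the repository/batcher flow from steps 1-4 above.
// All interfaces and names are hypothetical; the actual pkg/air API may differ.
package pipeline

import "github.com/apache/arrow/go/v10/arrow" // v10-era import path; adjust to the Arrow Go version in use

// Record stands in for the intermediate row-oriented record (see the earlier sketch).
type Record map[string]any

// BatcherConfig captures the optimizations mentioned in step 3.
type BatcherConfig struct {
	DictionaryColumns []string // string/binary columns to dictionary-encode
	SortColumns       []string // columns sorted to improve the compression ratio
}

// RecordRepository routes intermediate records to a per-schema-signature
// RecordBatcher (step 2) and asks the batchers to emit Arrow Records (step 3).
type RecordRepository interface {
	AddRecord(r Record)                    // computes a schema signature and routes the record
	BuildRecords() ([]arrow.Record, error) // one Arrow Record per distinct schema signature
}

// NewRepository is a hypothetical constructor shape taking the batcher configuration.
type NewRepository func(cfg BatcherConfig) RecordRepository

// ConvertStream shows the intended flow: add each row-oriented entity to the
// repository (steps 1-2) and periodically build Arrow Records (step 3). The
// same repository can be reused for subsequent batches (step 4).
func ConvertStream(repo RecordRepository, rows []Record) ([]arrow.Record, error) {
	for _, row := range rows {
		repo.AddRecord(row)
	}
	return repo.BuildRecords()
}
```

The design point the sketch tries to capture is the separation of concerns described in the thread: callers only produce intermediate records, while schema signatures, batching, dictionary encoding, and column sorting stay behind the repository/batcher boundary.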