Hi everyone, I just wanted to chime in that we already have a form of row-oriented storage in `arrow/compute/row/row_internal.h`. It is used to store rows inside GroupBy and Join within Acero. We also have utilities for converting to/from columnar storage (including AVX2 implementations of these conversions) in `arrow/compute/row/encode_internal.h`. Would it be useful to standardize this row-oriented format?
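(To make the row-vs-columnar pivot concrete for anyone less familiar with it, here is a rough user-level Go sketch that turns a slice of fixed-width row structs into a columnar record using the arrow-go builder APIs. It has nothing to do with the Acero-internal layout in `row_internal.h`; the `sensorRow` type, its field names, and the module version are made up for illustration.)

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// sensorRow stands in for a user's fixed-width "array of structs" row type.
// The Tag field is the variable-width part that needs extra handling.
type sensorRow struct {
	ID    int64
	Value float64
	Tag   string
}

// rowsToRecord pivots a slice of rows into a columnar Arrow record.
func rowsToRecord(rows []sensorRow) arrow.Record {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
		{Name: "tag", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer b.Release()

	// One pass over the rows, appending each field to its column builder.
	for _, r := range rows {
		b.Field(0).(*array.Int64Builder).Append(r.ID)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
		b.Field(2).(*array.StringBuilder).Append(r.Tag)
	}
	return b.NewRecord() // caller is responsible for Release()
}

func main() {
	rec := rowsToRecord([]sensorRow{{1, 0.5, "a"}, {2, 1.5, "b"}})
	defer rec.Release()
	fmt.Println(rec.NumRows(), "rows,", rec.NumCols(), "columns")
}
```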
As far as I understand, fixed-width rows would be trivially convertible into that internal row format (just a pointer to your array of structs), while variable-width rows would need a little bit of massaging (though not too much) to be put into it. Sasha Krassovsky > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com> wrote: > > Thank you Micah for a very clear summary of the intent behind this > proposal. Indeed, I think that clarifying from the beginning that this > approach aims at facilitating experimentation rather than raw performance > of the transformation phase would have helped people better > understand my objective. > > Regarding your question, I don't think there is a specific technical reason > for such an integration in the core library. I was just thinking that it > would make this infrastructure easier for users to find and that this > topic was general enough to find its place in the standard library. > > Best, > Laurent > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Hi Laurent, >> I'm retitling this thread to include the specific languages you seem to be >> targeting in the subject line to hopefully get more eyes from maintainers >> in those languages. >> >> Thanks for clarifying the goals. If I can restate my understanding, the >> intended use case here is to provide easy (from the developer's point of >> view) adaptation of row-based formats to Arrow. The means of achieving >> this is creating an API for a row-based structure, and having utility >> classes that can manipulate the interface to build up batches (there is no >> serialization or in-memory spec associated with this API). People wishing >> to integrate a specific row-based format can extend that API at whatever >> level makes sense for the format. >> >> I think this would be useful infrastructure as long as it was made clear >> that in many cases this wouldn't be the most efficient way to convert to >> Arrow from other formats. >> >> I don't work much with either the Rust or Go implementation, so I can't >> speak to whether there is maintainer support for incorporating the changes >> directly in Arrow. Are there any technical reasons for preferring to have >> this included directly in Arrow vs. a separate library? >> >> Cheers, >> Micah >> >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <laurent.que...@gmail.com> >> wrote: >> >>> Far be it from me to think that I know more than Jorge or Wes on this >>> subject. Sorry if my post gives that perception; that is clearly not my >>> intention. I'm just trying to defend the idea that when designing this >> kind >>> of transformation, it might be interesting to have a library to test >>> several mappings and evaluate them before doing a more direct >>> implementation if the performance is not there. >>> >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett < >>> benjaminblodg...@gmail.com> wrote: >>> >>>> He was trying to nicely say he knows way more than you, and your ideas >>>> will result in a low-performance scheme no one will use in production >>>> AI/machine learning. >>>> >>>> Sent from my iPhone >>>> >>>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett < >>>> benjaminblodg...@gmail.com> wrote: >>>>> >>>>> I think Jorge’s opinion is that of an expert and his being >> humble >>>> is just him being tactful. 
Probably listen to Jorge on performance and >>>> architecture, even over Wes as he’s contributed more than anyone else >> and >>>> know the bleeding edge of low level performance stuff more than anyone. >>>>> >>>>> Sent from my iPhone >>>>> >>>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel < >>> laurent.que...@gmail.com> >>>> wrote: >>>>>> >>>>>> Hi Jorge >>>>>> >>>>>> I don't think that the level of in-depth knowledge needed is the >> same >>>>>> between using a row-oriented internal representation and "Arrow" >> which >>>> not >>>>>> only changes the organization of the data but also introduces a set >> of >>>>>> additional mapping choices and concepts. >>>>>> >>>>>> For example, assuming that the initial row-oriented data source is a >>>> stream >>>>>> of nested assembly of structures, lists and maps. The mapping of >> such >>> a >>>>>> stream to Protobuf, JSON, YAML, ... is straightforward because on >> both >>>>>> sides the logical representation is exactly the same, the schema is >>>>>> sometimes optional, the interest of building batches is optional, >> ... >>> In >>>>>> the case of "Arrow" things are different - the schema and the >> batching >>>> are >>>>>> mandatory. The mapping is not necessarily direct and will generally >> be >>>> the >>>>>> result of the combination of several trade-offs (normalization vs >>>>>> denormalization representation, mapping influencing the compression >>>> rate, >>>>>> queryability with Arrow processors like DataFusion, ...). Note that >>>> some of >>>>>> these complexities are not intrinsically linked to the fact that the >>>> target >>>>>> format is column oriented. The ZST format ( >>>>>> https://zed.brimdata.io/docs/formats/zst/) for example does not >>>> require an >>>>>> explicit schema definition. >>>>>> >>>>>> IMHO, having a library that allows you to easily experiment with >>>> different >>>>>> types of mapping (without having to worry about batching, >>> dictionaries, >>>>>> schema definition, understanding how lists of structs are >> represented, >>>> ...) >>>>>> and to evaluate the results according to your specific goals has a >>> value >>>>>> (especially if your criteria are compression ratio and >> queryability). >>> Of >>>>>> course there is an overhead to such an approach. In some cases, at >> the >>>> end >>>>>> of the process, it will be necessary to manually perform this direct >>>>>> transformation between a row-oriented XYZ format and "Arrow". >> However, >>>> this >>>>>> effort will be done after a simple experimentation phase to avoid >>>> changes >>>>>> in the implementation of the converter which in my opinion is not so >>>> simple >>>>>> to implement with the current Arrow API. >>>>>> >>>>>> If the Arrow developer community is not interested in integrating >> this >>>>>> proposal, I plan to release two independent libraries (Go and Rust) >>> that >>>>>> can be used on top of the standard "Arrow" libraries. This will have >>> the >>>>>> advantage to evaluate if such an approach is able to raise interest >>>> among >>>>>> Arrow users. >>>>>> >>>>>> Best, >>>>>> >>>>>> Laurent >>>>>> >>>>>> >>>>>> >>>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão < >>>>>>> jorgecarlei...@gmail.com> wrote: >>>>>>> >>>>>>> Hi Laurent, >>>>>>> >>>>>>> I agree that there is a common pattern in converting row-based >>> formats >>>> to >>>>>>> Arrow. 
>>>>>>> >>>>>>> Imho the difficult part is not to map the storage format to Arrow >>>>>>> specifically - it is to map the storage format to any in-memory >> (row- >>>> or >>>>>>> columnar- based) format, since it requires in-depth knowledge about >>>> the 2 >>>>>>> formats (the source format and the target format). >>>>>>> >>>>>>> - Understanding the Arrow API which can be challenging for complex >>>> cases of >>>>>>>> rows representing complex objects (list of struct, struct of >> struct, >>>>>>> ...). >>>>>>>> >>>>>>> >>>>>>> the developer would have the same problem - just shifted around - >>> they >>>> now >>>>>>> need to convert their complex objects to the intermediary >>>> representation. >>>>>>> Whether it is more "difficult" or "complex" to learn than Arrow is >> an >>>> open >>>>>>> question, but we would essentially be shifting the problem from >>>> "learning >>>>>>> Arrow" to "learning the Intermediate in-memory". >>>>>>> >>>>>>> @Micah Kornfield, as described before my goal is not to define a >>> memory >>>>>>>> layout specification but more to define an API and a translation >>>>>>> mechanism >>>>>>>> able to take this intermediate representation (list of generic >>> objects >>>>>>>> representing the entities to translate) and to convert it into one >>> or >>>>>>> more >>>>>>>> Arrow records. >>>>>>>> >>>>>>> >>>>>>> imho a spec of "list of generic objects representing the entities" >> is >>>>>>> specified by an in-memory format (not by an API spec). >>>>>>> >>>>>>> A second challenge I anticipate is that in-memory formats >> inneerently >>>> "own" >>>>>>> the memory they outline (since by definition they outline how this >>>> memory >>>>>>> is outlined). An Intermediate in-memory representation would be no >>>>>>> different. Since row-based formats usually require at least one >>>> allocation >>>>>>> per row (and often more for variable-length types), the >>> transformation >>>>>>> (storage format -> row-based in-memory format -> Arrow) incurs a >>>>>>> significant cost (~2x slower last time I played with this problem >> in >>>> JSON >>>>>>> [1]). >>>>>>> >>>>>>> A third challenge I anticipate is that given that we have 10+ >>>> languages, we >>>>>>> would eventually need to convert the intermediary representation >>> across >>>>>>> languages, which imo just hints that we would need to formalize an >>>> agnostic >>>>>>> spec for such representation (so languages agree on its >>>> representation), >>>>>>> and thus essentially declare a new (row-based) format. >>>>>>> >>>>>>> (none of this precludes efforts to invent an in-memory row format >> for >>>>>>> analytics workloads) >>>>>>> >>>>>>> @Wes McKinney <wesmck...@gmail.com> >>>>>>> >>>>>>> I still think having a canonical in-memory row format (and >> libraries >>>>>>>> to transform to and from Arrow columnar format) is a good idea — >> but >>>>>>>> there is the risk of ending up in the tar pit of reinventing Avro. >>>>>>>> >>>>>>> >>>>>>> afaik Avro does not have O(1) random access neither to its rows nor >>>> columns >>>>>>> - records are concatenated back to back, every record's column is >>>>>>> concatenated back to back within a record, and there is no indexing >>>>>>> information on how to access a particular row or column. There are >>>> blocks >>>>>>> of rows that reduce the cost of accessing large offsets, but imo it >>> is >>>> far >>>>>>> from the O(1) offered by Arrow (and expected by analytics >> workloads). 
>>>>>>> >>>>>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024 >>>>>>> >>>>>>> Best, >>>>>>> Jorge >>>>>>> >>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel < >>>> laurent.que...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Let me clarify the proposal a bit before replying to the various >>>> previous >>>>>>>> feedbacks. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> It seems to me that the process of converting a row-oriented data >>>> source >>>>>>>> (row = set of fields or something more hierarchical) into an Arrow >>>> record >>>>>>>> repeatedly raises the same challenges. A developer who must >> perform >>>> this >>>>>>>> kind of transformation is confronted with the following questions >>> and >>>>>>>> problems: >>>>>>>> >>>>>>>> - Understanding the Arrow API which can be challenging for complex >>>> cases >>>>>>> of >>>>>>>> rows representing complex objects (list of struct, struct of >> struct, >>>>>>> ...). >>>>>>>> >>>>>>>> - Decide which Arrow schema(s) will correspond to your data >> source. >>> In >>>>>>> some >>>>>>>> complex cases it can be advantageous to translate the same >>>> row-oriented >>>>>>>> data source into several Arrow schemas (e.g. OpenTelementry data >>>>>>> sources). >>>>>>>> >>>>>>>> - Decide on the encoding of the columns to make the most of the >>>>>>>> column-oriented format and thus increase the compression rate >> (e.g. >>>>>>> define >>>>>>>> the columns that should be represent as dictionaries). >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> By experience, I can attest that this process is usually >> iterative. >>>> For >>>>>>>> non-trivial data sources, arriving at the arrow representation >> that >>>>>>> offers >>>>>>>> the best compression ratio and is still perfectly usable and >>> queryable >>>>>>> is a >>>>>>>> long and tedious process. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I see two approaches to ease this process and consequently >> increase >>>> the >>>>>>>> adoption of Apache Arrow: >>>>>>>> >>>>>>>> - Definition of a canonical in-memory row format specification >> that >>>> every >>>>>>>> row-oriented data source provider can progressively adopt to get >> an >>>>>>>> automatic translation into the Arrow format. >>>>>>>> >>>>>>>> - Definition of an integration library allowing to map any >>>> row-oriented >>>>>>>> source into a generic row-oriented source understood by the >>>> converter. It >>>>>>>> is not about defining a unique in-memory format but more about >>>> defining a >>>>>>>> standard API to represent row-oriented data. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In my opinion these two approaches are complementary. The first >>> option >>>>>>> is a >>>>>>>> long-term approach targeting directly the data providers, which >> will >>>>>>>> require to agree on this generic row-oriented format and whose >>>> adoption >>>>>>>> will be more or less long. The second approach does not directly >>>> require >>>>>>>> the collaboration of data source providers but allows an >>> "integrator" >>>> to >>>>>>>> perform this transformation painlessly with potentially several >>>>>>>> representation trials to achieve the best results in his context. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The current proposal is an implementation of the second approach, >>>> i.e. an >>>>>>>> API that maps a row-oriented source XYZ into an intermediate >>>> row-oriented >>>>>>>> representation understood mechanically by the translator. This >>>> translator >>>>>>>> also adds a series of optimizations to make the most of the Arrow >>>> format. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> You can find multiple examples of a such transformation in the >>>> following >>>>>>>> examples: >>>>>>>> >>>>>>>> - >>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go >>>>>>>> this example converts OTEL trace entities into their >> corresponding >>>>>>> Arrow >>>>>>>> IR. At the end of this conversion the method returns a collection >>> of >>>>>>>> Arrow >>>>>>>> Records. >>>>>>>> - A more complex example can be found here >>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go >>>>>>>> . >>>>>>>> In this example a stream of OTEL univariate row-oriented metrics >>> are >>>>>>>> translate into multivariate row-oriented metrics and then >>>>>>> automatically >>>>>>>> translated into Apache Records. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In these two examples, the creation of dictionaries and >> multi-column >>>>>>>> sorting is automatically done by the framework and the developer >>>> doesn’t >>>>>>>> have to worry about the definition of Arrow schemas. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Now let's get to the answers. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this >>> problem >>>>>>>> particularly well. Parquet is a column-oriented data file format >> and >>>>>>>> doesn't really help to perform this transformation. The Python >>> method >>>> is >>>>>>>> relatively limited and language specific. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @Micah Kornfield, as described before my goal is not to define a >>>> memory >>>>>>>> layout specification but more to define an API and a translation >>>>>>> mechanism >>>>>>>> able to take this intermediate representation (list of generic >>> objects >>>>>>>> representing the entities to translate) and to convert it into one >>> or >>>>>>> more >>>>>>>> Arrow records. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @Wes McKinney, If I interpret your answer correctly, I think you >> are >>>>>>>> describing the option 1 mentioned above. Like you I think it is an >>>>>>>> interesting approach although complementary to the one I propose. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Looking forward to your feedback. >>>>>>>> >>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com >>> >>>>>>> wrote: >>>>>>>> >>>>>>>>> We had an e-mail thread about this in 2018 >>>>>>>>> >>>>>>>>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm >>>>>>>>> >>>>>>>>> I still think having a canonical in-memory row format (and >>> libraries >>>>>>>>> to transform to and from Arrow columnar format) is a good idea — >>> but >>>>>>>>> there is the risk of ending up in the tar pit of reinventing >> Avro. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield < >>>> emkornfi...@gmail.com >>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate >>>>>>>>>> Representation (AIR)" is? We've talked about in the past maybe >>>>>>> having >>>>>>>> a >>>>>>>>>> memory layout specification for row-based data as well as column >>>>>>> based >>>>>>>>>> data. There was also a recent attempt at least in C++ to try to >>>>>>> build >>>>>>>>>> utilities to do these pivots but it was decided that it didn't >> add >>>>>>> much >>>>>>>>>> utility (it was added a comprehensive example). 
>>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Micah >>>>>>>>>> >>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel < >>>>>>>> laurent.que...@gmail.com >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> In the context of this OTEP >>>>>>>>>>> < >>>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md >>>>>>>>>>>> >>>>>>>>>>> (OpenTelemetry >>>>>>>>>>> Enhancement Proposal) I developed an integration layer on top >> of >>>>>>>> Apache >>>>>>>>>>> Arrow (Go and Rust) to *facilitate the translation of >> row-oriented >>>>>>>> data >>>>>>>>>>> streams into an Arrow-based columnar representation*. In this >>>>>>>> particular >>>>>>>>>>> case the goal was to translate all OpenTelemetry entities >>> (metrics, >>>>>>>>> logs, >>>>>>>>>>> or traces) into Apache Arrow records. These entities can be >> quite >>>>>>>>> complex >>>>>>>>>>> and their corresponding Arrow schema must be defined on the >> fly. >>>>>>> IMO, >>>>>>>>> this >>>>>>>>>>> approach is not specific to my needs but could be used >>> in >>>>>>>> many >>>>>>>>>>> other contexts where there is a need to simplify the >> integration >>>>>>>>> between a >>>>>>>>>>> row-oriented source of data and Apache Arrow. The trade-off is >>>>>>>> having >>>>>>>>> to >>>>>>>>>>> perform the additional step of conversion to the intermediate >>>>>>>>>>> representation, but this transformation does not require >>>>>>>> understanding >>>>>>>>> the >>>>>>>>>>> arcana of the Arrow format and potentially provides >>>>>>>>>>> functionality such as dictionary encoding "for >>> free", >>>>>>>> the >>>>>>>>>>> automatic generation of Arrow schemas, batching, >>>>>>> multi-column >>>>>>>>>>> sorting, etc. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I know that JSON can be used as a kind of intermediate >>>>>>> representation >>>>>>>>> in >>>>>>>>>>> the context of Arrow with some language-specific >> implementations. >>>>>>>>> Current >>>>>>>>>>> JSON integrations are insufficient to cover the most complex >>>>>>>> scenarios >>>>>>>>> and >>>>>>>>>>> are not standardized; e.g. support for most of the Arrow data >>> types, >>>>>>>>> various >>>>>>>>>>> optimizations (string|binary dictionaries, multi-column >> sorting), >>>>>>>>> batching, >>>>>>>>>>> integration with Arrow IPC, compression ratio optimization, ... >>> The >>>>>>>>> objective >>>>>>>>>>> of this proposal is to progressively cover these gaps. >>>>>>>>>>> >>>>>>>>>>> I am looking to see if the community would be interested in >> such >>> a >>>>>>>>>>> contribution. Below are some additional details on the current >>>>>>>>>>> implementation. All feedback is welcome. >>>>>>>>>>> >>>>>>>>>>> 10K ft overview of the current implementation: >>>>>>>>>>> >>>>>>>>>>> 1. Developers convert their row-oriented stream into records >>>>>>> based >>>>>>>>> on >>>>>>>>>>> the Arrow Intermediate Representation (AIR). At this stage the >>>>>>>>>>> translation >>>>>>>>>>> can be quite mechanical, but if needed developers can decide, >> for >>>>>>>>> example, >>>>>>>>>>> to >>>>>>>>>>> translate a map into a struct if that makes sense for them. >> The >>>>>>>>> current >>>>>>>>>>> implementation supports the following Arrow data types: bool, >> all >>>>>>>>> uints, >>>>>>>>>>> all >>>>>>>>>>> ints, all floats, string, binary, list of any supported types, >>>>>>> and >>>>>>>>>>> struct >>>>>>>>>>> of any supported types. Additional Arrow types could be added >>>>>>>>>>> progressively. >>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to >> a >>>>>>>>>>> RecordRepository. This repository will first compute a schema >>>>>>>>> signature >>>>>>>>>>> and >>>>>>>>>>> will route the record to a RecordBatcher based on this >>>>>>> signature. >>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the >>>>>>>>> compatible >>>>>>>>>>> AIR records and, upon request, the "batcher" is able to build >> an >>>>>>>>> Arrow >>>>>>>>>>> Record representing a batch of compatible inputs. In the >> current >>>>>>>>>>> implementation, the batcher is able to convert string columns >> to >>>>>>>>>>> dictionaries >>>>>>>>>>> based on a configuration. Another configuration allows >>>>>>> evaluating >>>>>>>>> which >>>>>>>>>>> columns should be sorted to optimize the compression ratio. >> The >>>>>>>> same >>>>>>>>>>> optimization process could be applied to binary columns. >>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same >>>>>>> RecordRepository >>>>>>>>>>> instance to build new sets of Arrow record batches. Subsequent >>>>>>>>>>> iterations >>>>>>>>>>> will be slightly faster due to different techniques used (e.g. >>>>>>>>> object >>>>>>>>>>> reuse, dictionary reuse and sorting, ...) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The current Go implementation >>>>>>>>>>> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is >>> currently >>>>>>>>> part of >>>>>>>>>>> this repo (see the pkg/air package). If the community is >> interested, >>> I >>>>>>>>> could do >>>>>>>>>>> a PR in the Arrow Go and Rust sub-projects. >>>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Laurent Quérel >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Laurent Quérel >>>> >>> >>> >>> -- >>> Laurent Quérel >>> >> > > > -- > Laurent Quérel
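
P.S. For anyone skimming the quoted proposal above, here is a small, self-contained Go sketch of the repository/batcher flow described in steps 1 through 4, as I understand it from the description. All of the names (`airRecord`, `recordRepository`, `AddRecord`, `BuildRecords`) and the schema-signature logic are invented for illustration and are not the actual pkg/air API; real AIR records, dictionary encoding, and multi-column sorting are much richer than this. The point is just the shape of the flow: route by schema signature, accumulate, then pivot each group into an Arrow record.

```go
package main

import (
	"fmt"
	"sort"
	"strings"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// airRecord is a toy stand-in for an AIR record. Only int64 and string
// fields are handled in this sketch.
type airRecord map[string]interface{}

// schemaSignature derives a canonical signature from field names and types,
// so records with the same shape are routed to the same batcher (step 2).
func schemaSignature(r airRecord) string {
	parts := make([]string, 0, len(r))
	for name, v := range r {
		parts = append(parts, fmt.Sprintf("%s:%T", name, v))
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// recordBatcher accumulates compatible AIR records (step 3).
type recordBatcher struct{ rows []airRecord }

// recordRepository routes incoming AIR records to a batcher keyed by their
// schema signature (step 2).
type recordRepository struct{ batchers map[string]*recordBatcher }

func newRecordRepository() *recordRepository {
	return &recordRepository{batchers: map[string]*recordBatcher{}}
}

func (rr *recordRepository) AddRecord(r airRecord) {
	sig := schemaSignature(r)
	if rr.batchers[sig] == nil {
		rr.batchers[sig] = &recordBatcher{}
	}
	rr.batchers[sig].rows = append(rr.batchers[sig].rows, r)
}

// BuildRecords pivots each batcher's rows into one Arrow record. Dictionary
// encoding and multi-column sorting (step 3) are left out of this sketch.
func (rr *recordRepository) BuildRecords() []arrow.Record {
	var out []arrow.Record
	for _, b := range rr.batchers {
		if len(b.rows) == 0 {
			continue
		}
		// Derive the Arrow schema from the first row of the batch.
		var names []string
		for name := range b.rows[0] {
			names = append(names, name)
		}
		sort.Strings(names)
		fields := make([]arrow.Field, len(names))
		for i, name := range names {
			switch b.rows[0][name].(type) {
			case int64:
				fields[i] = arrow.Field{Name: name, Type: arrow.PrimitiveTypes.Int64}
			case string:
				fields[i] = arrow.Field{Name: name, Type: arrow.BinaryTypes.String}
			default:
				panic("unsupported field type in this sketch")
			}
		}
		rb := array.NewRecordBuilder(memory.NewGoAllocator(), arrow.NewSchema(fields, nil))
		for _, row := range b.rows {
			for i, name := range names {
				switch v := row[name].(type) {
				case int64:
					rb.Field(i).(*array.Int64Builder).Append(v)
				case string:
					rb.Field(i).(*array.StringBuilder).Append(v)
				}
			}
		}
		out = append(out, rb.NewRecord())
		rb.Release()
		b.rows = b.rows[:0]
	}
	return out
}

func main() {
	repo := newRecordRepository()
	repo.AddRecord(airRecord{"name": "span-a", "duration_ms": int64(12)})
	repo.AddRecord(airRecord{"name": "span-b", "duration_ms": int64(7)})
	for _, rec := range repo.BuildRecords() {
		fmt.Println(rec.NumRows(), "rows, schema:", rec.Schema())
		rec.Release()
	}
}
```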