Hi Gavin,

I was not aware of this initiative but indeed, these two proposals have much in common. The implementation I am working on is available here: https://github.com/lquerel/otel-arrow-adapter (directory pkg/air). I would be happy to get your feedback and to identify with you the possible gaps to cover your specific use case.

Best,
Laurent
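For readers who want to see what both proposals abstract away, here is a minimal sketch of the hand-written builder code a developer writes today to pivot row-oriented data into an Arrow record. It uses the Arrow Go library directly; the github.com/apache/arrow/go/v9 module path and the two-field metric schema are assumptions for this example only and are not the pkg/air API.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// The Arrow schema a row-oriented source has to be mapped to by hand today
	// (illustrative two-field metric schema, not the pkg/air schema).
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "name", Type: arrow.BinaryTypes.String},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)

	// A toy row-oriented input: one struct per row.
	rows := []struct {
		Name  string
		Value float64
	}{
		{"cpu.usage", 0.42},
		{"mem.used", 0.77},
	}

	// Manual row-to-column pivot through the builder API.
	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()
	for _, r := range rows {
		b.Field(0).(*array.StringBuilder).Append(r.Name)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
	}

	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec.NumRows(), "rows,", rec.NumCols(), "columns")
}
```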
On Thu, Jul 28, 2022 at 5:43 PM Gavin Ray <ray.gavi...@gmail.com> wrote:
> This is essentially the same idea as the proposal here, I think -- a row/map-based representation & conversion functions for ease of use:
> [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity. · Issue #12618 · apache/arrow <https://github.com/apache/arrow/issues/12618>
> Definitely a worthwhile pursuit IMO.
> On Thu, Jul 28, 2022 at 4:46 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > Hi everyone,
> > I just wanted to chime in that we already do have a form of row-oriented storage inside of `arrow/compute/row/row_internal.h`. It is used to store rows inside of GroupBy and Join within Acero. We also have utilities for converting to/from columnar storage (and AVX2 implementations of these conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be useful to standardize this row-oriented format?
> > As far as I understand, fixed-width rows would be trivially convertible into this representation (just a pointer to your array of structs), while variable-width rows would need a little bit of massaging (though not too much) to be put into this representation.
> > Sasha Krassovsky
> > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
> > > Thank you Micah for a very clear summary of the intent behind this proposal. Indeed, I think that clarifying from the beginning that this approach aims at facilitating experimentation more than at the performance of the transformation phase would have helped to better convey my objective.
> > > Regarding your question, I don't think there is a specific technical reason for such an integration in the core library. I was just thinking that it would make this infrastructure easier for users to find and that this topic was general enough to find its place in the standard library.
> > > Best,
> > > Laurent
> > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >> Hi Laurent,
> > >> I'm retitling this thread to include the specific languages you seem to be targeting in the subject line, to hopefully get more eyes from maintainers in those languages.
> > >> Thanks for clarifying the goals. If I can restate my understanding, the intended use case here is to provide easy (from the developer's point of view) adaptation of row-based formats to Arrow. The means of achieving this is creating an API for a row-based structure, and having utility classes that can manipulate the interface to build up batches (there is no serialization or in-memory spec associated with this API). People wishing to integrate a specific row-based format can extend that API at whatever level makes sense for the format.
> > >> I think this would be useful infrastructure as long as it was made clear that in many cases this wouldn't be the most efficient way to convert to Arrow from other formats.
> > >> I don't work much with either the Rust or Go implementation, so I can't speak to whether there is maintainer support for incorporating the changes directly in Arrow.
> > >> Are there any technical reasons for preferring to have this included directly in Arrow vs. a separate library?
> > >> Cheers,
> > >> Micah
> > >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>> Far be it from me to think that I know more than Jorge or Wes on this subject. Sorry if my post gives that impression, that is clearly not my intention. I'm just trying to defend the idea that when designing this kind of transformation, it might be interesting to have a library to test several mappings and evaluate them before doing a more direct implementation if the performance is not there.
> > >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
> > >>>> He was trying to nicely say he knows way more than you, and your ideas will result in a low-performance scheme no one will use in production AI/machine learning.
> > >>>> Sent from my iPhone
> > >>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <benjaminblodg...@gmail.com> wrote:
> > >>>>> I think Jorge's opinion is that of an expert, and him being humble is just him being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he's contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.
> > >>>>> Sent from my iPhone
> > >>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>> Hi Jorge,
> > >>>>>> I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and using "Arrow", which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts.
> > >>>>>> For example, assume that the initial row-oriented data source is a stream of nested assemblies of structures, lists, and maps. The mapping of such a stream to Protobuf, JSON, YAML, ... is straightforward because on both sides the logical representation is exactly the same, the schema is sometimes optional, the interest of building batches is optional, ... In the case of "Arrow" things are different - the schema and the batching are mandatory. The mapping is not necessarily direct and will generally be the result of the combination of several trade-offs (normalized vs. denormalized representation, mapping influencing the compression rate, queryability with Arrow processors like DataFusion, ...). Note that some of these complexities are not intrinsically linked to the fact that the target format is column-oriented. The ZST format (https://zed.brimdata.io/docs/formats/zst/), for example, does not require an explicit schema definition.
> > >>>>>> IMHO, having a library that allows you to easily experiment with different types of mapping (without having to worry about batching, dictionaries, schema definition, understanding how lists of structs are represented, ...) and to evaluate the results according to your specific goals has value (especially if your criteria are compression ratio and queryability). Of course there is an overhead to such an approach. In some cases, at the end of the process, it will be necessary to manually implement the direct transformation between a row-oriented XYZ format and "Arrow". However, this effort will come after a simple experimentation phase, avoiding repeated changes to a converter which, in my opinion, is not so simple to implement with the current Arrow API.
> > >>>>>> If the Arrow developer community is not interested in integrating this proposal, I plan to release two independent libraries (Go and Rust) that can be used on top of the standard "Arrow" libraries. This will have the advantage of evaluating whether such an approach is able to raise interest among Arrow users.
> > >>>>>> Best,
> > >>>>>> Laurent
> > >>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >>>>>>> Hi Laurent,
> > >>>>>>> I agree that there is a common pattern in converting row-based formats to Arrow.
> > >>>>>>> Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar-based) format, since it requires in-depth knowledge about the 2 formats (the source format and the target format).
> > >>>>>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
> > >>>>>>> The developer would have the same problem - just shifted around - they now need to convert their complex objects to the intermediary representation. Whether it is more "difficult" or "complex" to learn than Arrow is an open question, but we would essentially be shifting the problem from "learning Arrow" to "learning the intermediate in-memory format".
> > >>>>>>>> @Micah Kornfield, as described before my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
> > >>>>>>> Imho a spec of "list of generic objects representing the entities" is specified by an in-memory format (not by an API spec).
> > >>>>>>> A second challenge I anticipate is that in-memory formats inherently "own" the memory they describe (since by definition they specify how that memory is laid out). An intermediate in-memory representation would be no different. Since row-based formats usually require at least one allocation per row (and often more for variable-length types), the transformation (storage format -> row-based in-memory format -> Arrow) incurs a significant cost (~2x slower the last time I played with this problem in JSON [1]).
> > >>>>>>> A third challenge I anticipate is that, given that we have 10+ languages, we would eventually need to convert the intermediary representation across languages, which imo just hints that we would need to formalize a language-agnostic spec for such a representation (so languages agree on its representation), and thus essentially declare a new (row-based) format.
> > >>>>>>> (None of this precludes efforts to invent an in-memory row format for analytics workloads.)
> > >>>>>>> @Wes McKinney <wesmck...@gmail.com>
> > >>>>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
> > >>>>>>> Afaik Avro has O(1) random access to neither its rows nor its columns - records are concatenated back to back, every record's column is concatenated back to back within a record, and there is no indexing information on how to access a particular row or column. There are blocks of rows that reduce the cost of accessing large offsets, but imo it is far from the O(1) offered by Arrow (and expected by analytics workloads).
> > >>>>>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
> > >>>>>>> Best,
> > >>>>>>> Jorge
> > >>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>>>> Let me clarify the proposal a bit before replying to the various previous feedbacks.
> > >>>>>>>> It seems to me that the process of converting a row-oriented data source (row = set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must perform this kind of transformation is confronted with the following questions and problems:
> > >>>>>>>> - Understanding the Arrow API, which can be challenging for complex cases of rows representing complex objects (list of struct, struct of struct, ...).
> > >>>>>>>> - Deciding which Arrow schema(s) will correspond to your data source. In some complex cases it can be advantageous to translate the same row-oriented data source into several Arrow schemas (e.g.
> > >>>>>>>> OpenTelemetry data sources).
> > >>>>>>>> - Deciding on the encoding of the columns to make the most of the column-oriented format and thus increase the compression rate (e.g. defining the columns that should be represented as dictionaries).
> > >>>>>>>> From experience, I can attest that this process is usually iterative. For non-trivial data sources, arriving at the Arrow representation that offers the best compression ratio and is still perfectly usable and queryable is a long and tedious process.
> > >>>>>>>> I see two approaches to ease this process and consequently increase the adoption of Apache Arrow:
> > >>>>>>>> - Definition of a canonical in-memory row format specification that every row-oriented data source provider can progressively adopt to get an automatic translation into the Arrow format.
> > >>>>>>>> - Definition of an integration library allowing any row-oriented source to be mapped onto a generic row-oriented source understood by the converter. It is not about defining a unique in-memory format but more about defining a standard API to represent row-oriented data.
> > >>>>>>>> In my opinion these two approaches are complementary. The first option is a long-term approach targeting the data providers directly; it will require agreement on this generic row-oriented format and its adoption will take more or less time. The second approach does not directly require the collaboration of data source providers but allows an "integrator" to perform this transformation painlessly, with potentially several representation trials to achieve the best results in their context.
> > >>>>>>>> The current proposal is an implementation of the second approach, i.e. an API that maps a row-oriented source XYZ into an intermediate row-oriented representation understood mechanically by the translator. This translator also adds a series of optimizations to make the most of the Arrow format.
> > >>>>>>>> You can find multiple examples of such a transformation here:
> > >>>>>>>> - https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go - this example converts OTEL trace entities into their corresponding Arrow IR. At the end of this conversion the method returns a collection of Arrow Records.
> > >>>>>>>> - A more complex example can be found here: https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
> > >>>>>>>> In this example a stream of OTEL univariate row-oriented metrics is translated into multivariate row-oriented metrics and then automatically converted into Apache Arrow Records.
> > >>>>>>>> In these two examples, the creation of dictionaries and the multi-column sorting are automatically done by the framework, and the developer doesn't have to worry about the definition of Arrow schemas.
> > >>>>>>>> Now let's get to the answers.
> > >>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this problem particularly well. Parquet is a column-oriented data file format and doesn't really help to perform this transformation. The Python method is relatively limited and language-specific.
> > >>>>>>>> @Micah Kornfield, as described before, my goal is not to define a memory layout specification but more to define an API and a translation mechanism able to take this intermediate representation (a list of generic objects representing the entities to translate) and to convert it into one or more Arrow records.
> > >>>>>>>> @Wes McKinney, if I interpret your answer correctly, I think you are describing option 1 mentioned above. Like you, I think it is an interesting approach, although complementary to the one I propose.
> > >>>>>>>> Looking forward to your feedback.
> > >>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>>>>>>> We had an e-mail thread about this in 2018: https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
> > >>>>>>>>> I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of reinventing Avro.
> > >>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to build utilities to do these pivots, but it was decided that it didn't add much utility (it was added as a comprehensive example).
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Micah
> > >>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com> wrote:
> > >>>>>>>>>>> In the context of this OTEP <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md> (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my needs but could be used in many other contexts where there is a need to simplify the integration between a row-oriented source of data and Apache Arrow. The trade-off is having to perform the additional step of conversion to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format, and it allows you to potentially benefit from functionalities such as dictionary encoding "for free", the automatic generation of Arrow schemas, batching, multi-column sorting, etc.
> > >>>>>>>>>>> I know that JSON can be used as a kind of intermediate representation in the context of Arrow, with some language-specific implementations. Current JSON integrations are insufficient to cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string|binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The object of this proposal is to progressively cover these gaps.
> > >>>>>>>>>>> I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.
> > >>>>>>>>>>> 10K ft overview of the current implementation:
> > >>>>>>>>>>> 1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR).
> > >>>>>>>>>>> At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported types, and struct of any supported types. Additional Arrow types could be added progressively.
> > >>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository will first compute a schema signature and will route the record to a RecordBatcher based on this signature.
> > >>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, the "batcher" is able to build an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher is able to convert string columns to dictionaries based on a configuration. Another configuration allows evaluating which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
> > >>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to different techniques used (e.g. object reuse, dictionary reuse and sorting, ...).
> > >>>>>>>>>>> The current Go implementation <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this repo (see the pkg/air package). If the community is interested, I could do a PR in the Arrow Go and Rust sub-projects.
> > >>>>>>>> --
> > >>>>>>>> Laurent Quérel
> > >>>>>> --
> > >>>>>> Laurent Quérel
> > >>> --
> > >>> Laurent Quérel
> > > --
> > > Laurent Quérel
--
Laurent Quérel
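As a companion to step 2 of the 10K ft overview quoted above, here is a minimal, self-contained sketch of schema-signature routing: rows whose field names and types match are grouped together so that one Arrow record per schema can later be emitted. The Row type, signature function, and repository map are illustrative assumptions for this sketch only, not the actual pkg/air types.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Row is a toy stand-in for a record in the intermediate representation:
// field name -> value.
type Row map[string]interface{}

// signature derives a schema signature from a row by sorting the field
// names together with the Go type of their values, so rows with the same
// shape share the same signature regardless of field order.
func signature(r Row) string {
	parts := make([]string, 0, len(r))
	for name, v := range r {
		parts = append(parts, fmt.Sprintf("%s:%T", name, v))
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	// A toy repository: signature -> rows waiting to be batched.
	repo := map[string][]Row{}

	rows := []Row{
		{"name": "cpu.usage", "value": 0.42},
		{"name": "mem.used", "value": 0.77},
		{"name": "event", "attrs": []string{"a", "b"}},
	}

	// Route each row to the group matching its schema signature.
	for _, r := range rows {
		sig := signature(r)
		repo[sig] = append(repo[sig], r)
	}

	for sig, batch := range repo {
		fmt.Printf("%d row(s) for signature %q\n", len(batch), sig)
	}
}
```

Grouping by signature is what lets a batcher turn each group into a single columnar batch with a stable, automatically derived schema.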