Hi everyone, I just wanted to chime in that we already have a form of row-oriented storage in `arrow/compute/row/row_internal.h`. It is used to store rows inside GroupBy and Join within Acero. We also have utilities for converting to/from columnar storage (including AVX2 implementations of these conversions) in `arrow/compute/row/encode_internal.h`. Would it be useful to standardize this row-oriented format?
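(To make the row-vs-columnar pivot concrete for anyone less familiar with it, here is a rough user-level Go sketch that turns a slice of fixed-width row structs into a columnar record using the arrow-go builder APIs. It has nothing to do with the Acero-internal layout in `row_internal.h`; the `sensorRow` type, its field names, and the module version are made up for illustration.)

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// sensorRow stands in for a user's fixed-width "array of structs" row type.
// The Tag field is the variable-width part that needs extra handling.
type sensorRow struct {
	ID    int64
	Value float64
	Tag   string
}

// rowsToRecord pivots a slice of rows into a columnar Arrow record.
func rowsToRecord(rows []sensorRow) arrow.Record {
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
		{Name: "tag", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer b.Release()

	// One pass over the rows, appending each field to its column builder.
	for _, r := range rows {
		b.Field(0).(*array.Int64Builder).Append(r.ID)
		b.Field(1).(*array.Float64Builder).Append(r.Value)
		b.Field(2).(*array.StringBuilder).Append(r.Tag)
	}
	return b.NewRecord() // caller is responsible for Release()
}

func main() {
	rec := rowsToRecord([]sensorRow{{1, 0.5, "a"}, {2, 1.5, "b"}})
	defer rec.Release()
	fmt.Println(rec.NumRows(), "rows,", rec.NumCols(), "columns")
}
```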
As far as I understand, fixed-width rows would be trivially convertible into that internal row format (just a pointer to your array of structs), while variable-width rows would need a little bit of massaging (though not too much) to be put into it. Sasha Krassovsky > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com> wrote: > > Thank you Micah for a very clear summary of the intent behind this > proposal. Indeed, I think that clarifying from the beginning that this > approach aims at facilitating experimentation rather than raw performance > of the transformation phase would have helped people better > understand my objective. > > Regarding your question, I don't think there is a specific technical reason > for such an integration in the core library. I was just thinking that it > would make this infrastructure easier for users to find and that this > topic was general enough to find its place in the standard library. > > Best, > Laurent > > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Hi Laurent, >> I'm retitling this thread to include the specific languages you seem to be >> targeting in the subject line to hopefully get more eyes from maintainers >> in those languages. >> >> Thanks for clarifying the goals. If I can restate my understanding, the >> intended use case here is to provide easy (from the developer's point of >> view) adaptation of row-based formats to Arrow. The means of achieving >> this is creating an API for a row-based structure, and having utility >> classes that can manipulate the interface to build up batches (there is no >> serialization or in-memory spec associated with this API). People wishing >> to integrate a specific row-based format can extend that API at whatever >> level makes sense for the format. >> >> I think this would be useful infrastructure as long as it was made clear >> that in many cases this wouldn't be the most efficient way to convert to >> Arrow from other formats. >> >> I don't work much with either the Rust or Go implementation, so I can't >> speak to whether there is maintainer support for incorporating the changes >> directly in Arrow. Are there any technical reasons for preferring to have >> this included directly in Arrow vs. a separate library? >> >> Cheers, >> Micah >> >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <laurent.que...@gmail.com> >> wrote: >> >>> Far be it from me to think that I know more than Jorge or Wes on this >>> subject. Sorry if my post gives that perception; that is clearly not my >>> intention. I'm just trying to defend the idea that when designing this >> kind >>> of transformation, it might be interesting to have a library to test >>> several mappings and evaluate them before doing a more direct >>> implementation if the performance is not there. >>> >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett < >>> benjaminblodg...@gmail.com> wrote: >>> >>>> He was trying to nicely say he knows way more than you, and your ideas >>>> will result in a low-performance scheme no one will use in production >>>> AI/machine learning. >>>> >>>> Sent from my iPhone >>>> >>>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett < >>>> benjaminblodg...@gmail.com> wrote: >>>>> >>>>> I think Jorge’s opinion is that of an expert and his being >> humble >>>> is just him being tactful. 
Probably listen to Jorge on performance and >>>> architecture, even over Wes as he’s contributed more than anyone else >> and >>>> know the bleeding edge of low level performance stuff more than anyone. >>>>> >>>>> Sent from my iPhone >>>>> >>>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel < >>> laurent.que...@gmail.com> >>>> wrote: >>>>>> >>>>>> Hi Jorge >>>>>> >>>>>> I don't think that the level of in-depth knowledge needed is the >> same >>>>>> between using a row-oriented internal representation and "Arrow" >> which >>>> not >>>>>> only changes the organization of the data but also introduces a set >> of >>>>>> additional mapping choices and concepts. >>>>>> >>>>>> For example, assuming that the initial row-oriented data source is a >>>> stream >>>>>> of nested assembly of structures, lists and maps. The mapping of >> such >>> a >>>>>> stream to Protobuf, JSON, YAML, ... is straightforward because on >> both >>>>>> sides the logical representation is exactly the same, the schema is >>>>>> sometimes optional, the interest of building batches is optional, >> ... >>> In >>>>>> the case of "Arrow" things are different - the schema and the >> batching >>>> are >>>>>> mandatory. The mapping is not necessarily direct and will generally >> be >>>> the >>>>>> result of the combination of several trade-offs (normalization vs >>>>>> denormalization representation, mapping influencing the compression >>>> rate, >>>>>> queryability with Arrow processors like DataFusion, ...). Note that >>>> some of >>>>>> these complexities are not intrinsically linked to the fact that the >>>> target >>>>>> format is column oriented. The ZST format ( >>>>>> https://zed.brimdata.io/docs/formats/zst/) for example does not >>>> require an >>>>>> explicit schema definition. >>>>>> >>>>>> IMHO, having a library that allows you to easily experiment with >>>> different >>>>>> types of mapping (without having to worry about batching, >>> dictionaries, >>>>>> schema definition, understanding how lists of structs are >> represented, >>>> ...) >>>>>> and to evaluate the results according to your specific goals has a >>> value >>>>>> (especially if your criteria are compression ratio and >> queryability). >>> Of >>>>>> course there is an overhead to such an approach. In some cases, at >> the >>>> end >>>>>> of the process, it will be necessary to manually perform this direct >>>>>> transformation between a row-oriented XYZ format and "Arrow". >> However, >>>> this >>>>>> effort will be done after a simple experimentation phase to avoid >>>> changes >>>>>> in the implementation of the converter which in my opinion is not so >>>> simple >>>>>> to implement with the current Arrow API. >>>>>> >>>>>> If the Arrow developer community is not interested in integrating >> this >>>>>> proposal, I plan to release two independent libraries (Go and Rust) >>> that >>>>>> can be used on top of the standard "Arrow" libraries. This will have >>> the >>>>>> advantage to evaluate if such an approach is able to raise interest >>>> among >>>>>> Arrow users. >>>>>> >>>>>> Best, >>>>>> >>>>>> Laurent >>>>>> >>>>>> >>>>>> >>>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão < >>>>>>> jorgecarlei...@gmail.com> wrote: >>>>>>> >>>>>>> Hi Laurent, >>>>>>> >>>>>>> I agree that there is a common pattern in converting row-based >>> formats >>>> to >>>>>>> Arrow. 
>>>>>>> >>>>>>> Imho the difficult part is not to map the storage format to Arrow >>>>>>> specifically - it is to map the storage format to any in-memory >> (row- >>>> or >>>>>>> columnar- based) format, since it requires in-depth knowledge about >>>> the 2 >>>>>>> formats (the source format and the target format). >>>>>>> >>>>>>> - Understanding the Arrow API which can be challenging for complex >>>> cases of >>>>>>>> rows representing complex objects (list of struct, struct of >> struct, >>>>>>> ...). >>>>>>>> >>>>>>> >>>>>>> the developer would have the same problem - just shifted around - >>> they >>>> now >>>>>>> need to convert their complex objects to the intermediary >>>> representation. >>>>>>> Whether it is more "difficult" or "complex" to learn than Arrow is >> an >>>> open >>>>>>> question, but we would essentially be shifting the problem from >>>> "learning >>>>>>> Arrow" to "learning the Intermediate in-memory". >>>>>>> >>>>>>> @Micah Kornfield, as described before my goal is not to define a >>> memory >>>>>>>> layout specification but more to define an API and a translation >>>>>>> mechanism >>>>>>>> able to take this intermediate representation (list of generic >>> objects >>>>>>>> representing the entities to translate) and to convert it into one >>> or >>>>>>> more >>>>>>>> Arrow records. >>>>>>>> >>>>>>> >>>>>>> imho a spec of "list of generic objects representing the entities" >> is >>>>>>> specified by an in-memory format (not by an API spec). >>>>>>> >>>>>>> A second challenge I anticipate is that in-memory formats >> inneerently >>>> "own" >>>>>>> the memory they outline (since by definition they outline how this >>>> memory >>>>>>> is outlined). An Intermediate in-memory representation would be no >>>>>>> different. Since row-based formats usually require at least one >>>> allocation >>>>>>> per row (and often more for variable-length types), the >>> transformation >>>>>>> (storage format -> row-based in-memory format -> Arrow) incurs a >>>>>>> significant cost (~2x slower last time I played with this problem >> in >>>> JSON >>>>>>> [1]). >>>>>>> >>>>>>> A third challenge I anticipate is that given that we have 10+ >>>> languages, we >>>>>>> would eventually need to convert the intermediary representation >>> across >>>>>>> languages, which imo just hints that we would need to formalize an >>>> agnostic >>>>>>> spec for such representation (so languages agree on its >>>> representation), >>>>>>> and thus essentially declare a new (row-based) format. >>>>>>> >>>>>>> (none of this precludes efforts to invent an in-memory row format >> for >>>>>>> analytics workloads) >>>>>>> >>>>>>> @Wes McKinney <wesmck...@gmail.com> >>>>>>> >>>>>>> I still think having a canonical in-memory row format (and >> libraries >>>>>>>> to transform to and from Arrow columnar format) is a good idea — >> but >>>>>>>> there is the risk of ending up in the tar pit of reinventing Avro. >>>>>>>> >>>>>>> >>>>>>> afaik Avro does not have O(1) random access neither to its rows nor >>>> columns >>>>>>> - records are concatenated back to back, every record's column is >>>>>>> concatenated back to back within a record, and there is no indexing >>>>>>> information on how to access a particular row or column. There are >>>> blocks >>>>>>> of rows that reduce the cost of accessing large offsets, but imo it >>> is >>>> far >>>>>>> from the O(1) offered by Arrow (and expected by analytics >> workloads). 
>>>>>>> >>>>>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024 >>>>>>> >>>>>>> Best, >>>>>>> Jorge >>>>>>> >>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel < >>>> laurent.que...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Let me clarify the proposal a bit before replying to the various >>>> previous >>>>>>>> feedbacks. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> It seems to me that the process of converting a row-oriented data >>>> source >>>>>>>> (row = set of fields or something more hierarchical) into an Arrow >>>> record >>>>>>>> repeatedly raises the same challenges. A developer who must >> perform >>>> this >>>>>>>> kind of transformation is confronted with the following questions >>> and >>>>>>>> problems: >>>>>>>> >>>>>>>> - Understanding the Arrow API which can be challenging for complex >>>> cases >>>>>>> of >>>>>>>> rows representing complex objects (list of struct, struct of >> struct, >>>>>>> ...). >>>>>>>> >>>>>>>> - Decide which Arrow schema(s) will correspond to your data >> source. >>> In >>>>>>> some >>>>>>>> complex cases it can be advantageous to translate the same >>>> row-oriented >>>>>>>> data source into several Arrow schemas (e.g. OpenTelementry data >>>>>>> sources). >>>>>>>> >>>>>>>> - Decide on the encoding of the columns to make the most of the >>>>>>>> column-oriented format and thus increase the compression rate >> (e.g. >>>>>>> define >>>>>>>> the columns that should be represent as dictionaries). >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> By experience, I can attest that this process is usually >> iterative. >>>> For >>>>>>>> non-trivial data sources, arriving at the arrow representation >> that >>>>>>> offers >>>>>>>> the best compression ratio and is still perfectly usable and >>> queryable >>>>>>> is a >>>>>>>> long and tedious process. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I see two approaches to ease this process and consequently >> increase >>>> the >>>>>>>> adoption of Apache Arrow: >>>>>>>> >>>>>>>> - Definition of a canonical in-memory row format specification >> that >>>> every >>>>>>>> row-oriented data source provider can progressively adopt to get >> an >>>>>>>> automatic translation into the Arrow format. >>>>>>>> >>>>>>>> - Definition of an integration library allowing to map any >>>> row-oriented >>>>>>>> source into a generic row-oriented source understood by the >>>> converter. It >>>>>>>> is not about defining a unique in-memory format but more about >>>> defining a >>>>>>>> standard API to represent row-oriented data. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In my opinion these two approaches are complementary. The first >>> option >>>>>>> is a >>>>>>>> long-term approach targeting directly the data providers, which >> will >>>>>>>> require to agree on this generic row-oriented format and whose >>>> adoption >>>>>>>> will be more or less long. The second approach does not directly >>>> require >>>>>>>> the collaboration of data source providers but allows an >>> "integrator" >>>> to >>>>>>>> perform this transformation painlessly with potentially several >>>>>>>> representation trials to achieve the best results in his context. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The current proposal is an implementation of the second approach, >>>> i.e. an >>>>>>>> API that maps a row-oriented source XYZ into an intermediate >>>> row-oriented >>>>>>>> representation understood mechanically by the translator. This >>>> translator >>>>>>>> also adds a series of optimizations to make the most of the Arrow >>>> format. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> You can find multiple examples of a such transformation in the >>>> following >>>>>>>> examples: >>>>>>>> >>>>>>>> - >>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go >>>>>>>> this example converts OTEL trace entities into their >> corresponding >>>>>>> Arrow >>>>>>>> IR. At the end of this conversion the method returns a collection >>> of >>>>>>>> Arrow >>>>>>>> Records. >>>>>>>> - A more complex example can be found here >>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go >>>>>>>> . >>>>>>>> In this example a stream of OTEL univariate row-oriented metrics >>> are >>>>>>>> translate into multivariate row-oriented metrics and then >>>>>>> automatically >>>>>>>> translated into Apache Records. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In these two examples, the creation of dictionaries and >> multi-column >>>>>>>> sorting is automatically done by the framework and the developer >>>> doesn’t >>>>>>>> have to worry about the definition of Arrow schemas. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Now let's get to the answers. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this >>> problem >>>>>>>> particularly well. Parquet is a column-oriented data file format >> and >>>>>>>> doesn't really help to perform this transformation. The Python >>> method >>>> is >>>>>>>> relatively limited and language specific. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @Micah Kornfield, as described before my goal is not to define a >>>> memory >>>>>>>> layout specification but more to define an API and a translation >>>>>>> mechanism >>>>>>>> able to take this intermediate representation (list of generic >>> objects >>>>>>>> representing the entities to translate) and to convert it into one >>> or >>>>>>> more >>>>>>>> Arrow records. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> @Wes McKinney, If I interpret your answer correctly, I think you >> are >>>>>>>> describing the option 1 mentioned above. Like you I think it is an >>>>>>>> interesting approach although complementary to the one I propose. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Looking forward to your feedback. >>>>>>>> >>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com >>> >>>>>>> wrote: >>>>>>>> >>>>>>>>> We had an e-mail thread about this in 2018 >>>>>>>>> >>>>>>>>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm >>>>>>>>> >>>>>>>>> I still think having a canonical in-memory row format (and >>> libraries >>>>>>>>> to transform to and from Arrow columnar format) is a good idea — >>> but >>>>>>>>> there is the risk of ending up in the tar pit of reinventing >> Avro. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield < >>>> emkornfi...@gmail.com >>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate >>>>>>>>>> Representation (AIR)" is? We've talked about in the past maybe >>>>>>> having >>>>>>>> a >>>>>>>>>> memory layout specification for row-based data as well as column >>>>>>> based >>>>>>>>>> data. There was also a recent attempt at least in C++ to try to >>>>>>> build >>>>>>>>>> utilities to do these pivots but it was decided that it didn't >> add >>>>>>> much >>>>>>>>>> utility (it was added a comprehensive example). 
>>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Micah >>>>>>>>>> >>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel < >>>>>>>> laurent.que...@gmail.com >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> In the context of this OTEP >>>>>>>>>>> < >>>>>>>>> >>>>>>>> >>>>>>> >>>> >>> >> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md >>>>>>>>>>>> >>>>>>>>>>> (OpenTelemetry >>>>>>>>>>> Enhancement Proposal) I developed an integration layer on top >> of >>>>>>>> Apache >>>>>>>>>>> Arrow (Go and Rust) to *facilitate the translation of >> row-oriented >>>>>>>> data >>>>>>>>>>> streams into an Arrow-based columnar representation*. In this >>>>>>>> particular >>>>>>>>>>> case the goal was to translate all OpenTelemetry entities >>> (metrics, >>>>>>>>> logs, >>>>>>>>>>> or traces) into Apache Arrow records. These entities can be >> quite >>>>>>>>> complex >>>>>>>>>>> and their corresponding Arrow schema must be defined on the >> fly. >>>>>>> IMO, >>>>>>>>> this >>>>>>>>>>> approach is not specific to my needs but could be used >>> in >>>>>>>> many >>>>>>>>>>> other contexts where there is a need to simplify the >> integration >>>>>>>>> between a >>>>>>>>>>> row-oriented source of data and Apache Arrow. The trade-off is >>>>>>>> having >>>>>>>>> to >>>>>>>>>>> perform the additional step of conversion to the intermediate >>>>>>>>>>> representation, but this transformation does not require >>>>>>>> understanding >>>>>>>>> the >>>>>>>>>>> arcana of the Arrow format and potentially provides >>>>>>>>>>> functionality such as dictionary encoding "for >>> free", >>>>>>>> the >>>>>>>>>>> automatic generation of Arrow schemas, batching, >>>>>>> multi-column >>>>>>>>>>> sorting, etc. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I know that JSON can be used as a kind of intermediate >>>>>>> representation >>>>>>>>> in >>>>>>>>>>> the context of Arrow with some language-specific >> implementations. >>>>>>>>> Current >>>>>>>>>>> JSON integrations are insufficient to cover the most complex >>>>>>>> scenarios >>>>>>>>> and >>>>>>>>>>> are not standardized; e.g. support for most of the Arrow data >>> types, >>>>>>>>> various >>>>>>>>>>> optimizations (string|binary dictionaries, multi-column >> sorting), >>>>>>>>> batching, >>>>>>>>>>> integration with Arrow IPC, compression ratio optimization, ... >>> The >>>>>>>>> objective >>>>>>>>>>> of this proposal is to progressively cover these gaps. >>>>>>>>>>> >>>>>>>>>>> I am looking to see if the community would be interested in >> such >>> a >>>>>>>>>>> contribution. Below are some additional details on the current >>>>>>>>>>> implementation. All feedback is welcome. >>>>>>>>>>> >>>>>>>>>>> 10K ft overview of the current implementation: >>>>>>>>>>> >>>>>>>>>>> 1. Developers convert their row-oriented stream into records >>>>>>> based >>>>>>>>> on >>>>>>>>>>> the Arrow Intermediate Representation (AIR). At this stage the >>>>>>>>>>> translation >>>>>>>>>>> can be quite mechanical, but if needed developers can decide, >> for >>>>>>>>> example, >>>>>>>>>>> to >>>>>>>>>>> translate a map into a struct if that makes sense for them. >> The >>>>>>>>> current >>>>>>>>>>> implementation supports the following Arrow data types: bool, >> all >>>>>>>>> uints, >>>>>>>>>>> all >>>>>>>>>>> ints, all floats, string, binary, list of any supported types, >>>>>>> and >>>>>>>>>>> struct >>>>>>>>>>> of any supported types. Additional Arrow types could be added >>>>>>>>>>> progressively. >>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to >> a >>>>>>>>>>> RecordRepository. This repository will first compute a schema >>>>>>>>> signature >>>>>>>>>>> and >>>>>>>>>>> will route the record to a RecordBatcher based on this >>>>>>> signature. >>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the >>>>>>>>> compatible >>>>>>>>>>> AIR records and, upon request, the "batcher" is able to build >> an >>>>>>>>> Arrow >>>>>>>>>>> Record representing a batch of compatible inputs. In the >> current >>>>>>>>>>> implementation, the batcher is able to convert string columns >> to >>>>>>>>>>> dictionaries >>>>>>>>>>> based on a configuration. Another configuration allows >>>>>>> evaluating >>>>>>>>> which >>>>>>>>>>> columns should be sorted to optimize the compression ratio. >> The >>>>>>>> same >>>>>>>>>>> optimization process could be applied to binary columns. >>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same >>>>>>> RecordRepository >>>>>>>>>>> instance to build new sets of Arrow record batches. Subsequent >>>>>>>>>>> iterations >>>>>>>>>>> will be slightly faster due to different techniques used (e.g. >>>>>>>>> object >>>>>>>>>>> reuse, dictionary reuse and sorting, ...) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The current Go implementation >>>>>>>>>>> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is >>> currently >>>>>>>>> part of >>>>>>>>>>> this repo (see the pkg/air package). If the community is >> interested, >>> I >>>>>>>>> could do >>>>>>>>>>> a PR in the Arrow Go and Rust sub-projects. >>>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Laurent Quérel >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Laurent Quérel >>>> >>> >>> >>> -- >>> Laurent Quérel >>> >> > > > -- > Laurent Quérel
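
P.S. For anyone skimming the quoted proposal above, here is a small, self-contained Go sketch of the repository/batcher flow described in steps 1 through 4, as I understand it from the description. All of the names (`airRecord`, `recordRepository`, `AddRecord`, `BuildRecords`) and the schema-signature logic are invented for illustration and are not the actual pkg/air API; real AIR records, dictionary encoding, and multi-column sorting are much richer than this. The point is just the shape of the flow: route by schema signature, accumulate, then pivot each group into an Arrow record.

```go
package main

import (
	"fmt"
	"sort"
	"strings"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// airRecord is a toy stand-in for an AIR record. Only int64 and string
// fields are handled in this sketch.
type airRecord map[string]interface{}

// schemaSignature derives a canonical signature from field names and types,
// so records with the same shape are routed to the same batcher (step 2).
func schemaSignature(r airRecord) string {
	parts := make([]string, 0, len(r))
	for name, v := range r {
		parts = append(parts, fmt.Sprintf("%s:%T", name, v))
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// recordBatcher accumulates compatible AIR records (step 3).
type recordBatcher struct{ rows []airRecord }

// recordRepository routes incoming AIR records to a batcher keyed by their
// schema signature (step 2).
type recordRepository struct{ batchers map[string]*recordBatcher }

func newRecordRepository() *recordRepository {
	return &recordRepository{batchers: map[string]*recordBatcher{}}
}

func (rr *recordRepository) AddRecord(r airRecord) {
	sig := schemaSignature(r)
	if rr.batchers[sig] == nil {
		rr.batchers[sig] = &recordBatcher{}
	}
	rr.batchers[sig].rows = append(rr.batchers[sig].rows, r)
}

// BuildRecords pivots each batcher's rows into one Arrow record. Dictionary
// encoding and multi-column sorting (step 3) are left out of this sketch.
func (rr *recordRepository) BuildRecords() []arrow.Record {
	var out []arrow.Record
	for _, b := range rr.batchers {
		if len(b.rows) == 0 {
			continue
		}
		// Derive the Arrow schema from the first row of the batch.
		var names []string
		for name := range b.rows[0] {
			names = append(names, name)
		}
		sort.Strings(names)
		fields := make([]arrow.Field, len(names))
		for i, name := range names {
			switch b.rows[0][name].(type) {
			case int64:
				fields[i] = arrow.Field{Name: name, Type: arrow.PrimitiveTypes.Int64}
			case string:
				fields[i] = arrow.Field{Name: name, Type: arrow.BinaryTypes.String}
			default:
				panic("unsupported field type in this sketch")
			}
		}
		rb := array.NewRecordBuilder(memory.NewGoAllocator(), arrow.NewSchema(fields, nil))
		for _, row := range b.rows {
			for i, name := range names {
				switch v := row[name].(type) {
				case int64:
					rb.Field(i).(*array.Int64Builder).Append(v)
				case string:
					rb.Field(i).(*array.StringBuilder).Append(v)
				}
			}
		}
		out = append(out, rb.NewRecord())
		rb.Release()
		b.rows = b.rows[:0]
	}
	return out
}

func main() {
	repo := newRecordRepository()
	repo.AddRecord(airRecord{"name": "span-a", "duration_ms": int64(12)})
	repo.AddRecord(airRecord{"name": "span-b", "duration_ms": int64(7)})
	for _, rec := range repo.BuildRecords() {
		fmt.Println(rec.NumRows(), "rows, schema:", rec.Schema())
		rec.Release()
	}
}
```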