Re: Human-readable version of Arrow Schema?

Christian Hudon Tue, 05 May 2020 11:28:54 -0700

Hi folks! I'm back.

Yes to François's comments. This has to be something that is readable by
data scientists, researchers, etc. without having the doc side-by-side,
which is definitely not the case for the C-interface representation.


I've created a draft pull request with code that's definitely not ready to
be merged, but works enough to output a Flatbuffers JSON representation of
an Arrow schema, so people can see what it would look like, experiment, etc.

As an example, the following Arrow schema:

  std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
    arrow::field("id", arrow::int64()),
    arrow::field("cost", arrow::float64()),
    arrow::field("cost_components", arrow::list(arrow::float64()))};
  auto schema = arrow::Schema(schema_vector);

translates to (with some reformatting to make things more compact):

{
  fields: [
    {name: "id", nullable: true, type_type: "Int",
      type: {bitWidth: 64, is_signed: true},
      children: []},
    {name: "cost", nullable: true, type_type: "FloatingPoint",
      type: {precision: "DOUBLE"},
      children: []},
    {name: "cost_components", nullable: true, type_type: "List", type: {},
      children: [
        {name: "item", nullable: true, type_type: "FloatingPoint",
          type: {precision: "DOUBLE"},
          children: []}
      ]}
  ]
}

I can definitely see data scientists being able to understand that or make
small changes without the doc, and even write one from scratch with some
help from documentation. It could even be made more compact by making a few
fields optional when empty (children, type).
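
To give a rough idea of what the more compact form could look like, here is a
minimal sketch that writes a field and leaves out "type" and "children" when
they are empty. It does not use the Arrow or Flatbuffers APIs and is not the
code from the pull request: FieldSketch, WriteCompact and ToCompact are
hypothetical names invented for this illustration, and the type contents are
kept as a plain string to keep the example short.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the "fields" array.
// This is NOT an Arrow or Flatbuffers type.
struct FieldSketch {
  std::string name;
  bool nullable;
  std::string type_type;              // e.g. "Int", "FloatingPoint", "List"
  std::string type_body;              // contents of the type object; empty means {}
  std::vector<FieldSketch> children;  // empty for non-nested types
};

// Write one field in the same key style as the output above,
// skipping "type" and "children" when they carry no information.
void WriteCompact(const FieldSketch& f, std::ostringstream& out) {
  out << "{name: \"" << f.name << "\""
      << ", nullable: " << (f.nullable ? "true" : "false")
      << ", type_type: \"" << f.type_type << "\"";
  if (!f.type_body.empty()) {
    out << ", type: {" << f.type_body << "}";
  }
  if (!f.children.empty()) {
    out << ", children: [";
    for (std::size_t i = 0; i < f.children.size(); ++i) {
      if (i > 0) out << ", ";
      WriteCompact(f.children[i], out);
    }
    out << "]";
  }
  out << "}";
}

// Convenience wrapper returning the compact text for a single field.
std::string ToCompact(const FieldSketch& f) {
  std::ostringstream out;
  WriteCompact(f, out);
  return out.str();
}
```

With this sketch, the "id" field keeps its type object but drops the empty
children list, and the "cost_components" list field drops its empty type
object while keeping its single "item" child.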

If you want to try it out on other schemas, here's the pull request:
https://github.com/apache/arrow/pull/7110

Thoughts?


On Thu, Jan 9, 2020 at 8:47 AM, Francois Saint-Jacques <
[email protected]> wrote:

> The desired goal for this feature is trivial modifications, e.g.
> within an editor, by data-scientists and researchers.
>
> I'd go for the flatbuffer's json representation as it is stable and
> has native support in almost any language or editor due to the
> ubiquity of JSON. The C interface schema string representation is
> optimized for developers writing parser/codecs and looks like
> gibberish to anyone not familiar with python's struct format string.
>
> François
>
>
> On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai <[email protected]> wrote:
> >
> > Hello,
> >
> > pg2arrow [*1] has '--dump' mode to print out schema definition of the
> > given Apache Arrow file.
> > Does it make sense for you?
> >
> > $ ./pg2arrow --dump ~/hoge.arrow
> > [Footer]
> > {Footer: version=V4, schema={Schema: endianness=little,
> > fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> > custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> > children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> > type={Decimal: precision=11, scale=7}, children=[],
> > custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> > children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> > custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> > children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> > type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> > {Field: name="d", nullable=true, type={Utf8},
> > dictionary={DictionaryEncoding: id=0, indexType={Int32},
> > isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> > nullable=true, type={Timestamp: unit=us}, children=[],
> > custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> > children=[], custom_metadata=[]}, {Field: name="random",
> > nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> > custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> > FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> > bodyLength=128}], recordBatches=[{Block: offset=1232,
> > metaDataLength=648 bodyLength=386112}]}
> > [Dictionary Batch 0]
> > {Block: offset=920, metaDataLength=184 bodyLength=128}
> > {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> > length=6, nodes=[{FieldNode: length=6, null_count=0}],
> > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> > {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> > [Record Batch 0]
> > {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> > {Message: version=V4, body={RecordBatch: length=3000,
> > nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> > length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> > {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> > null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> > length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> > {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> > null_count=0}, {FieldNode: length=3000, null_count=0}],
> > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> > length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> > offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> > {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> > length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> > length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> > offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> > {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> > length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> > offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> > {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> > length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> > offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> > {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
> >
> > [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
> >
> > On Sat, Dec 7, 2019 at 6:26, Christian Hudon <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > For the uses I would like to make of Arrow, I would need a human-readable
> > > and -writable version of an Arrow Schema, that could be converted to and
> > > from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> > > see anything to that effect, with the closest being the ToString() method
> > > on DataType instances, but which is meant for debugging only. (I need an
> > > expression of an Arrow Schema that people can read, and that can live
> > > outside of the code for a particular operation.)
> > >
> > > Is a text representation of an Arrow Schema something that is being worked
> > > on now? If not, would you folks be interested in me putting up an initial
> > > proposal for discussion? Any design constraints I should pay attention to,
> > > then?
> > >
> > > Thanks,
> > >
> > >   Christian
> > > --
> > >
> > >
> > > │ Christian Hudon
> > >
> > > │ Applied Research Scientist
> > >
> > >    Element AI, 6650 Saint-Urbain #500
> > >
> > >    Montréal, QC, H2S 3G9, Canada
> > >    Elementai.com
> >
> >
> >
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei <[email protected]>
>


-- 


│ Christian Hudon

│ Applied Research Scientist

   Element AI, 6650 Saint-Urbain #500

   Montréal, QC, H2S 3G9, Canada
   Elementai.com
