Aw: Re: Human-readable version of Arrow Schema

hans-joachim . bothe Thu, 07 May 2020 02:47:02 -0700

Hi Chris,

Nice work. I am doing the same thing from the Python side and got a
similar result. The only differences are:
 - marking the JSON structure as a "schema"
 - using factory function names as "datatype" (see 
https://arrow.apache.org/docs/python/api/datatypes.html)
 - adding metadata
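
As a rough illustration of the second point, the sketch below maps the
"type_type"/"type" pair from the Flatbuffers JSON to the name of the matching
pyarrow factory function (int64, float64, list_, ...). The helper name
factory_name is hypothetical, and only a handful of types are covered; it is
not part of pyarrow or of the pull request.

```python
def factory_name(type_type, type_args):
    """Map a Flatbuffers-style (type_type, type) pair to a pyarrow factory name.

    Hypothetical helper for illustration; covers only a few types.
    """
    if type_type == "Int":
        prefix = "int" if type_args["is_signed"] else "uint"
        return "{}{}".format(prefix, type_args["bitWidth"])
    if type_type == "FloatingPoint":
        widths = {"HALF": 16, "SINGLE": 32, "DOUBLE": 64}
        return "float{}".format(widths[type_args["precision"]])
    if type_type == "Utf8":
        return "string"
    if type_type == "List":
        # pyarrow spells this factory with a trailing underscore.
        return "list_"
    raise ValueError("unmapped type: {}".format(type_type))


print(factory_name("Int", {"bitWidth": 64, "is_signed": True}))  # int64
print(factory_name("FloatingPoint", {"precision": "DOUBLE"}))    # float64
print(factory_name("List", {}))                                  # list_
```

Using one string per type keeps the "datatype" entry short, at the price of
needing a separate convention for parameterised types.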


I would be glad to help bring this nice idea to life. I just downloaded
your code and started playing with the C++ side to see the differences, and
I have already adopted your "children" idea, as you will see. I am looking
forward to a fruitful discussion. Here is my Python result in JSON:

{
        "schema": {
                "fields": [{
                                "name": "name",
                                "datatype": "string",
                                "nullable": false,
                                "metadata": {
                                        "m1": "meta 1",
                                        "m2": "meta 2",
                                        "m3": "meta 3"
                                },
                                "children": []
                        },
                        {
                                "name": "description",
                                "datatype": "string",
                                "nullable": true,
                                "metadata": {
                                        "m1": "meta 1",
                                        "m2": "meta 2",
                                        "m3": "meta 3"
                                },
                                "children": []
                        }
                ],
                "metadata": {
                        "m1": "meta 1",
                        "m2": "meta 2",
                        "m3": "meta 3"
                }
        }
}
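
To give an idea of how such a document could be consumed, here is a minimal
sketch that walks it with only the Python standard library and prints one
line per field, indenting children. The function names are hypothetical, and
the nested "tags" field with datatype "list_" is an assumption added only to
exercise "children"; it is not part of the output above.

```python
import json

# Two fields from the message above, plus a hypothetical nested "tags" field
# (an assumption, added only to exercise "children").
DOC = """
{"schema": {
  "fields": [
    {"name": "name", "datatype": "string", "nullable": false,
     "metadata": {"m1": "meta 1"}, "children": []},
    {"name": "description", "datatype": "string", "nullable": true,
     "metadata": {"m1": "meta 1"}, "children": []},
    {"name": "tags", "datatype": "list_", "nullable": true,
     "metadata": {}, "children": [
       {"name": "item", "datatype": "string", "nullable": true,
        "metadata": {}, "children": []}]}
  ],
  "metadata": {"m1": "meta 1"}
}}
"""


def describe_field(field, depth=0):
    """Return one summary line for the field, then one per nested child."""
    null = "nullable" if field["nullable"] else "required"
    line = "{}{}: {} ({})".format(
        "  " * depth, field["name"], field["datatype"], null)
    lines = [line]
    for child in field.get("children", []):
        lines.extend(describe_field(child, depth + 1))
    return lines


def describe_schema(text):
    """Parse a JSON document in the layout above and summarise its fields."""
    schema = json.loads(text)["schema"]
    lines = []
    for field in schema["fields"]:
        lines.extend(describe_field(field))
    return lines


print("\n".join(describe_schema(DOC)))
```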

Cheers,
Hans

> Sent: Tuesday, 05 May 2020 at 20:28
> From: "Christian Hudon" <chr...@elementai.com>
> To: "dev@arrow.apache.org" <dev@arrow.apache.org>
> Subject: Re: Human-readable version of Arrow Schema?
>
> Hi folks! I'm back.
> 
> Yes to François's comments. This has to be something that is readable by
> data scientists, researchers, etc. without having the doc side-by-side,
> which is definitely not the case for the C-interface representation.
> 
> I've created a draft pull request with code that's definitely not ready to
> be merged, but works enough to output a Flatbuffers JSON representation of
> an Arrow schema, so people can see what it would look like, experiment, etc.
> 
> As an example, the following Arrow schema:
> 
>   std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
>     arrow::field("id", arrow::int64()),
>     arrow::field("cost", arrow::float64()),
>     arrow::field("cost_components", arrow::list(arrow::float64()))};
>   auto schema = arrow::Schema(schema_vector);
> 
> translates to (with some reformatting to make things more compact):
> 
> {
>   fields: [
>     {name: "id", nullable: true, type_type: "Int",
>       type: {bitWidth: 64, is_signed: true},
>       children: []},
>     {name: "cost", nullable: true, type_type: "FloatingPoint",
>       type: {precision: "DOUBLE"},
>       children: []},
>     {name: "cost_components", nullable: true, type_type: "List", type: {},
>       children: [
>         {name: "item", nullable: true, type_type: "FloatingPoint",
>           type: {precision: "DOUBLE"},
>           children: []}
>       ]}
>   ]
> }
> 
> I can definitely see data scientists being able to understand that or make
> small changes without the doc, and even write one from scratch with some
> help from documentation. It could even be made more compact by making a few
> fields optional when empty (children, type).
> 
> If you want to try it out on other schemas, here's the pull request:
> https://github.com/apache/arrow/pull/7110
> 
> Thoughts?
> 
> 
> On Thu, Jan 9, 2020 at 08:47, Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
> 
> > The desired goal for this feature is trivial modifications, e.g.
> > within an editor, by data-scientists and researchers.
> >
> > I'd go for the flatbuffer's json representation as it is stable and
> > has native support in almost any language or editor due to the
> > ubiquity of JSON. The C interface schema string representation is
> > optimized for developers writing parser/codecs and looks like
> > gibberish to anyone not familiar with python's struct format string.
> >
> > François
> >
> >
> > On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai <kai...@heterodb.com> wrote:
> > >
> > > Hello,
> > >
> > > > pg2arrow [*1] has a '--dump' mode that prints the schema definition
> > > > of a given Apache Arrow file.
> > > > Would that fit your needs?
> > >
> > > $ ./pg2arrow --dump ~/hoge.arrow
> > > [Footer]
> > > {Footer: version=V4, schema={Schema: endianness=little,
> > > fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> > > children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> > > type={Decimal: precision=11, scale=7}, children=[],
> > > custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> > > children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> > > children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> > > type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> > > {Field: name="d", nullable=true, type={Utf8},
> > > dictionary={DictionaryEncoding: id=0, indexType={Int32},
> > > isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> > > nullable=true, type={Timestamp: unit=us}, children=[],
> > > custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> > > children=[], custom_metadata=[]}, {Field: name="random",
> > > nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> > > custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> > > FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> > > bodyLength=128}], recordBatches=[{Block: offset=1232,
> > > metaDataLength=648 bodyLength=386112}]}
> > > [Dictionary Batch 0]
> > > {Block: offset=920, metaDataLength=184 bodyLength=128}
> > > {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> > > length=6, nodes=[{FieldNode: length=6, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> > > {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> > > [Record Batch 0]
> > > {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> > > {Message: version=V4, body={RecordBatch: length=3000,
> > > nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> > > length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> > > {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> > > null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> > > length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> > > {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> > > null_count=0}, {FieldNode: length=3000, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> > > length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> > > offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> > > {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> > > length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> > > length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> > > offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> > > {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> > > length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> > > offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> > > {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> > > length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> > > offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> > > {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
> > >
> > > [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
> > >
> > > 2019年12月7日(土) 6:26 Christian Hudon <chr...@elementai.com>:
> > > >
> > > > Hi,
> > > >
> > > > > For the uses I would like to make of Arrow, I would need a
> > > > > human-readable and -writable version of an Arrow Schema, that could
> > > > > be converted to and from the Arrow Schema C++ object. Going through
> > > > > the doc for 0.15.1, I don't see anything to that effect, with the
> > > > > closest being the ToString() method on DataType instances, but
> > > > > which is meant for debugging only. (I need an expression of an
> > > > > Arrow Schema that people can read, and that can live outside of the
> > > > > code for a particular operation.)
> > > > >
> > > > > Is a text representation of an Arrow Schema something that is being
> > > > > worked on now? If not, would you folks be interested in me putting
> > > > > up an initial proposal for discussion? Any design constraints I
> > > > > should pay attention to, then?
> > > >
> > > > Thanks,
> > > >
> > > >   Christian
> > > > --
> > > >
> > > >
> > > > │ Christian Hudon
> > > >
> > > > │ Applied Research Scientist
> > > >
> > > >    Element AI, 6650 Saint-Urbain #500
> > > >
> > > >    Montréal, QC, H2S 3G9, Canada
> > > >    Elementai.com
> > >
> > >
> > >
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <kai...@heterodb.com>
> >
> 
> 
> -- 
> 
> 
> │ Christian Hudon
> 
> │ Applied Research Scientist
> 
>    Element AI, 6650 Saint-Urbain #500
> 
>    Montréal, QC, H2S 3G9, Canada
>    Elementai.com
>
