Hi Chris,

nice work. I am actually doing the same thing from the Python side and got a similar result. Only differences are:

- marking the JSON structure as a "schema"
- using factory function names as "datatype" (see https://arrow.apache.org/docs/python/api/datatypes.html)
- adding metadata
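To make the three differences concrete, here is a minimal sketch of how such a JSON document could be assembled. It is not the actual code behind the result in this message: the helper names `field_to_dict`, `schema_to_json` and `json_to_pyarrow` are hypothetical, and the pyarrow round trip assumes that every "datatype" value names a parameterless factory function such as `pyarrow.string()`, so nested or parameterized types are out of scope here.

```python
import json


def field_to_dict(name, datatype, nullable=True, metadata=None, children=None):
    """Describe one field with the key names used in this message (hypothetical helper)."""
    return {
        "name": name,
        "datatype": datatype,  # factory function name, e.g. "string" for pyarrow.string()
        "nullable": nullable,
        "metadata": dict(metadata or {}),
        "children": list(children or []),
    }


def schema_to_json(fields, metadata=None):
    """Wrap the field descriptions in a top-level "schema" object and serialize it."""
    doc = {"schema": {"fields": list(fields), "metadata": dict(metadata or {})}}
    return json.dumps(doc, indent=2)


def json_to_pyarrow(text):
    """Rebuild a pyarrow.Schema from the JSON text.

    Assumption: each "datatype" is the name of a parameterless pyarrow factory.
    """
    import pyarrow as pa  # imported lazily so the rest runs without pyarrow installed

    spec = json.loads(text)["schema"]
    fields = [
        pa.field(
            f["name"],
            getattr(pa, f["datatype"])(),
            nullable=f["nullable"],
            metadata=f["metadata"] or None,
        )
        for f in spec["fields"]
    ]
    return pa.schema(fields, metadata=spec["metadata"] or None)


meta = {"m1": "meta 1", "m2": "meta 2", "m3": "meta 3"}
text = schema_to_json(
    [
        field_to_dict("name", "string", nullable=False, metadata=meta),
        field_to_dict("description", "string", nullable=True, metadata=meta),
    ],
    metadata=meta,
)
print(text)
```

Keeping "datatype" as a plain factory name keeps the file easy to edit by hand, at the cost of needing an extra convention for types that take arguments, which is one place where the "children" idea could come in.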

I would be glad to help bring this nice idea to real life. I just downloaded your code and started playing with the C side to see the differences, and have already adopted your "children" idea, as you will see. I am looking forward to a fruitful discussion.

Here is my Python result in JSON:

{
  "schema": {
    "fields": [
      {
        "name": "name",
        "datatype": "string",
        "nullable": false,
        "metadata": {
          "m1": "meta 1",
          "m2": "meta 2",
          "m3": "meta 3"
        },
        "children": []
      },
      {
        "name": "description",
        "datatype": "string",
        "nullable": true,
        "metadata": {
          "m1": "meta 1",
          "m2": "meta 2",
          "m3": "meta 3"
        },
        "children": []
      }
    ],
    "metadata": {
      "m1": "meta 1",
      "m2": "meta 2",
      "m3": "meta 3"
    }
  }
}

Cheers,
Hans

> Sent: Tuesday, 05 May 2020 at 20:28
> From: "Christian Hudon" <chr...@elementai.com>
> To: "dev@arrow.apache.org" <dev@arrow.apache.org>
> Subject: Re: Human-readable version of Arrow Schema?
>
> Hi folks! I'm back.
>
> Yes to François's comments. This has to be something that is readable by
> data scientists, researchers, etc. without having the doc side-by-side,
> which is definitely not the case for the C-interface representation.
>
> I've created a draft pull request with code that's definitely not ready to
> be merged, but works enough to output a Flatbuffers JSON representation of
> an Arrow schema, so people can see what it would look like, experiment, etc.
>
> As an example, the following Arrow schema:
>
>   std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
>       arrow::field("id", arrow::int64()),
>       arrow::field("cost", arrow::float64()),
>       arrow::field("cost_components", arrow::list(arrow::float64()))};
>   auto schema = arrow::Schema(schema_vector);
>
> translates to (with some reformatting to make things more compact):
>
> {
>   fields: [
>     {name: "id", nullable: true, type_type: "Int", type: {bitWidth: 64, is_signed: true},
>      children: []},
>     {name: "cost", nullable: true, type_type: "FloatingPoint", type: {precision: "DOUBLE"},
>      children: []},
>     {name: "cost_components", nullable: true, type_type: "List", type: {},
>      children: [
>        {name: "item", nullable: true, type_type: "FloatingPoint", type: {precision: "DOUBLE"},
>         children: []}
>      ]}
>   ]
> }
>
> I can definitely see data scientists being able to understand that or make
> small changes without the doc, and even write one from scratch with some
> help from documentation. It could even be made more compact by making a few
> fields optional when empty (children, type).
>
> If you want to try it out on other schemas, here's the pull request:
> https://github.com/apache/arrow/pull/7110
>
> Thoughts?
>
>
> On Thu, Jan 9, 2020 at 08:47, Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > The desired goal for this feature is trivial modifications, e.g.
> > within an editor, by data-scientists and researchers.
> >
> > I'd go for the flatbuffer's json representation as it is stable and
> > has native support in almost any language or editor due to the
> > ubiquity of JSON. The C interface schema string representation is
> > optimized for developers writing parser/codecs and looks like
> > gibberish to anyone not familiar with python's struct format string.
> >
> > François
> >
> >
> > On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai <kai...@heterodb.com> wrote:
> > >
> > > Hello,
> > >
> > > pg2arrow [*1] has '--dump' mode to print out schema definition of the
> > > given Apache Arrow file.
> > > Does it make sense for you?
> > >
> > > $ ./pg2arrow --dump ~/hoge.arrow
> > > [Footer]
> > > {Footer: version=V4, schema={Schema: endianness=little,
> > > fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> > > children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> > > type={Decimal: precision=11, scale=7}, children=[],
> > > custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> > > children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> > > children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> > > type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> > > {Field: name="d", nullable=true, type={Utf8},
> > > dictionary={DictionaryEncoding: id=0, indexType={Int32},
> > > isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> > > nullable=true, type={Timestamp: unit=us}, children=[],
> > > custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> > > children=[], custom_metadata=[]}, {Field: name="random",
> > > nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> > > custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> > > FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> > > bodyLength=128}], recordBatches=[{Block: offset=1232,
> > > metaDataLength=648 bodyLength=386112}]}
> > > [Dictionary Batch 0]
> > > {Block: offset=920, metaDataLength=184 bodyLength=128}
> > > {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> > > length=6, nodes=[{FieldNode: length=6, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> > > {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> > > [Record Batch 0]
> > > {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> > > {Message: version=V4, body={RecordBatch: length=3000,
> > > nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> > > length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> > > {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> > > null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> > > length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> > > {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> > > null_count=0}, {FieldNode: length=3000, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> > > length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> > > offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> > > {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> > > length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> > > length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> > > offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> > > {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> > > length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> > > offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> > > {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> > > length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> > > offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> > > {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
> > >
> > > [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
> > >
> > > On Sat, Dec 7, 2019 at 6:26, Christian Hudon <chr...@elementai.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > For the uses I would like to make of Arrow, I would need a human-readable
> > > > and -writable version of an Arrow Schema, that could be converted to and
> > > > from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> > > > see anything to that effect, with the closest being the ToString() method
> > > > on DataType instances, but which is meant for debugging only. (I need an
> > > > expression of an Arrow Schema that people can read, and that can live
> > > > outside of the code for a particular operation.)
> > > >
> > > > Is a text representation of an Arrow Schema something that is being worked
> > > > on now? If not, would you folks be interested in me putting up an initial
> > > > proposal for discussion? Any design constraints I should pay attention to,
> > > > then?
> > > >
> > > > Thanks,
> > > >
> > > > Christian
> > > > --
> > > >
> > > > │ Christian Hudon
> > > > │ Applied Research Scientist
> > > > Element AI, 6650 Saint-Urbain #500
> > > > Montréal, QC, H2S 3G9, Canada
> > > > Elementai.com
> > >
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <kai...@heterodb.com>
> >
> --
>
> │ Christian Hudon
> │ Applied Research Scientist
> Element AI, 6650 Saint-Urbain #500
> Montréal, QC, H2S 3G9, Canada
> Elementai.com