IIRC the discovery step already tries to unify the schemas; it's just that
schema unification is basically not implemented right now. There's a
long-standing Jira/PR [1] that might be good for someone to pick up and push
over the finish line.

[1]: https://github.com/apache/arrow/pull/12000

-David

On Thu, Nov 10, 2022, at 13:24, Weston Pace wrote:
>> I’ve done something like this in the past. It was two parts - first figure
>> out the desired schema and then when reading files make them conform to
>> that schema.
>
> Good point.  So far I've just been focusing on the second part.  There
> is a dataset discovery step that will try and do the first part but it
> isn't terribly flexible at the moment.  Improving this is probably
> worth consideration as well.
>
> On Wed, Nov 9, 2022 at 5:25 PM Ben Chambers <bchamb...@apache.org> wrote:
>>
>> I’ve done something like this in the past. It was two parts - first figure
>> out the desired schema and then when reading files make them conform to
>> that schema.
>>
>> The first step could be done by specifying the schema or by unioning the
>> schemas. Fields appearing in only some files are treated as null in the
>> others. Fields with different types are upcast.
>>
>> The second step then involves figuring out, for each file, how to convert
>> to the desired schema. I found it easiest to do this per column of the
>> desired schema. Each column is then one of: (1) reference a column, (2)
>> reference a column and cast, or (3) create a column of nulls of a given type.
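
The per-column approach described above could be sketched roughly as
follows, again in plain Python rather than any Arrow API. Columns are lists,
schemas are dicts of field name to type name, and `CASTS`, `plan_column`,
and `conform` are hypothetical names introduced only for this illustration.

```python
# Hypothetical sketch: columns are Python lists, schemas are name -> type name.
# Assumed conversion functions for the cast case; illustrative only.
CASTS = {"int64": int, "float64": float, "string": str}


def plan_column(name, target_type, file_schema):
    """Decide how one column of the desired schema is produced from a file."""
    if name not in file_schema:
        return ("nulls", target_type)    # (3) column of nulls of a given type
    if file_schema[name] == target_type:
        return ("ref", name)             # (1) reference a column
    return ("cast", name, target_type)   # (2) reference a column and cast


def conform(columns, file_schema, target_schema):
    """Make one file's columns conform to the desired schema."""
    num_rows = len(next(iter(columns.values()), []))
    out = {}
    for name, target_type in target_schema.items():
        step = plan_column(name, target_type, file_schema)
        if step[0] == "ref":
            out[name] = columns[name]
        elif step[0] == "cast":
            convert = CASTS[target_type]
            out[name] = [None if v is None else convert(v) for v in columns[name]]
        else:
            out[name] = [None] * num_rows
    return out


old_schema = {"A": "int32", "B": "string"}
old_data = {"A": [1, 2], "B": ["x", "y"]}
target = {"A": "int64", "B": "string", "C": "float64"}
conformed = conform(old_data, old_schema, target)
```

For the old file, A is cast, B is referenced as-is, and C is filled with
nulls. Presumably the same plan could be applied recursively to the
sub-fields of a struct column, though that is not shown here.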
>>
>> Is that something like what you had in mind?
>>
>> On Wed, Nov 9, 2022 at 5:11 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> > From a datasets / Acero perspective I have been thinking about this in
>> > the back of my mind for a while and decided to write my thoughts down
>> > in a document.  I will send it in a separate email.
>> >
>> > On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> > >
>> > > Hi Matthew,
>> > > Could you give some more specifics about what language/component you are
>> > > using?  In general, Arrow at a specification level doesn't deal with
>> > > schema evolution.  Is this in regard to Datasets or a different component?
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon <
>> > > matthew.scan...@exosfinancial.com> wrote:
>> > >
>> > > > Good afternoon, I wanted to reach out and open a dialog about structs,
>> > > > the evolution of them in schemas, and whether support for such a
>> > > > feature is on the road map or a hard pass for the Arrow team.
>> > > >
>> > > > Currently, it appears structs support removing a field, but will there
>> > > > be support for adding fields later on? Are there any recommended
>> > > > patterns for supporting such a field? For example, if a field foo is a
>> > > > struct with sub_fields A, B and then later field C gets added, the old
>> > > > data cannot be loaded using the new schema.
>> > > >
>> > > > Thank you.
>> > > >
>> > > > Matthew Scanlon
>> > > >
>> >
