IIRC the discovery step already tries to unify the schemas; it's just that schema unification is basically not implemented right now. There's a long-standing Jira/PR [1] that might be good for someone to pick up and push over the finish line.
[1]: https://github.com/apache/arrow/pull/12000

-David

On Thu, Nov 10, 2022, at 13:24, Weston Pace wrote:
>> I’ve done something like this in the past. It was two parts - first figure
>> out the desired schema and then when reading files make them conform to
>> that schema.
>
> Good point. So far I've just been focusing on the second part. There
> is a dataset discovery step that will try and do the first part but it
> isn't terribly flexible at the moment. Improving this is probably
> worth consideration as well.
>
> On Wed, Nov 9, 2022 at 5:25 PM Ben Chambers <bchamb...@apache.org> wrote:
>>
>> I’ve done something like this in the past. It was two parts - first figure
>> out the desired schema and then when reading files make them conform to
>> that schema.
>>
>> The first step could be by specifying the schema or by unioning the
>> schemas. Fields appearing in only some files are treated as null in the
>> others. Fields with different types are up cast.
>>
>> The second step then involves for each file figuring out how to convert to
>> the desired. I found it easiest to do this per column of the desired
>> schema. Then it can be (1) reference a column (2) reference a column and
>> cast or (3) create a column of nulls of a given type.
>>
>> Is something like that you had in mind?
>>
>> On Wed, Nov 9, 2022 at 5:11 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> > From a datasets / Acero perspective I have been thinking about this in
>> > the back of my mind for a while and decided to write my thoughts down
>> > in a document. I will send it in a separate email.
>> >
>> > On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> > >
>> > > Hi Matthew,
>> > > Could you give some more specifics about what language/component you are
>> > > using. In general, Arrow at a specification level doesn't deal with schema
>> > > evolution. Is this in regard to Datasets or a different component?
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon <
>> > > matthew.scan...@exosfinancial.com> wrote:
>> > >
>> > > > Good afternoon, I wanted to reach out and open a dialog about structs, the
>> > > > evolution of them in schemas, and if support for such a feature is on the
>> > > > road map or a hard pass for the arrow team.
>> > > >
>> > > > Currently, it appears structs support removing a field, but will there be
>> > > > support for adding fields later on? Are there any recommended patterns for
>> > > > supporting such a field. For example, if a field foo is a struct with
>> > > > sub_fields A, B and then later field C gets added, the old data can not be
>> > > > loaded using the new schema.
>> > > >
>> > > > Thank you.
>> > > >
>> > > > Matthew Scanlon
>> > > >
>> > > > --
>> > > >
>> > > > Broker-Dealer services offered through Exos Securities LLC, member of
>> > > > SIPC <http://www.sipc.org/> / FINRA <http://www.finra.org/> /
>> > > > BrokerCheck <https://brokercheck.finra.org/> / 2022 Exos, inc. For important
>> > > > disclosures, click here
>> > > > <https://www.exosfinancial.com/general-disclosures>.
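
For illustration only, here is a minimal sketch of the two-part approach Ben describes above: first union the per-file schemas (upcasting on type conflicts), then decide per desired column whether to reference, cast, or fill with nulls. It is plain Python and does not use Arrow's API. Schemas are modeled as dicts of field name to type name, and the names `upcast`, `unify_schemas`, `plan_columns` and the widening order in `_RANK` are all hypothetical, chosen just for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical widening order, for illustration only.
# This is NOT Arrow's type promotion logic.
_RANK = {"int32": 0, "int64": 1, "float64": 2}

Schema = Dict[str, str]  # field name -> type name


def upcast(a: str, b: str) -> str:
    """Return the wider of two type names under the toy ordering above."""
    if a == b:
        return a
    if a not in _RANK or b not in _RANK:
        raise TypeError(f"no common type for {a} and {b}")
    return a if _RANK[a] >= _RANK[b] else b


def unify_schemas(schemas: List[Schema]) -> Schema:
    """Step 1: union the fields of all files, upcasting on type conflicts."""
    unified: Schema = {}
    for schema in schemas:
        for name, typ in schema.items():
            unified[name] = upcast(unified[name], typ) if name in unified else typ
    return unified


def plan_columns(file_schema: Schema, desired: Schema) -> List[Tuple[str, str, str]]:
    """Step 2: for each desired column, pick one of the three actions."""
    plan = []
    for name, typ in desired.items():
        if name not in file_schema:
            plan.append(("nulls", name, typ))      # (3) column of nulls
        elif file_schema[name] == typ:
            plan.append(("reference", name, typ))  # (1) reference as-is
        else:
            plan.append(("cast", name, typ))       # (2) reference and cast
    return plan


# Two files: the newer one widens "a" and adds "c".
old_file = {"a": "int32", "b": "float64"}
new_file = {"a": "int64", "b": "float64", "c": "int64"}

desired = unify_schemas([old_file, new_file])
old_plan = plan_columns(old_file, desired)
new_plan = plan_columns(new_file, desired)
```

With these inputs, the old file needs a cast for `a`, a plain reference for `b`, and a null column for `c`, while the new file only needs references. Nested struct fields, as in Matthew's original question, would need the same logic applied recursively, which this sketch does not attempt.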