Hello,

I just wanted to follow up on our previous conversation and gain a bit more insight into how pyarrow tables behave when reading from and writing to dataframes. I have noticed some interesting behavior related to the issue mentioned above: it does not seem to occur when going directly to/from pandas, only when going through a dataset.

For example, if I have a schema

    import pyarrow

    schema = pyarrow.schema(
        [
            pyarrow.field(
                'column_name',
                pyarrow.list_(
                    pyarrow.struct(
                        [
                            pyarrow.field('A', pyarrow.string()),
                            pyarrow.field('B', pyarrow.string()),
                        ],
                    ),
                ),
            )
        ]
    )

But inside my dataframe the column has data

    [{A: 1}, {A: 2}, {A: 3}]

Doing something like

    table = pyarrow.Table.from_pandas(df, schema)

results in a clean table that can be brought back to a dataframe with table.to_pandas(), where column_name now holds [{A: 1, B: None} ... ]

But if I do something like

    import pyarrow.dataset

    ds = pyarrow.dataset.dataset(
        source=path,
        schema=schema,
        format='parquet',
        partitioning='hive',
    ).to_table()

I get

    struct fields don't match or are in the wrong order

Any thoughts on why this is? I suspect that somewhere along the way pyarrow is being stricter with the parquet file, since the file has a defined structure of its own. Is there a way to ignore this and get behavior closer to the pandas <--> pyarrow round trip?

Thanks

On Wed, Nov 9, 2022 at 8:25 PM Ben Chambers <bchamb...@apache.org> wrote:

> I’ve done something like this in the past. It was two parts - first figure out the desired schema and then when reading files make them conform to that schema.
>
> The first step could be by specifying the schema or by unioning the schemas. Fields appearing in only some files are treated as null in the others. Fields with different types are up cast.
>
> The second step then involves for each file figuring out how to convert to the desired. I found it easiest to do this per column of the desired schema. Then it can be (1) reference a column (2) reference a column and cast or (3) create a column of nulls of a given type.
>
> Is something like that you had in mind?
>
> On Wed, Nov 9, 2022 at 5:11 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > From a datasets / Acero perspective I have been thinking about this in the back of my mind for a while and decided to write my thoughts down in a document. I will send it in a separate email.
> >
> > On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Matthew,
> > > Could you give some more specifics about what language/component you are using. In general, Arrow at a specification level doesn't deal with schema evolution. Is this in regard to Datasets or a different component?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon <matthew.scan...@exosfinancial.com> wrote:
> > >
> > > > Good afternoon, I wanted to reach out and open a dialog about structs, the evolution of them in schemas, and if support for such a feature is on the road map or a hard pass for the arrow team.
> > > >
> > > > Currently, it appears structs support removing a field, but will there be support for adding fields later on? Are there any recommended patterns for supporting such a field. For example, if a field foo is a struct with sub_fields A, B and then later field C gets added, the old data can not be loaded using the new schema.
> > > >
> > > > Thank you.
> > > >
> > > > Matthew Scanlon
> > > >
> > > > --
> > > >
> > > > Broker-Dealer services offered through Exos Securities LLC, member of SIPC <http://www.sipc.org/> / FINRA <http://www.finra.org/> / BrokerCheck <https://brokercheck.finra.org/>/ 2022 Exos, inc. For important disclosures, click here <https://www.exosfinancial.com/general-disclosures>.