Hi, I'm using Arrow together with dask to write lots of Parquet files quickly. Pandas tends to forget column types (in my case a string column that can be completely null in some splits), so I'm building a Schema once and manually passing it into pa.Table.from_pandas and pq.ParquetWriter so that all resulting files consistently have the same types.
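Roughly, the writing side looks like this (the column names and types here are just placeholders for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Built once, up front, so every partition is written with the same types,
    # even if e.g. the string column is entirely null in one split.
    schema = pa.schema([
        ("id", pa.int64()),
        ("comment", pa.string()),   # sometimes all-null in a partition
    ])

    def write_partition(df, path):
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        writer = pq.ParquetWriter(path, schema)
        writer.write_table(table)
        writer.close()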
However, since dask is distributed, passing that Schema around means serialising it and sending it to different processes, and this was a bit harder than expected. Simple pickling fails on unpickling with "No type alias for double". Schema does have a .serialize(), but I can't find how to deserialize it again; pa.deserialize complains "Expected to read 923444752 metadata bytes but only read 11284", and it also looks like pa.deserialize is meant for general Python objects. So I've settled on this for now:

    import pyarrow as pa

    def serialize_schema(schema):
        # Write an empty record batch stream; the stream header carries the schema.
        sink = pa.BufferOutputStream()
        writer = pa.RecordBatchStreamWriter(sink, schema)
        writer.close()
        return sink.get_result().to_pybytes()

    def deserialize_schema(buf):
        # Read the schema back out of the stream header.
        buf_reader = pa.BufferReader(buf)
        reader = pa.RecordBatchStreamReader(buf_reader)
        return reader.schema

This works, but it's a bit more involved than I hoped it'd be. Do you have any advice on how this is meant to work?

Thanks,
Andreas
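P.S. In case it helps, this is the quick round-trip check I use to convince myself the two helpers behave (the schema here is again just a placeholder):

    # Hypothetical round-trip check for serialize_schema/deserialize_schema above.
    schema = pa.schema([("id", pa.int64()), ("comment", pa.string())])
    buf = serialize_schema(schema)      # plain bytes, cheap to ship to workers
    assert deserialize_schema(buf).equals(schema)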