Hi Andreas,

You want the read_schema method -- much simpler. Check out the unit tests:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L502
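Roughly, the round trip looks like this (untested sketch off the top of my head; read_schema is also exposed as pa.ipc.read_schema in some versions):

import pyarrow as pa

schema = pa.schema([('col', pa.string())])

# Schema.serialize() returns an Arrow Buffer with the IPC-encoded schema
buf = schema.serialize()

# read_schema reconstructs the Schema from that buffer
result = pa.read_schema(buf)
assert result.equals(schema)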
Would be nice to have this in the Sphinx docs.

- Wes

On Wed, Jul 18, 2018 at 2:41 PM, Andreas Heider <andr...@heider.io> wrote:
> Hi,
>
> I'm using Arrow together with dask to quickly write lots of Parquet files.
> Pandas has a tendency to forget column types (in my case it's a string
> column that might be completely null in some splits), so I'm building a
> Schema once and then manually passing that Schema into pa.Table.from_pandas
> and pq.ParquetWriter so that all resulting files consistently have the same
> types.
>
> However, because dask is distributed, passing around that Schema involves
> serialising it and sending it to different processes, and this turned out
> to be harder than expected.
>
> Simple pickling fails with "No type alias for double" on unpickling.
>
> Schema does have a .serialize(), but I can't find how to deserialize it
> again? pa.deserialize says "Expected to read 923444752 metadata bytes but
> only read 11284". It also looks like pa.deserialize is meant for Python
> objects.
>
> So I've settled on this for now:
>
> def serialize_schema(schema):
>     sink = pa.BufferOutputStream()
>     writer = pa.RecordBatchStreamWriter(sink, schema)
>     writer.close()
>     return sink.get_result().to_pybytes()
>
> def deserialize_schema(buf):
>     buf_reader = pa.BufferReader(buf)
>     reader = pa.RecordBatchStreamReader(buf_reader)
>     return reader.schema
>
> This works, but it's a bit more involved than I hoped it would be.
>
> Do you have any advice on how this is meant to work?
>
> Thanks,
> Andreas
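P.S. For the dask case, going through plain bytes keeps things picklable. A sketch (the helper names here are just illustrative):

import pyarrow as pa

def schema_to_bytes(schema):
    # Buffer -> bytes, so the result can be pickled and shipped to workers
    return schema.serialize().to_pybytes()

def schema_from_bytes(raw):
    # wrap the raw bytes in an Arrow buffer and read the Schema back out
    return pa.read_schema(pa.py_buffer(raw))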