Hi, I'm using Arrow together with dask to write lots of Parquet files quickly. Pandas tends to forget column types (in my case a string column that can be completely null in some splits), so I'm building a Schema once and manually passing it into pa.Table.from_pandas and pq.ParquetWriter so that all resulting files consistently have the same types.
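Roughly, the writing side looks like this (the column names and types here are just placeholders for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Built once, up front, so every partition is written with the same types,
    # even if e.g. the string column is entirely null in one split.
    schema = pa.schema([
        ("id", pa.int64()),
        ("comment", pa.string()),   # sometimes all-null in a partition
    ])

    def write_partition(df, path):
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        writer = pq.ParquetWriter(path, schema)
        writer.write_table(table)
        writer.close()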
However, since dask is distributed, passing that Schema around means serialising it and sending it to different processes, and this was a bit harder than expected. Simple pickling fails on unpickling with "No type alias for double". Schema does have a .serialize(), but I can't find how to deserialize it again; pa.deserialize complains "Expected to read 923444752 metadata bytes but only read 11284", and it also looks like pa.deserialize is meant for general Python objects. So I've settled on this for now:

    import pyarrow as pa

    def serialize_schema(schema):
        # Write an empty record batch stream; the stream header carries the schema.
        sink = pa.BufferOutputStream()
        writer = pa.RecordBatchStreamWriter(sink, schema)
        writer.close()
        return sink.get_result().to_pybytes()

    def deserialize_schema(buf):
        # Read the schema back out of the stream header.
        buf_reader = pa.BufferReader(buf)
        reader = pa.RecordBatchStreamReader(buf_reader)
        return reader.schema

This works, but it's a bit more involved than I hoped it'd be. Do you have any advice on how this is meant to work?

Thanks,
Andreas
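P.S. In case it helps, this is the quick round-trip check I use to convince myself the two helpers behave (the schema here is again just a placeholder):

    # Hypothetical round-trip check for serialize_schema/deserialize_schema above.
    schema = pa.schema([("id", pa.int64()), ("comment", pa.string())])
    buf = serialize_schema(schema)      # plain bytes, cheap to ship to workers
    assert deserialize_schema(buf).equals(schema)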