Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Andreas Heider
There's been a bunch of work on adding page indices to parquet (https://github.com/apache/parquet-format/blob/master/PageIndex.md). I haven't followed progress in detail, but I think the Java implementation supports this now. Look

(De)serialising schemas in pyarrow

2018-07-18 Thread Andreas Heider
Hi, I'm using Arrow together with dask to quickly write lots of parquet files. Pandas has a tendency to forget column types (in my case it's a string column that might be completely null in some splits), so I'm building a Schema once and then manually passing that Schema into pa.Table.from_pand