I am running into a problem converting a CSV file into a Parquet file in chunks, where one of the string columns is null for the first several million rows, so the schema inferred from the first chunk does not match the later chunks that actually contain strings.
Self-contained dummy example:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    csv_file = '/tmp/df.csv'
    parquet_file = '/tmp/df.parquet'

    df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
    df.to_csv(csv_file, index=False, na_rep='.')
    display(df)  # display() assumes an IPython/Jupyter session

    for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
                                          na_values=['.'], dtype={'a': str})):
        print(i)
        display(chunk)
        if i == 0:
            # Infer the Parquet schema from the first chunk,
            # whose column 'a' is entirely null
            parquet_schema = pa.Table.from_pandas(chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                              compression='snappy')
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()

Any suggestions would be much appreciated.

Running pyarrow=0.4.1=np112py36_1, installed using conda, on Linux Mint 18.1.

And thanks a lot for developing pyarrow.parquet!

Alexey
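P.S. In case it clarifies what I am after: one direction I have been considering is to declare the Parquet schema explicitly instead of inferring it from the (all-null) first chunk, roughly as in the sketch below. This is only a sketch of the idea, not something I have verified on 0.4.1; in particular I am assuming that pa.Table.from_pandas can cast an all-null object column to the declared string type, and the preserve_index=False argument is just my guess at keeping the DataFrame index out of the schema.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    csv_file = '/tmp/df.csv'
    parquet_file = '/tmp/df.parquet'

    # Fix column 'a' as a nullable string up front so an all-null first chunk
    # cannot skew the inferred type.
    parquet_schema = pa.schema([('a', pa.string())])

    parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                      compression='snappy')
    for chunk in pd.read_csv(csv_file, chunksize=2,
                             na_values=['.'], dtype={'a': str}):
        # Cast each chunk to the fixed schema; NaN entries become Parquet nulls.
        table = pa.Table.from_pandas(chunk, schema=parquet_schema,
                                     preserve_index=False)
        parquet_writer.write_table(table)
    parquet_writer.close()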