I am running into a problem converting a CSV file into a Parquet file in
chunks, where one of the string columns is null for the first several
million rows.
Self-contained dummy example:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/tmp/df.csv'
parquet_file = '/tmp/df.parquet'

# Column 'a' is null for the first three rows, so the first chunk
# read back below contains no non-null values at all.
df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
df.to_csv(csv_file, index=False, na_rep='.')
print(df)

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
                                      na_values=['.'], dtype={'a': str})):
    print(i)
    print(chunk)
    if i == 0:
        # Infer the Parquet schema from the first (all-null) chunk.
        parquet_schema = pa.Table.from_pandas(chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
Any suggestions would be much appreciated.
I am running pyarrow 0.4.1 (conda build np112py36_1) on Linux Mint 18.1.
And thanks a lot for developing pyarrow.parquet!
Alexey