I am running into a problem converting a CSV file into a Parquet file in
chunks, where one of the string columns is null for the first several
million rows: the schema inferred from the first (all-null) chunk does not
match the later chunks that actually contain strings.

Self-contained dummy example:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from IPython.display import display

csv_file = '/tmp/df.csv'
parquet_file = '/tmp/df.parquet'

# One string column 'a' that is null for the first few rows.
df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
df.to_csv(csv_file, index=False, na_rep='.')
display(df)

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
                                      na_values=['.'], dtype={'a': str})):
    print(i)
    display(chunk)
    if i == 0:
        # Infer the Parquet schema from the first (all-null) chunk.
        parquet_schema = pa.Table.from_pandas(chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
parquet_writer.close()
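
One thing I was considering, as a rough untested sketch, is declaring the
schema explicitly up front instead of inferring it from the first, all-null
chunk (assuming pa.schema / pa.field / pa.string are available in this
pyarrow version, and I am not sure how the pandas index is supposed to be
handled with an explicit schema):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/tmp/df.csv'
parquet_file = '/tmp/df.parquet'

# Declare column 'a' as string up front so the all-null first chunk
# cannot drive the inferred type.
parquet_schema = pa.schema([pa.field('a', pa.string())])

parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                  compression='snappy')
for chunk in pd.read_csv(csv_file, chunksize=2,
                         na_values=['.'], dtype={'a': str}):
    # Unclear to me whether the pandas index also needs a field here.
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
parquet_writer.close()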

Any suggestions would be much appreciated.

I am running pyarrow=0.4.1=np112py36_1, installed using conda, on Linux Mint 18.1.

And thanks a lot for developing pyarrow.parquet!
Alexey