Hello Alexey, 

you have discovered a known bug in pyarrow 0.4.1: if a column is made up
only of None objects, writing it to Parquet fails. This is fixed upstream
and will be included in the upcoming 0.5.0 release.
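
For illustration, here is a minimal sketch of the failing case (the file
path and column name are just examples):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A column containing only None values triggers the bug on 0.4.1.
df = pd.DataFrame({'a': [None, None, None]})
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/all_none.parquet')  # fails on 0.4.1, should work on >= 0.5.0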

Uwe


On Sat, Jul 8, 2017, at 04:32 AM, Alexey Strokach wrote:
> I am running into a problem converting a CSV file into a Parquet file in
> chunks, where one of the string columns is null for the first several
> million rows.
> 
> Self-contained dummy example:
> 
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from IPython.display import display
> 
> csv_file = '/tmp/df.csv'
> parquet_file = '/tmp/df.parquet'
> 
> # Column 'a' is NaN for the first rows and only becomes a string later.
> df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
> df.to_csv(csv_file, index=False, na_rep='.')
> display(df)
> 
> for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
>                                        na_values=['.'], dtype={'a': str})):
>     print(i)
>     display(chunk)
>     if i == 0:
>         # Take the schema from the first chunk and reuse it for every write.
>         parquet_schema = pa.Table.from_pandas(chunk).schema
>         parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
>                                           compression='snappy')
>     table = pa.Table.from_pandas(chunk, schema=parquet_schema)
>     parquet_writer.write_table(table)
> 
> # Close the writer once all chunks have been written.
> parquet_writer.close()
> 
> Any suggestions would be much appreciated.
> 
> Running pyarrow=0.4.1=np112py36_1 installed using conda on Linux Mint
> 18.1
> 
> And thanks a lot for developing pyarrow.parquet!
> Alexey
