Hello Alexey, 

you have discovered a known bug in pyarrow 0.4.1: if a column is made up
only of None objects, writing it to Parquet fails. This is fixed upstream
and will be included in the upcoming 0.5.0 release.
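
For illustration, here is a minimal sketch of the failing case (the file
path and column name are just examples):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A column containing only None values triggers the bug on 0.4.1.
df = pd.DataFrame({'a': [None, None, None]})
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/all_none.parquet')  # fails on 0.4.1, should work on >= 0.5.0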

Uwe


On Sat, Jul 8, 2017, at 04:32 AM, Alexey Strokach wrote:
> I am running into a problem converting a CSV file into a Parquet file in
> chunks, where one of the string columns is null for the first several
> million rows.
> 
> Self-contained dummy example:
> 
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from IPython.display import display
> 
> csv_file = '/tmp/df.csv'
> parquet_file = '/tmp/df.parquet'
> 
> # Column 'a' is NaN for the first rows and only becomes a string later.
> df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
> df.to_csv(csv_file, index=False, na_rep='.')
> display(df)
> 
> for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
>                                        na_values=['.'], dtype={'a': str})):
>     print(i)
>     display(chunk)
>     if i == 0:
>         # Take the schema from the first chunk and reuse it for every write.
>         parquet_schema = pa.Table.from_pandas(chunk).schema
>         parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
>                                           compression='snappy')
>     table = pa.Table.from_pandas(chunk, schema=parquet_schema)
>     parquet_writer.write_table(table)
> 
> # Close the writer once all chunks have been written.
> parquet_writer.close()
> 
> Any suggestions would be much appreciated.
> 
> Running pyarrow=0.4.1=np112py36_1 installed using conda on Linux Mint
> 18.1
> 
> And thanks a lot for developing pyarrow.parquet!
> Alexey
