OK, awesome!

Thanks for the reply.
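
For anyone else who hits this before 0.5.0 is out, the failing condition is
easy to isolate. Here is a minimal sketch based on Uwe's description (the
file path, column name, and row count are arbitrary, not from the original
report):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A column made up only of None values gives pyarrow no concrete type
# to infer for 'a'; writing such a table to Parquet fails on 0.4.1.
df = pd.DataFrame({'a': [None, None, None]})
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/all_null.parquet')  # raises on pyarrow 0.4.1

Per Uwe's reply, the same snippet should succeed once 0.5.0 ships.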

On Mon, Jul 10, 2017 at 1:42 PM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Alexey,
>
> you discovered a known bug in 0.4.1. If a column is only made up of None
> objects, then writing to Parquet fails. This is fixed upstream and will
> be included in the upcoming 0.5.0 release.
>
> Uwe
>
>
> On Sat, Jul 8, 2017, at 04:32 AM, Alexey Strokach wrote:
> > I am running into a problem converting a CSV file to a Parquet file in
> > chunks, where one of the string columns is null for the first several
> > million rows.
> >
> > Self-contained dummy example:
> >
> > import numpy as np
> > import pandas as pd
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > csv_file = '/tmp/df.csv'
> > parquet_file = '/tmp/df.parquet'
> >
> > # One string column whose first values are all null
> > df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
> > df.to_csv(csv_file, index=False, na_rep='.')
> > display(df)  # display() assumes an IPython/Jupyter session
> >
> > for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
> >                                       na_values=['.'], dtype={'a': str})):
> >     print(i)
> >     display(chunk)
> >     if i == 0:
> >         # Take the schema from the first chunk and reuse it for the rest
> >         parquet_schema = pa.Table.from_pandas(chunk).schema
> >         parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
> >                                           compression='snappy')
> >     table = pa.Table.from_pandas(chunk, schema=parquet_schema)
> >     parquet_writer.write_table(table)  # fails on the all-null first chunk
> >
> > parquet_writer.close()
> >
> > Any suggestions would be much appreciated.
> >
> > I am running pyarrow=0.4.1=np112py36_1, installed using conda, on Linux
> > Mint 18.1.
> >
> > And thanks a lot for developing pyarrow.parquet!
> > Alexey
>
