People ran into similar issues with such all-NA columns with Parquet
as well (with the difference that Parquet actually supports a null
type, but if you have a partitioned dataset, this could lead to
conflicting schemas). The typical workaround is for the user to
provide the schema when writing / converting the data to Arrow. For
this reason, dask, for example, added a "schema" keyword to their
"to_parquet" function
(https://docs.dask.org/en/latest/generated/dask.dataframe.to_parquet.html),
which also allows specifying the type for just one column, leaving
the others to use the normal type inference.
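
As a quick illustration of that keyword (column names and types here
are just made up; this assumes the pyarrow engine):

    import pandas as pd
    import pyarrow as pa
    import dask.dataframe as dd

    # "b" is all-None, so it would be inferred as the null type, which
    # can conflict with other partitions where "b" holds strings.
    pdf = pd.DataFrame({"a": [1, 2, 3], "b": [None, None, None]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # Pin only "b"; "a" still uses the normal type inference.
    ddf.to_parquet("out.parquet", engine="pyarrow",
                   schema={"b": pa.string()})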

Now, for ORC writing in Arrow itself, I agree it would be good to
provide a way to write a column of null type.
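
In the meantime, a rough sketch of the manual workaround: cast any
null-typed columns to some concrete type before writing, since ORC has
no null type (the choice of string as the target type here is
arbitrary):

    import pyarrow as pa
    from pyarrow import orc

    table = pa.table({"a": [1, 2, 3], "b": pa.nulls(3)})  # "b" is null type

    # Build a target schema where null-typed fields become string
    # (an arbitrary choice); all other fields are kept unchanged.
    fields = [
        pa.field(f.name, pa.string()) if pa.types.is_null(f.type) else f
        for f in table.schema
    ]
    orc.write_table(table.cast(pa.schema(fields)), "example.orc")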

On Mon, 22 Nov 2021 at 10:52, Antoine Pitrou <anto...@python.org> wrote:
>
>
> Le 21/11/2021 à 19:48, Ian Joiner a écrit :
> > I see.
> >
> > Now the question is what we should do about such columns in the ORC writer
> > as well as maybe some other writers since the Null type, as opposed to all
> > Null columns of a numeric or binary type, doesn’t exist in such formats.
>
> We could perhaps add an option to silently turn them into another type,
> but they wouldn't roundtrip properly unless we also serialize the Arrow
> schema as we do in Parquet.

Storing the schema similarly to how we do for Parquet might be a good
idea in general to improve roundtripping? Not only for the null type,
but e.g. also for timestamp resolution and timezones.
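
For context, Parquet does this by storing the serialized Arrow schema
in the file's key-value metadata under the "ARROW:schema" key, which is
what makes e.g. a non-UTC timezone roundtrip (a small illustration,
file name made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table(
        {"ts": pa.array([0], type=pa.timestamp("us", tz="Europe/Brussels"))}
    )
    pq.write_table(table, "example.parquet")

    # The serialized Arrow schema is stored in the key-value metadata
    print(b"ARROW:schema" in pq.read_metadata("example.parquet").metadata)
    # Reading back restores the original timezone from that metadata
    print(pq.read_table("example.parquet").schema.field("ts").type)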

>
> For now, people will have to detect such columns and cast them manually,
> I think.
>
> Regards
>
> Antoine.
