Hi Wes,
Yes, the UTF8 ConvertedType is what I was after. Thanks for the helpful
references.
I don't have a good feel for how common this is, but the following test file
caused my confusion between the UTF8 and Binary types in Arrow:
https://github.com/dask/fastparquet/blob/master/test-data/natio
Hi Hatem,
Are you talking about the UTF8 ConvertedType in Parquet?
https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52
AFAIK we do respect that if it is set; otherwise we do not guess:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65
- W
Thanks Antoine, that makes good sense.
We are writing string data using the utf8 data type. This question came up
when trying to read this fastparquet project test file into arrow memory:
fastparquet/test-data/nation.dict.parquet
The name and comment columns result in a binary d
Hi Hatem,
It is intended that the convention is application-dependent. From
Arrow's point of view, the binary string is an opaque blob of data.
Depending on your application, it might be a UTF16-encoded piece of
text, a JPEG image, anything.
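To illustrate the point about opaque blobs, here is a minimal sketch in plain Python (no Arrow involved). The `interpret` helper and the sample value are hypothetical and only for illustration; the same bytes give different text depending on the encoding the application assumes, which is why a reader cannot safely guess.

```python
# The same byte string, interpreted under three different assumptions.
# `interpret` is a hypothetical helper, not part of Arrow or Parquet.

def interpret(blob: bytes, encoding: str) -> str:
    """Decode an opaque blob under an application-chosen encoding."""
    return blob.decode(encoding, errors="replace")

# Bytes as they might sit in a binary column: "naive" with a diaeresis,
# encoded as UTF-8 (6 bytes, because the accented letter takes 2 bytes).
raw = "na\u00efve".encode("utf-8")

as_utf8 = interpret(raw, "utf-8")       # the text the writer intended
as_latin1 = interpret(raw, "latin-1")   # 6 characters, accented letter garbled
as_utf16 = interpret(raw, "utf-16-le")  # 3 unrelated characters

for label, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("utf-16-le", as_utf16)]:
    print(f"{label:10s} -> {text!r} ({len(text)} chars)")
```

None of the three decodes raises an error here, so the bytes alone do not tell you which interpretation is right. That information has to come from metadata or from the application.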
By the way, if you store ASCII text data, I would r