Martin Durant created ARROW-3246:
------------------------------------

             Summary: direct reading/writing of pandas categoricals
                 Key: ARROW-3246
                 URL: https://issues.apache.org/jira/browse/ARROW-3246
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Martin Durant


Parquet supports "dictionary encoding" of column data, which is very similar 
to the concept of Categoricals in pandas. It is natural to use this encoding 
when writing a column which originated as a categorical. Conversely, when 
loading, if the file metadata says that a given column came from a pandas (or 
arrow) categorical, then we can trust that the whole column is 
dictionary-encoded and load the data directly into a categorical column, rather 
than expanding the labels upon load and recategorising later.
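
To make the correspondence concrete, here is a minimal conceptual sketch in 
plain Python. It is not the pyarrow or fastparquet API; the helper names 
dictionary_encode and expand are made up for illustration. The (labels, codes) 
pair plays the role of a Parquet dictionary page plus its indices, and equally 
of the categories and codes of a pandas Categorical.

```python
# Conceptual sketch only: plain Python, not the pyarrow or fastparquet API.
# `dictionary_encode` and `expand` are hypothetical helper names.

def dictionary_encode(values):
    """Split values into (labels, codes): unique labels plus integer indices."""
    labels = []      # unique values, in order of first appearance
    positions = {}   # label -> index into `labels`
    codes = []       # one integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(labels)
            labels.append(value)
        codes.append(positions[value])
    return labels, codes


def expand(labels, codes):
    """Materialise every value again (the step a direct load would skip)."""
    return [labels[code] for code in codes]


column = ["red", "green", "red", "blue", "green", "red"]
labels, codes = dictionary_encode(column)
```

Loading "directly into a categorical" would mean handing labels and codes to 
the categorical as they are, instead of calling something like expand and then 
re-deriving the same two lists afterwards.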

If the data does not have the pandas metadata, then the guarantee cannot hold: 
we cannot assume either that the whole column is dictionary-encoded or that 
the labels are the same throughout (for example, from one row group to the 
next). In this case, the current behaviour is fine.
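
As a rough illustration of why both conditions matter, the sketch below uses 
plain Python tuples of (labels, codes) to stand in for the chunks of one 
column. The function name same_labels_throughout and the tuple layout are 
assumptions for this example only, not anything in arrow or fastparquet.

```python
# Hypothetical sketch: each chunk is a (labels, codes) pair, with labels set to
# None when that chunk was not dictionary-encoded. Not a real library API.

def same_labels_throughout(chunks):
    """True only if every chunk is dictionary-encoded with identical labels."""
    if not chunks:
        return False
    first_labels = chunks[0][0]
    if first_labels is None:
        return False
    return all(labels == first_labels for labels, _ in chunks)


chunk_a = (["x", "y"], [0, 1, 0])
chunk_b = (["y", "z"], [0, 1, 1])          # different labels
chunk_plain = (None, ["x", "y", "x"])      # values stored without a dictionary

# Code 0 means "x" in chunk_a but "y" in chunk_b, so the codes of the two
# chunks cannot simply be concatenated into one categorical.
consistent = same_labels_throughout([chunk_a, chunk_a])
mixed_labels = same_labels_throughout([chunk_a, chunk_b])
mixed_encoding = same_labels_throughout([chunk_a, chunk_plain])
```

With the pandas metadata present, a check of this kind could be skipped; 
without it, expanding the labels as is done today is the safe choice.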

 

(Please forgive that some of this has already been mentioned elsewhere; this is 
one of the entries in the list at 
[https://github.com/dask/fastparquet/issues/374], included there as a feature 
that is useful in fastparquet.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
