Thomas Buhrmann created ARROW-7168:
--------------------------------------

             Summary: pa.array() doesn't respect provided dictionary type with all NaNs
                 Key: ARROW-7168
                 URL: https://issues.apache.org/jira/browse/ARROW-7168
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.15.1
            Reporter: Thomas Buhrmann


This might be related to ARROW-6548 and other issues dealing with all-NaN columns. 
When creating a dictionary array, the requested type is not respected when the data 
contains only NaNs, even if that type is fully specified:


{code:python}
import pandas as pd
import pyarrow as pa

# This may look a little artificial but easily occurs when processing
# categorical data in batches and a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type
{code}

results in

{noformat}
>> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
{noformat}

which means that one cannot, e.g., serialize batches of categoricals if all-NaN 
batches can occur, even when trying to enforce the same schema for every batch 
(because the schema is not respected).
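
To make the mismatch concrete, here is a minimal sketch of the check that fails when 
trying to enforce a fixed schema (the field name "cat" is just an illustrative 
placeholder):

{code:python}
import pandas as pd
import pyarrow as pa

typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
schema = pa.schema([pa.field('cat', typ)])  # the schema every batch should satisfy

ser = pd.Series([None, None]).astype('object').astype('category')
arr = pa.array(ser, type=typ)  # the requested type is ignored for the all-NaN batch
arr.type.equals(typ)           # False -> this batch no longer matches the schema
{code}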

I understand that inferring the type would be difficult here, but I'd expect a fully 
specified type to be respected regardless?

In the meantime, is there a workaround to manually create a dictionary array of 
the desired type containing only NaNs?
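
One workaround that seems like it should work (though I'm not sure it's the intended 
approach) is to build the all-null array explicitly from null indices and an empty 
value array via DictionaryArray.from_arrays:

{code:python}
import pyarrow as pa

typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)

# Null indices plus an empty string dictionary, so the value type stays string
indices = pa.array([None, None], type=typ.index_type)
dictionary = pa.array([], type=typ.value_type)
arr = pa.DictionaryArray.from_arrays(indices, dictionary)
arr.type  # dictionary<values=string, indices=int8, ordered=0>
{code}

Still, it would be good to know whether pa.array(..., type=typ) is supposed to handle 
this case directly.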



