Agreed. I've opened https://issues.apache.org/jira/browse/ARROW-7302 to track it.
Regards

Antoine.


On 03/12/2019 at 04:55, Wes McKinney wrote:
> An option was recently added to dictionary-encode all string columns:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L82
>
> I think it would be useful to be able to hard-opt-in to
> dictionary-encode a particular column (regardless of what the
> cardinality ends up being). Whatever the way to do this, it should be
> clear and well documented. A new JIRA issue may be in order. Antoine,
> what do you think?
>
> On Sun, Dec 1, 2019 at 5:32 PM ntfs hard <ntfs.h...@gmail.com> wrote:
>>
>> Hello
>>
>> I'm a newcomer and not quite sure about the library usage. I tried to find
>> some documentation about it but failed.
>>
>> I have a dataset in a CSV file where one column (let's call it colour) is a
>> string category. I'd like to get indices instead of text_lines to pass
>> to an algorithm.
>> I tried to set column_types in ConvertOptions to
>> {{"colour", arrow::dictionary(std::make_shared<arrow::Int32Type>(),
>> arrow::utf8()) }} but that does not seem to be the right API usage; a
>> run-time error appears: NotImplemented: CSV conversion to dictionary<values=string,
>> indices=int32, ordered=0> is not supported
>> I also found a merged PR #5785 <https://github.com/apache/arrow/pull/5785> but
>> I'm not quite sure it's applicable to my case.
>>
>> So, my question is: can I get indices for a category column using only the
>> library API? And if yes, what am I doing wrong? :)
>>
>> *In other words,* I'd like to do something like this Python pandas code:
>> df[column] = df[column].cat.codes # if str(column_data_type) == "category"
>>
>> Thank you!
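
For readers following the thread, the "indices" asked about in the original question are what dictionary encoding produces: each distinct string is stored once and every row holds an integer pointing at it. Below is a minimal pure-Python sketch of that idea. It is not Arrow's implementation, and `dictionary_encode` is a hypothetical helper name used only for illustration. It assumes codes are assigned in order of first appearance.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), assigning codes in order of first appearance.

    Hypothetical helper for illustration only; not part of Arrow or pandas.
    """
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one integer code per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


colours = ["red", "green", "red", "blue", "green"]
dictionary, indices = dictionary_encode(colours)
print(dictionary)  # ['red', 'green', 'blue']
print(indices)     # [0, 1, 0, 2, 1]
```

Note that pandas' `cat.codes` usually numbers categories in sorted order rather than first-seen order, so the integers may differ from this sketch even though the grouping is the same.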