[ https://issues.apache.org/jira/browse/BEAM-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526590#comment-17526590 ]
Beam JIRA Bot commented on BEAM-12169:
--------------------------------------

This issue was marked "stale-assigned" and has not received a public comment in 7 days. It is now automatically unassigned. If you are still working on it, you can assign it to yourself again. Please also give an update about the status of the work.

> Allow non-deferred column operations on categorical columns
> -----------------------------------------------------------
>
> Key: BEAM-12169
> URL: https://issues.apache.org/jira/browse/BEAM-12169
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Priority: P3
> Labels: dataframe-api
> Time Spent: 6h 50m
> Remaining Estimate: 0h
>
> There are several operations that we currently disallow because they produce
> a variable set of columns in the output based on the data
> (non-deferred-columns). However, for some dtypes (categorical, boolean) we
> can easily enumerate all the possible values that will be seen at execution
> time, so we can predict the columns that will be seen.
> Note we still can't implement these operations 100% correctly, as pandas will
> typically only create columns for the values that are _observed_, while
> we'd have to create a column for every possible value.
> We should allow these operations in these special cases.
> Operations in this category:
> - DataFrame.unstack, Series.unstack (can work if unstacked level is a
> categorical or boolean column)
> - Series.str.get_dummies
> - Series.str.split
> - Series.str.rsplit
> - DataFrame.pivot
> - DataFrame.pivot_table

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
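
For readers of this thread, the observed-versus-possible distinction in the issue description can be sketched in plain pandas. This is only an illustration under stated assumptions, not Beam's implementation: `predicted_columns` is a hypothetical helper, the sample data is made up, and the eager pivot is run on an object-dtype copy of the column to stand in for the case where only observed values produce columns.

```python
import pandas as pd


def predicted_columns(series):
    """Hypothetical helper: list every value a column could hold, using only its dtype."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        # A categorical dtype declares its full set of values up front.
        return list(series.dtype.categories)
    if series.dtype == bool:
        # A boolean column can only ever hold these two values.
        return [False, True]
    raise TypeError(f"cannot enumerate values for dtype {series.dtype}")


# Made-up sample data: "green" is a declared category that never occurs.
df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "color": pd.Categorical(
        ["red", "blue", "red"], categories=["red", "green", "blue"]
    ),
    "value": [1, 2, 3],
})

# Known without looking at any rows.
columns_from_dtype = predicted_columns(df["color"])

# Eager pivot on an object-dtype copy: columns exist only for observed values.
observed = df.astype({"color": object}).pivot(
    index="key", columns="color", values="value"
)

# A schema fixed in advance needs one column per possible value,
# so unobserved categories show up as all-NaN columns.
padded = observed.reindex(columns=columns_from_dtype)
```

Here `observed` has columns only for `red` and `blue`, while `padded` also carries an all-NaN `green` column. That extra column is the deviation from pandas that the description says would have to be accepted.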