Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Jorge Cardoso Leitão
We could use an extension type here: wrap the dictionary type in an extension type whose metadata contains the expected keys. This way the keys are stored in the schema. On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson wrote: > For what it's worth, I encountered a similar issue in working on the

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Neal Richardson
For what it's worth, I encountered a similar issue in working on the R bindings: if you're querying a dataset or filtering a dictionary array and you end up with a ChunkedArray with 0 chunks, you can't populate the factor levels when converting to R because the type doesn't have the dictionary valu

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Rok Mihevc
How big are your dictionaries typically? What are your upper and lower bounds? On Wed, Jan 5, 2022 at 10:22 PM David Li wrote: > > Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make > the dictionary part of the schema itself (and the format even allows for > dictionaries

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread David Li
Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make the dictionary part of the schema itself (and the format even allows for dictionaries to be updated over time). I wonder if the dictionary type could be extended to handle this; alternatively, passing around explicit dict

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Sam Davis
Hi Rok, David, I think the problem is that the DictionaryType loses the semantic information about the categories. Right now I define the schema for the tables and have logic to parse files/receive data and convert it into RecordBatches ready for writing. This is quite simple: for each row we g

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread David Li
Hi Sam, For categoricals, you likely want an Arrow dictionary array. (See docs at [1].) For example:
>>> import pyarrow as pa
>>> ty = pa.dictionary(pa.int8(), pa.string())
>>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>> arr
-- dictionary: [ "a", "d" ]
-- indices: [ 0,

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Rok Mihevc
Hey Sam, Did you consider DictionaryArray? (https://arrow.apache.org/docs/python/data.html#dictionary-arrays) Its to_pandas will return pd.Categorical. Rok On Wed, Jan 5, 2022 at 3:35 PM Sam Davis wrote: > > Hi, > > I'm looking at defining a schema for a table where one of the values is > inh

[Question][Python] Columns with Limited Value Set

2022-01-05 Thread Sam Davis
Hi, I'm looking at defining a schema for a table where one of the values is inherently categorical/enumerable and we're ultimately ending up loading it as a Pandas DataFrame. I cannot seem to find a decent way of achieving this. For example, the column may always be known to contain the values