date:20220105

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Jorge Cardoso Leitão

We could use an extension type here: wrap the dictionary type in an extension type whose metadata contains the expected keys. This way the keys are stored in the schema. On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson wrote: > For what it's worth, I encountered a similar issue in working on the

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Neal Richardson

For what it's worth, I encountered a similar issue in working on the R bindings: if you're querying a dataset or filtering a dictionary array and you end up with a ChunkedArray with 0 chunks, you can't populate the factor levels when converting to R because the type doesn't have the dictionary valu

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Rok Mihevc

How big are your dictionaries typically? What are your upper and lower bounds? On Wed, Jan 5, 2022 at 10:22 PM David Li wrote: > > Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make > the dictionary part of the schema itself (and the format even allows for > dictionaries

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread David Li

Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make the dictionary part of the schema itself (and the format even allows for dictionaries to be updated over time). I wonder if the dictionary type could be extended to handle this; alternatively, passing around explicit dict

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Sam Davis

Hi Rok, David, I think the problem is that the DictionaryType loses the semantic information about the categories. Right now I define the schema for the tables and have logic to parse files/receive data and convert it into RecordBatches ready for writing. This is quite simple: for each row we g

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread David Li

Hi Sam, For categoricals, you likely want an Arrow dictionary array. (See docs at [1].) For example:
>>> import pyarrow as pa
>>> ty = pa.dictionary(pa.int8(), pa.string())
>>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>> arr
-- dictionary:
  [
    "a",
    "d"
  ]
-- indices:
  [
    0,

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Rok Mihevc

Hey Sam, Did you consider DictionaryArray? (https://arrow.apache.org/docs/python/data.html#dictionary-arrays) Its to_pandas will return pd.Categorical. Rok On Wed, Jan 5, 2022 at 3:35 PM Sam Davis wrote: > > Hi, > > I'm looking at defining a schema for a table where one of the values is > inh

[Question][Python] Columns with Limited Value Set

2022-01-05 Thread Sam Davis

Hi, I'm looking at defining a schema for a table where one of the values is inherently categorical/enumerable and we're ultimately ending up loading it as a Pandas DataFrame. I cannot seem to find a decent way of achieving this. For example, the column may always be known to contain the values

