We could use an extension type here: wrap the dictionary type in an
extension type whose metadata contains the expected keys. That way, the keys
are stored in the schema.
On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson wrote:
For what it's worth, I encountered a similar issue in working on the R
bindings: if you're querying a dataset or filtering a dictionary array and
you end up with a ChunkedArray with 0 chunks, you can't populate the factor
levels when converting to R because the type doesn't have the dictionary
valu
How big are your dictionaries typically? What are your upper and lower bounds?
On Wed, Jan 5, 2022 at 10:22 PM David Li wrote:
Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make the
dictionary part of the schema itself (and the format even allows for
dictionaries to be updated over time). I wonder if the dictionary type could be
extended to handle this; alternatively, passing around explicit dict
Hi Rok, David,
I think the problem is that the DictionaryType loses the semantic information
about the categories.
Right now I define the schema for the tables and have logic to parse
files/receive data and convert it into RecordBatches ready for writing. This is
quite simple: for each row we g
Hi Sam,
For categoricals, you likely want an Arrow dictionary array. (See docs at [1].)
For example:
>>> import pyarrow as pa
>>> ty = pa.dictionary(pa.int8(), pa.string())
>>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>> arr
-- dictionary:
[
"a",
"d"
]
-- indices:
[
0,
Hey Sam,
Did you consider DictionaryArray?
(https://arrow.apache.org/docs/python/data.html#dictionary-arrays)
Its to_pandas() will return a pd.Categorical.
Rok
On Wed, Jan 5, 2022 at 3:35 PM Sam Davis wrote:
Hi,
I'm looking at defining a schema for a table where one of the values is
inherently categorical/enumerable and we're ultimately ending up loading it as
a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
For example, the column may always be known to contain the values