Hello,

Exciting project, thanks for all your work. I gather it's appropriate to
ask a usage question here? Assuming so:

I have a web application that serves portions of a dataset I've broken into
a few thousand Feather V2 files structured as a quadtree. The structure
makes heavy use of dictionary-encoded text columns; I'd like each
dictionary integer to map to the same string across all files, so that I
can ship the data for each tile straight to the GPU without decoding the
text.
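For context, each tile is currently written along these lines (the path
and column name here are just placeholders):

```
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

# One tile's worth of data, with a categorical (dictionary-encoded) column.
tile = pd.DataFrame({"label": pd.Series(["A", "B", "B"], dtype="category")})

# Written as Feather V2 (the default); the path is illustrative.
feather.write_feather(pa.Table.from_pandas(tile), "tiles/0-0-0.feather")
```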

If you slice a portion of a pandas categorical array and coerce it to an
Arrow dictionary array, it keeps the underlying pandas integer encoding;
for example, the last line below produces a dictionary with four keys even
though the sliced array holds just one element.

```
import pandas as pd
import pyarrow as pa

# Five values, four distinct categories: A, B, C, F.
pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype="category")

# Slicing keeps the full category set: the resulting DictionaryArray
# has indices [2] but a four-entry dictionary ["A", "B", "C", "F"].
pa.Array.from_pandas(pandas_cat[2:3])
```

For my purposes, this is good! But of course it's wasteful, too. So I'm
wondering:

1. Whether it's safe to count on the code above continuing to use the
internal pandas codes in the Arrow output, or whether at some point pyarrow
might re-encode the categories in a more compact way;
2. Whether there's a native pyarrow way to ensure that multiple Feather
dictionaries across files use the same integer identifiers for all the
keys that they share (see the sketch after this list for the kind of
manual re-encoding I'd otherwise fall back on).
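For the second question, the fallback I can picture is re-encoding each
tile's columns against a single shared vocabulary by hand; a minimal
sketch, where GLOBAL_DICT and the column values are stand-ins:

```
import pyarrow as pa
import pyarrow.compute as pc

# A shared vocabulary for the whole dataset (stand-in values).
GLOBAL_DICT = pa.array(["A", "B", "C", "F"])

def encode_with_global_dict(values):
    # Look up each value's position in the shared vocabulary...
    indices = pc.index_in(values, value_set=GLOBAL_DICT)
    # ...and build a dictionary array that always carries the full
    # vocabulary, so every file agrees on the integer -> string mapping.
    return pa.DictionaryArray.from_arrays(indices, GLOBAL_DICT)

# One tile's column, re-encoded against the shared dictionary.
print(encode_with_global_dict(pa.array(["C", "B"])))
```

That works, but it's per-column bookkeeping I'd rather not maintain if
pyarrow already has a mechanism for it.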

I can see that the right way here might be to use the IPC streaming format
rather than Feather, and send out a single schema for the dataset, with
dictionary batches identifying the keys; a sketch of that follows below.
But I'm also attaching table metadata to each Feather file, which I'd hate
to lose.
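Here's that IPC idea with placeholder names (the "tile" metadata key and
all the values are made up); as far as I can tell a stream's schema can
still carry custom metadata:

```
import pyarrow as pa

# One schema for the whole dataset; "tile" is a made-up metadata key.
schema = pa.schema(
    [pa.field("label", pa.dictionary(pa.int32(), pa.string()))],
    metadata={"tile": "0/0/0"},
)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    # The dictionary batch for ["A", "B", "C", "F"] is written once,
    # and record batches then carry only integer indices.
    labels = pa.DictionaryArray.from_arrays(
        pa.array([2], pa.int32()), pa.array(["A", "B", "C", "F"])
    )
    writer.write_batch(pa.record_batch([labels], schema=schema))
```

But that metadata lives on the one shared schema rather than on each
table, so I'm not sure it covers the per-file metadata I'm attaching
today.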

-- 
Benjamin Schmidt
Director of Digital Humanities and Clinical Associate Professor of History
20 Cooper Square, Room 538
New York University

benschmidt.org
