Thank you both. I hadn't read the IPC documentation closely enough to understand that it supported metadata at the message level. It seems like the best approach in my case is then probably to flush the dataset to separate files as a large number of IPC message batches, and send the schema and the complete version of the dictionary as just one message each.
On Thu, Oct 8, 2020 at 12:28 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I can't speak to whether Pandas conversion will ever change. Someone else
> can potentially chime in. I don't recall any JIRAs recently changing this
> type of conversion, however currently for library functionality there
> aren't any hard guarantees for backwards compatibility (generally we try to
> do our best to not break things).
>
> > I can see that the right way here might be to use the IPC streaming format
> > rather than feather, and send out a single schema for the dataset, with
> > dictionary batches identifying the keys.
>
> Feather V2 should be the same as the Arrow file format, which is different
> than the stream format. There is a direct writer [1] for this as well, so
> if you have the ability to construct your arrow tables directly from the
> same dictionary, this would be the best way of ensuring any changes to the
> Pandas conversion would not impact you.
>
> [1]
> https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files
>
> On Wed, Oct 7, 2020 at 10:44 AM Jacob Quinn <quinn.jac...@gmail.com>
> wrote:
>
> > > But I'm also attaching table
> > > metadata to each feather, which I'd hate to lose.
> >
> > Note the arrow format allows attaching custom metadata at the column
> > (field), schema, and message level, so it should be possible to retain any
> > metadata this way.
> >
> > -Jacob
> >
> > On Wed, Oct 7, 2020 at 11:38 AM Benjamin MacDonald Schmidt <
> > bmschm...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > Exciting project, thanks for all your work. I gather it's appropriate to
> > > ask a use question here? Assuming so:
> > >
> > > I have a web application that serves portions of a dataset I've broken into
> > > a few thousand featherV2 files structured as a quadtree. The structure
> > > makes heavy use of text dictionary types; I'd like to have each dictionary
> > > integer map to the same string across all files so that I can ship the data
> > > for each tile straight to GPU without decoding the text.
> > >
> > > If you slice a portion of a pandas categorical array and coerce to an arrow
> > > dictionary, you keep the underlying pandas integer encoding; for example,
> > > the last line here shows a dictionary with four keys even though the table
> > > has just one row.
> > >
> > > ```
> > > import pandas as pd
> > > import pyarrow as pa
> > > pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype = "category")
> > > pa.Array.from_pandas(pandas_cat[2:3])
> > > ```
> > >
> > > For my purposes, this is good! But of course it's wasteful, too. So I'm
> > > wondering:
> > >
> > > 1. Whether it's safe to count on the above code continuing to use the
> > > internal pandas keys in the arrow output, or whether at some point it might
> > > redo the pandas encoding in a more efficient way;
> > > 2. Whether there's a native pyarrow way to ensure that multiple feather
> > > dictionaries across files use the same integer identifiers for all the keys
> > > that they share.
> > >
> > > I can see that the right way here might be to use the IPC streaming format
> > > rather than feather, and send out a single schema for the dataset, with
> > > dictionary batches identifying the keys. But I'm also attaching table
> > > metadata to each feather, which I'd hate to lose.
> > >
> > > --
> > > Benjamin Schmidt
> > > Director of Digital Humanities and Clinical Associate Professor of History
> > > 20 Cooper Square, Room 538
> > > New York University
> > >
> > > benschmidt.org

--
Benjamin Schmidt
Director of Digital Humanities and Clinical Associate Professor of History
20 Cooper Square, Room 538
New York University

benschmidt.org