I'd be happy to pitch in on getting the integration tests developed. It would certainly beat my current method of building and running my test project and switching over to a Jupyter notebook to manually check it.
Is there any prior work in the Rust project that I could basically copy from? Or perhaps the C++ implementation? On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <wesmck...@gmail.com> wrote: > hi Paul, > > Dictionary-encoded is not a nested type, so there shouldn't be any > children -- the IPC layout of a dictionary encoded field is that same > as the type of the indices (probably want to change the terminology in > the Rust library from "keys" to "indices" which is what's used in the > specification). And the dictionaries are sent separately in record > batches > > > The Rust implementation of DictionaryArray keeps the values as a child > to the keys array. > > What we do in C++ is that the dictionary is a separate member of the > ArrayData data structure. I don't know enough about the Rust library > to know whether that's the right design choice or not. > > Implementing Rust support for integration tests would make it much > easier to validate that things are implemented correctly. Does anyone > know how far away the Rust library might be from having them? It's not > the most fun thing to implement, but it's very important. > > - Wes > > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <p...@influxdata.com> wrote: > > > > I managed to get something up and running. I ended up creating a > > dictionary_batch.rs and adding that to convert.rs to translate > dictionary > > fields in a schema over to the correct fb thing. I also added a method to > > writer.rs to convert that to bytes so it can be sent via ipc. However, > when > > writing out the subsequent record batches that contain a dictionary field > > that references the dictionary sent in the dictionary batch, it runs > into a > > problem. The Rust implementation of DictionaryArray keeps the values as a > > child to the keys array. > > > > You can see the chain of calls when calling finish on the > > StringDictionaryBuilder > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316 > > which calls out to finish the dictionary > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482 > > which then calls from to create the DictionaryArray > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927 > > And that last part validates that the child is there. > > > > And here in the writer is where it blindly writes out any children: > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378 > > > > I found that when I sent this out, the Python reader on the other side > > would choke on it when I tried to read_pandas. However, if I skip > > serializing the children for DictionaryArray only, then everything works > > exactly as expected. > > > > So I'm not sure what the right thing here is because I don't know if > > DictionaryArrays should not include the children (i.e. values) when being > > written out or if there's some other thing that should be happening. > Should > > the raw arrow data have the values in that dictionary? Or does it get > > reconstituted manually on the other side from the DictionaryBatch? > > > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > As another item for consideration -- in C++ at least, the dictionary > > > id is dealt with as an internal detail of the IPC message production > > > process. When serializing the Schema, id's are assigned to each > > > dictionary-encoded field in the DictionaryMemo object, see > > > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h > > > > > > When record batches are reconstructed, the dictionary corresponding to > > > an id at the time of reconstruction is set in the Array's internal > > > data -- that's the "dictionary" member of the ArrayData object > > > ( > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231). > > > > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > > > > hey Paul, > > > > > > > > Take a look at how dictionaries work in the IPC protocol > > > > > > > > > > > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc > > > > > > > > Dictionaries are sent as separate messages. When a field is tagged as > > > > dictionary encoded in the schema, the IPC reader must keep track of > > > > the dictionaries it's seen come across the protocol and then set them > > > > in the reconstructed record batches when a record batch comes > through. > > > > > > > > Note that the protocol now supports dictionary deltas (dictionaries > > > > can be appended to by subsequent messages for the same dictionary id) > > > > and replacements (new dictionary for an id). > > > > > > > > I don't know what the status of handling dictionaries in the Rust > IPC, > > > > but it would be a good idea to take time to take into account the > > > > above details. > > > > > > > > Finally, note that Rust is not participating in either the regular > IPC > > > > nor Flight integration tests. This is an important milestone to being > > > > able to depend on the Rust library in production. > > > > > > > > Thanks > > > > Wes > > > > > > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com> > wrote: > > > > > > > > > > Hello, > > > > > I'm trying to build a Rust based Flight server and I'd like to use > > > > > Dictionary encoding for a number of string columns in my data. I've > > > seen > > > > > that StringDictionary was recently added to Rust here: > > > > > > > > > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab > > > . > > > > > > > > > > However, that doesn't seem to reach down into Flight. When I > attempt to > > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it > > > throws > > > > > an error when attempting to convert from the Rust type to the > > > Flatbuffer > > > > > field type. I figured I'd take a swing at adding that to > convert.rs > > > here: > > > > > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319 > > > > > > > > > > However, when I look at the definitions in Schema.fbs and the > related > > > > > generated Rust file, Dictionary isn't a type there. Should I be > sending > > > > > this down as some other composed type? And if so, how does this > look > > > at the > > > > > client side of things? In my test I'm connecting to the Flight > server > > > via > > > > > PyArrow and working with it in Pandas so I'm hoping that it will be > > > able to > > > > > consume Dictionary fields. > > > > > > > > > > Separately, the Rust field type doesn't have a spot for the > dictionary > > > ID, > > > > > which I assume I'll need to send down so it can be consumed on the > > > client. > > > > > Would appreciate any thoughts on that. A little push in the right > > > direction > > > > > and I'll be happy to submit a PR to help push the Rust Flight > > > > > implementation farther along. > > > > > > > > > > Thanks, > > > > > Paul > > > > > > > > > -- > > > > Paul Dix > > > > Founder & CTO • InfluxData > > > > m 646.283.6377 t @pauldix w influxdata.com > > <https://www.influxdata.com/> > -- Paul Dix Founder & CTO • InfluxData m 646.283.6377 t @pauldix w influxdata.com <https://www.influxdata.com/>