Well, luckily we have some newly spruced up documentation about how integration testing works (thanks Neal!)
https://github.com/apache/arrow/blob/master/docs/source/format/Integration.rst The main task is writing a parser for the JSON format used for integration testing. The JSON is used to communicate the "truth" about the data between two implementations. There are implementations of this in C++, Go, JS, and Java, so any of those implementations could be used as your guide. One small reminder for others reading: the current JSON integration testing format doesn't yet capture dictionary deltas/replacements. See ARROW-5338 On Thu, Apr 9, 2020 at 5:55 PM Paul Dix <p...@influxdata.com> wrote: > > I'd be happy to pitch in on getting the integration tests developed. It > would certainly beat my current method of building and running my test > project and switching over to a Jupyter notebook to manually check it. > > Is there any prior work in the Rust project that I could basically copy > from? Or perhaps the C++ implementation? > > On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > hi Paul, > > > > Dictionary-encoded is not a nested type, so there shouldn't be any > > children -- the IPC layout of a dictionary encoded field is that same > > as the type of the indices (probably want to change the terminology in > > the Rust library from "keys" to "indices" which is what's used in the > > specification). And the dictionaries are sent separately in record > > batches > > > > > The Rust implementation of DictionaryArray keeps the values as a child > > to the keys array. > > > > What we do in C++ is that the dictionary is a separate member of the > > ArrayData data structure. I don't know enough about the Rust library > > to know whether that's the right design choice or not. > > > > Implementing Rust support for integration tests would make it much > > easier to validate that things are implemented correctly. Does anyone > > know how far away the Rust library might be from having them? It's not > > the most fun thing to implement, but it's very important. > > > > - Wes > > > > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <p...@influxdata.com> wrote: > > > > > > I managed to get something up and running. I ended up creating a > > > dictionary_batch.rs and adding that to convert.rs to translate > > dictionary > > > fields in a schema over to the correct fb thing. I also added a method to > > > writer.rs to convert that to bytes so it can be sent via ipc. However, > > when > > > writing out the subsequent record batches that contain a dictionary field > > > that references the dictionary sent in the dictionary batch, it runs > > into a > > > problem. The Rust implementation of DictionaryArray keeps the values as a > > > child to the keys array. > > > > > > You can see the chain of calls when calling finish on the > > > StringDictionaryBuilder > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316 > > > which calls out to finish the dictionary > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482 > > > which then calls from to create the DictionaryArray > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927 > > > And that last part validates that the child is there. > > > > > > And here in the writer is where it blindly writes out any children: > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378 > > > > > > I found that when I sent this out, the Python reader on the other side > > > would choke on it when I tried to read_pandas. However, if I skip > > > serializing the children for DictionaryArray only, then everything works > > > exactly as expected. > > > > > > So I'm not sure what the right thing here is because I don't know if > > > DictionaryArrays should not include the children (i.e. values) when being > > > written out or if there's some other thing that should be happening. > > Should > > > the raw arrow data have the values in that dictionary? Or does it get > > > reconstituted manually on the other side from the DictionaryBatch? > > > > > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > As another item for consideration -- in C++ at least, the dictionary > > > > id is dealt with as an internal detail of the IPC message production > > > > process. When serializing the Schema, id's are assigned to each > > > > dictionary-encoded field in the DictionaryMemo object, see > > > > > > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h > > > > > > > > When record batches are reconstructed, the dictionary corresponding to > > > > an id at the time of reconstruction is set in the Array's internal > > > > data -- that's the "dictionary" member of the ArrayData object > > > > ( > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231). > > > > > > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > > > > hey Paul, > > > > > > > > > > Take a look at how dictionaries work in the IPC protocol > > > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc > > > > > > > > > > Dictionaries are sent as separate messages. When a field is tagged as > > > > > dictionary encoded in the schema, the IPC reader must keep track of > > > > > the dictionaries it's seen come across the protocol and then set them > > > > > in the reconstructed record batches when a record batch comes > > through. > > > > > > > > > > Note that the protocol now supports dictionary deltas (dictionaries > > > > > can be appended to by subsequent messages for the same dictionary id) > > > > > and replacements (new dictionary for an id). > > > > > > > > > > I don't know what the status of handling dictionaries in the Rust > > IPC, > > > > > but it would be a good idea to take time to take into account the > > > > > above details. > > > > > > > > > > Finally, note that Rust is not participating in either the regular > > IPC > > > > > nor Flight integration tests. This is an important milestone to being > > > > > able to depend on the Rust library in production. > > > > > > > > > > Thanks > > > > > Wes > > > > > > > > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com> > > wrote: > > > > > > > > > > > > Hello, > > > > > > I'm trying to build a Rust based Flight server and I'd like to use > > > > > > Dictionary encoding for a number of string columns in my data. I've > > > > seen > > > > > > that StringDictionary was recently added to Rust here: > > > > > > > > > > > > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab > > > > . > > > > > > > > > > > > However, that doesn't seem to reach down into Flight. When I > > attempt to > > > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it > > > > throws > > > > > > an error when attempting to convert from the Rust type to the > > > > Flatbuffer > > > > > > field type. I figured I'd take a swing at adding that to > > convert.rs > > > > here: > > > > > > > > > > > > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319 > > > > > > > > > > > > However, when I look at the definitions in Schema.fbs and the > > related > > > > > > generated Rust file, Dictionary isn't a type there. Should I be > > sending > > > > > > this down as some other composed type? And if so, how does this > > look > > > > at the > > > > > > client side of things? In my test I'm connecting to the Flight > > server > > > > via > > > > > > PyArrow and working with it in Pandas so I'm hoping that it will be > > > > able to > > > > > > consume Dictionary fields. > > > > > > > > > > > > Separately, the Rust field type doesn't have a spot for the > > dictionary > > > > ID, > > > > > > which I assume I'll need to send down so it can be consumed on the > > > > client. > > > > > > Would appreciate any thoughts on that. A little push in the right > > > > direction > > > > > > and I'll be happy to submit a PR to help push the Rust Flight > > > > > > implementation farther along. > > > > > > > > > > > > Thanks, > > > > > > Paul > > > > > > > > > > > > > -- > > > > > > Paul Dix > > > > > > Founder & CTO • InfluxData > > > > > > m 646.283.6377 t @pauldix w influxdata.com > > > <https://www.influxdata.com/> > > > > > -- > > Paul Dix > > Founder & CTO • InfluxData > > m 646.283.6377 t @pauldix w influxdata.com > <https://www.influxdata.com/>