Well, luckily we have some newly spruced up documentation about how
integration testing works (thanks Neal!)

https://github.com/apache/arrow/blob/master/docs/source/format/Integration.rst

The main task is writing a parser for the JSON format used for
integration testing. The JSON is used to communicate the "truth" about
the data between two implementations. There are implementations of
this in C++, Go, JS, and Java, so any of those implementations could
be used as your guide.

One small reminder for others reading: the current JSON integration
testing format doesn't yet capture dictionary deltas/replacements. See
ARROW-5338

On Thu, Apr 9, 2020 at 5:55 PM Paul Dix <p...@influxdata.com> wrote:
>
> I'd be happy to pitch in on getting the integration tests developed. It
> would certainly beat my current method of building and running my test
> project and switching over to a Jupyter notebook to manually check it.
>
> Is there any prior work in the Rust project that I could basically copy
> from? Or perhaps the C++ implementation?
>
> On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Paul,
> >
> > Dictionary-encoded is not a nested type, so there shouldn't be any
> > children -- the IPC layout of a dictionary encoded field is that same
> > as the type of the indices (probably want to change the terminology in
> > the Rust library from "keys" to "indices" which is what's used in the
> > specification). And the dictionaries are sent separately in record
> > batches
> >
> > > The Rust implementation of DictionaryArray keeps the values as a child
> > to the keys array.
> >
> > What we do in C++ is that the dictionary is a separate member of the
> > ArrayData data structure. I don't know enough about the Rust library
> > to know whether that's the right design choice or not.
> >
> > Implementing Rust support for integration tests would make it much
> > easier to validate that things are implemented correctly. Does anyone
> > know how far away the Rust library might be from having them? It's not
> > the most fun thing to implement, but it's very important.
> >
> > - Wes
> >
> > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <p...@influxdata.com> wrote:
> > >
> > > I managed to get something up and running. I ended up creating a
> > > dictionary_batch.rs and adding that to convert.rs to translate
> > dictionary
> > > fields in a schema over to the correct fb thing. I also added a method to
> > > writer.rs to convert that to bytes so it can be sent via ipc. However,
> > when
> > > writing out the subsequent record batches that contain a dictionary field
> > > that references the dictionary sent in the dictionary batch, it runs
> > into a
> > > problem. The Rust implementation of DictionaryArray keeps the values as a
> > > child to the keys array.
> > >
> > > You can see the chain of calls when calling finish on the
> > > StringDictionaryBuilder
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > > which calls out to finish the dictionary
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > > which then calls from to create the DictionaryArray
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > > And that last part validates that the child is there.
> > >
> > > And here in the writer is where it blindly writes out any children:
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> > >
> > > I found that when I sent this out, the Python reader on the other side
> > > would choke on it when I tried to read_pandas. However, if I skip
> > > serializing the children for DictionaryArray only, then everything works
> > > exactly as expected.
> > >
> > > So I'm not sure what the right thing here is because I don't know if
> > > DictionaryArrays should not include the children (i.e. values) when being
> > > written out or if there's some other thing that should be happening.
> > Should
> > > the raw arrow data have the values in that dictionary? Or does it get
> > > reconstituted manually on the other side from the DictionaryBatch?
> > >
> > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > As another item for consideration -- in C++ at least, the dictionary
> > > > id is dealt with as an internal detail of the IPC message production
> > > > process. When serializing the Schema, id's are assigned to each
> > > > dictionary-encoded field in the DictionaryMemo object, see
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > > >
> > > > When record batches are reconstructed, the dictionary corresponding to
> > > > an id at the time of reconstruction is set in the Array's internal
> > > > data -- that's the "dictionary" member of the ArrayData object
> > > > (
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > > >
> > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > > >
> > > > > hey Paul,
> > > > >
> > > > > Take a look at how dictionaries work in the IPC protocol
> > > > >
> > > > >
> > > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > > >
> > > > > Dictionaries are sent as separate messages. When a field is tagged as
> > > > > dictionary encoded in the schema, the IPC reader must keep track of
> > > > > the dictionaries it's seen come across the protocol and then set them
> > > > > in the reconstructed record batches when a record batch comes
> > through.
> > > > >
> > > > > Note that the protocol now supports dictionary deltas (dictionaries
> > > > > can be appended to by subsequent messages for the same dictionary id)
> > > > > and replacements (new dictionary for an id).
> > > > >
> > > > > I don't know what the status of handling dictionaries in the Rust
> > IPC,
> > > > > but it would be a good idea to take time to take into account the
> > > > > above details.
> > > > >
> > > > > Finally, note that Rust is not participating in either the regular
> > IPC
> > > > > nor Flight integration tests. This is an important milestone to being
> > > > > able to depend on the Rust library in production.
> > > > >
> > > > > Thanks
> > > > > Wes
> > > > >
> > > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com>
> > wrote:
> > > > > >
> > > > > > Hello,
> > > > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > > > Dictionary encoding for a number of string columns in my data. I've
> > > > seen
> > > > > > that StringDictionary was recently added to Rust here:
> > > > > >
> > > >
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > > > .
> > > > > >
> > > > > > However, that doesn't seem to reach down into Flight. When I
> > attempt to
> > > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > > > throws
> > > > > > an error when attempting to convert from the Rust type to the
> > > > Flatbuffer
> > > > > > field type. I figured I'd take a swing at adding that to
> > convert.rs
> > > > here:
> > > > > >
> > > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > > > >
> > > > > > However, when I look at the definitions in Schema.fbs and the
> > related
> > > > > > generated Rust file, Dictionary isn't a type there. Should I be
> > sending
> > > > > > this down as some other composed type? And if so, how does this
> > look
> > > > at the
> > > > > > client side of things? In my test I'm connecting to the Flight
> > server
> > > > via
> > > > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > > > able to
> > > > > > consume Dictionary fields.
> > > > > >
> > > > > > Separately, the Rust field type doesn't have a spot for the
> > dictionary
> > > > ID,
> > > > > > which I assume I'll need to send down so it can be consumed on the
> > > > client.
> > > > > > Would appreciate any thoughts on that. A little push in the right
> > > > direction
> > > > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > > > implementation farther along.
> > > > > >
> > > > > > Thanks,
> > > > > > Paul
> > > >
> > >
> > >
> > > --
> > >
> > > Paul Dix
> > >
> > > Founder & CTO • InfluxData
> > >
> > > m 646.283.6377     t @pauldix     w influxdata.com
> > > <https://www.influxdata.com/>
> >
>
>
> --
>
> Paul Dix
>
> Founder & CTO • InfluxData
>
> m 646.283.6377     t @pauldix     w influxdata.com
> <https://www.influxdata.com/>

Reply via email to