I'd be happy to pitch in on getting the integration tests developed. It
would certainly beat my current method of building and running my test
project and switching over to a Jupyter notebook to manually check it.

Is there any prior work in the Rust project that I could basically copy
from? Or perhaps the C++ implementation?

On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Paul,
>
> Dictionary-encoded is not a nested type, so there shouldn't be any
> children -- the IPC layout of a dictionary encoded field is that same
> as the type of the indices (probably want to change the terminology in
> the Rust library from "keys" to "indices" which is what's used in the
> specification). And the dictionaries are sent separately in record
> batches
>
> > The Rust implementation of DictionaryArray keeps the values as a child
> to the keys array.
>
> What we do in C++ is that the dictionary is a separate member of the
> ArrayData data structure. I don't know enough about the Rust library
> to know whether that's the right design choice or not.
>
> Implementing Rust support for integration tests would make it much
> easier to validate that things are implemented correctly. Does anyone
> know how far away the Rust library might be from having them? It's not
> the most fun thing to implement, but it's very important.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <p...@influxdata.com> wrote:
> >
> > I managed to get something up and running. I ended up creating a
> > dictionary_batch.rs and adding that to convert.rs to translate
> dictionary
> > fields in a schema over to the correct fb thing. I also added a method to
> > writer.rs to convert that to bytes so it can be sent via ipc. However,
> when
> > writing out the subsequent record batches that contain a dictionary field
> > that references the dictionary sent in the dictionary batch, it runs
> into a
> > problem. The Rust implementation of DictionaryArray keeps the values as a
> > child to the keys array.
> >
> > You can see the chain of calls when calling finish on the
> > StringDictionaryBuilder
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > which calls out to finish the dictionary
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > which then calls from to create the DictionaryArray
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > And that last part validates that the child is there.
> >
> > And here in the writer is where it blindly writes out any children:
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> >
> > I found that when I sent this out, the Python reader on the other side
> > would choke on it when I tried to read_pandas. However, if I skip
> > serializing the children for DictionaryArray only, then everything works
> > exactly as expected.
> >
> > So I'm not sure what the right thing here is because I don't know if
> > DictionaryArrays should not include the children (i.e. values) when being
> > written out or if there's some other thing that should be happening.
> Should
> > the raw arrow data have the values in that dictionary? Or does it get
> > reconstituted manually on the other side from the DictionaryBatch?
> >
> > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > As another item for consideration -- in C++ at least, the dictionary
> > > id is dealt with as an internal detail of the IPC message production
> > > process. When serializing the Schema, id's are assigned to each
> > > dictionary-encoded field in the DictionaryMemo object, see
> > >
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > >
> > > When record batches are reconstructed, the dictionary corresponding to
> > > an id at the time of reconstruction is set in the Array's internal
> > > data -- that's the "dictionary" member of the ArrayData object
> > > (
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > >
> > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > >
> > > > hey Paul,
> > > >
> > > > Take a look at how dictionaries work in the IPC protocol
> > > >
> > > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > >
> > > > Dictionaries are sent as separate messages. When a field is tagged as
> > > > dictionary encoded in the schema, the IPC reader must keep track of
> > > > the dictionaries it's seen come across the protocol and then set them
> > > > in the reconstructed record batches when a record batch comes
> through.
> > > >
> > > > Note that the protocol now supports dictionary deltas (dictionaries
> > > > can be appended to by subsequent messages for the same dictionary id)
> > > > and replacements (new dictionary for an id).
> > > >
> > > > I don't know what the status of handling dictionaries in the Rust
> IPC,
> > > > but it would be a good idea to take time to take into account the
> > > > above details.
> > > >
> > > > Finally, note that Rust is not participating in either the regular
> IPC
> > > > nor Flight integration tests. This is an important milestone to being
> > > > able to depend on the Rust library in production.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com>
> wrote:
> > > > >
> > > > > Hello,
> > > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > > Dictionary encoding for a number of string columns in my data. I've
> > > seen
> > > > > that StringDictionary was recently added to Rust here:
> > > > >
> > >
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > > .
> > > > >
> > > > > However, that doesn't seem to reach down into Flight. When I
> attempt to
> > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > > throws
> > > > > an error when attempting to convert from the Rust type to the
> > > Flatbuffer
> > > > > field type. I figured I'd take a swing at adding that to
> convert.rs
> > > here:
> > > > >
> > >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > > >
> > > > > However, when I look at the definitions in Schema.fbs and the
> related
> > > > > generated Rust file, Dictionary isn't a type there. Should I be
> sending
> > > > > this down as some other composed type? And if so, how does this
> look
> > > at the
> > > > > client side of things? In my test I'm connecting to the Flight
> server
> > > via
> > > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > > able to
> > > > > consume Dictionary fields.
> > > > >
> > > > > Separately, the Rust field type doesn't have a spot for the
> dictionary
> > > ID,
> > > > > which I assume I'll need to send down so it can be consumed on the
> > > client.
> > > > > Would appreciate any thoughts on that. A little push in the right
> > > direction
> > > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > > implementation farther along.
> > > > >
> > > > > Thanks,
> > > > > Paul
> > >
> >
> >
> > --
> >
> > Paul Dix
> >
> > Founder & CTO • InfluxData
> >
> > m 646.283.6377     t @pauldix     w influxdata.com
> > <https://www.influxdata.com/>
>


-- 

Paul Dix

Founder & CTO • InfluxData

m 646.283.6377     t @pauldix     w influxdata.com
<https://www.influxdata.com/>

Reply via email to