Jörn Horstmann created ARROW-8791:
-------------------------------------

             Summary: [RUST] Creating StringDictionaryBuilder with existing 
dictionary values
                 Key: ARROW-8791
                 URL: https://issues.apache.org/jira/browse/ARROW-8791
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
            Reporter: Jörn Horstmann


It might be useful to create a DictionaryArray that uses the same dictionary 
values, and therefore the same keys, as another array. One use case would be 
more efficient comparison between arrays when it is known that they use the 
same dictionary. Another could be more efficient grouping operations across 
multiple chunks (i.e. a `Vec<DictionaryArray>`).
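
To illustrate both use cases, here is a minimal std-only sketch. `DictChunk`, `eq_shared` and `count_by_key` are hypothetical names for illustration and are not part of the arrow crate. The sketch assumes all chunks reference the same dictionary slice, so comparison and grouping only need the integer keys.

```rust
/// Simplified stand-in for a dictionary-encoded string array.
/// This is a model for illustration, not the arrow crate's DictionaryArray.
struct DictChunk<'a> {
    dictionary: &'a [&'a str], // shared dictionary values
    keys: Vec<u32>,            // one index into `dictionary` per row
}

/// Element-wise equality that only compares the integer keys.
/// Only valid when both chunks use the very same dictionary.
fn eq_shared(left: &DictChunk, right: &DictChunk) -> Vec<bool> {
    assert!(
        std::ptr::eq(left.dictionary, right.dictionary),
        "chunks must share one dictionary"
    );
    left.keys
        .iter()
        .zip(&right.keys)
        .map(|(l, r)| l == r)
        .collect()
}

/// Counts rows per dictionary key across several chunks.
/// No string is hashed or compared; keys index directly into `counts`.
fn count_by_key(chunks: &[DictChunk]) -> Vec<usize> {
    let dict_len = chunks.first().map_or(0, |c| c.dictionary.len());
    let mut counts = vec![0usize; dict_len];
    for chunk in chunks {
        for &k in &chunk.keys {
            counts[k as usize] += 1;
        }
    }
    counts
}
```

Without a shared dictionary, both operations would have to look up and compare the string values themselves, because equal keys in different arrays would not imply equal values.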

 

A possible implementation could look like this:

 
{code:java}
impl<K> StringDictionaryBuilder<K>
where
    K: ArrowDictionaryKeyType,
{
    pub fn new_with_dictionary(
        keys_builder: PrimitiveBuilder<K>,
        dictionary_values: &StringArray,
    ) -> Result<Self> {
        // Pre-size the values builder for the existing dictionary entries.
        let mut values_builder = StringBuilder::with_capacity(
            dictionary_values.len(),
            dictionary_values.value_data().len(),
        );
        // Rebuild the value-to-key lookup so existing values keep their keys.
        let mut map: HashMap<Box<[u8]>, K::Native> =
            HashMap::with_capacity(dictionary_values.len());
        for i in 0..dictionary_values.len() {
            if dictionary_values.is_valid(i) {
                let value = dictionary_values.value(i);
                map.insert(
                    value.as_bytes().into(),
                    K::Native::from_usize(i)
                        .ok_or(ArrowError::DictionaryKeyOverflowError)?,
                );
                // Propagate builder errors instead of dropping the Result.
                values_builder.append_value(value)?;
            } else {
                values_builder.append_null()?;
            }
        }
        Ok(Self {
            keys_builder,
            values_builder,
            map,
        })
    }
}{code}
What I don't really like about this implementation is that the map has to be 
reconstructed. Passing in the HashMap directly might be more efficient, but it 
is probably not a good idea to expose the `Box<[u8]>` encoding of its keys.
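
One way to avoid exposing the key encoding would be an opaque handle around the map. The following is only a std-only sketch of that idea. `DictionaryLookup`, `from_values` and `key_of` are hypothetical names, not existing arrow APIs, and the key type is fixed to `u32` for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical opaque handle around the value-to-key map.
/// The `Box<[u8]>` key encoding stays private, so it could change later
/// without breaking callers that only pass the handle around.
pub struct DictionaryLookup {
    map: HashMap<Box<[u8]>, u32>,
}

impl DictionaryLookup {
    /// Builds the lookup once; `values[i]` is assigned key `i`.
    /// Assumes fewer than `u32::MAX` values (no overflow check in this sketch).
    pub fn from_values(values: &[&str]) -> Self {
        let mut map: HashMap<Box<[u8]>, u32> = HashMap::with_capacity(values.len());
        for (i, v) in values.iter().enumerate() {
            map.insert(v.as_bytes().into(), i as u32);
        }
        Self { map }
    }

    /// Returns the key of `value`, if it is part of the dictionary.
    pub fn key_of(&self, value: &str) -> Option<u32> {
        self.map.get(value.as_bytes()).copied()
    }
}
```

A builder could then accept such a handle by value and return it again when finished, so the map would only be built once for a whole series of chunks.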



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
