Ivan Sadikov created ARROW-5129:
-----------------------------------

             Summary: Column writer bug: check dictionary encoder when adding a new data page
                 Key: ARROW-5129
                 URL: https://issues.apache.org/jira/browse/ARROW-5129
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
         Environment: N/A
            Reporter: Ivan Sadikov


As part of my weekly routine, I glanced over the code in the Parquet column writer and found that the way we decide when to add a new data page is buggy. The idea is to check the current encoder and decide whether enough bytes have been written to construct a page. The problem is that we only check the value encoder, regardless of whether the dictionary encoder is enabled.

Here is how we do it now: the actual check (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378) and the buggy function (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).
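
For reference, the current check looks roughly like this (paraphrased from the linked writer.rs lines; only the value encoder is consulted):
{code}
/// Returns true if there is enough data for a data page, false otherwise.
#[inline]
fn should_add_data_page(&self) -> bool {
  // Only the value encoder is checked, even when the dictionary encoder is enabled.
  self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
}
{code}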

In the case of a sparse column with the dictionary encoder enabled, we would write a single data page, even though the dictionary encoder may have accumulated enough bytes for more than one page (the value encoder stays empty, so its estimated size is always below the page size limit).
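
To illustrate (all numbers hypothetical; a 1 MiB page size limit assumed as a typical default):
{code}
fn main() {
  // The dictionary encoder has buffered well over a page worth of data, while
  // the value encoder is empty because dictionary encoding is enabled.
  let dict_encoder_size: usize = 3_000_000;
  let value_encoder_size: usize = 0;
  let page_size_limit: usize = 1_048_576;

  // Current check: only the value encoder is consulted, so a new page is never added.
  assert!(value_encoder_size < page_size_limit);

  // Fixed check: consult the dictionary encoder when it is active.
  assert!(dict_encoder_size >= page_size_limit);
}
{code}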

I forgot that parquet-cpp keeps `current_encoder` pointing at either the value encoder or the dictionary encoder (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544), but in parquet-rs we keep them separate.

So the fix could be something like this:
{code}
/// Returns true if there is enough data for a data page, false otherwise.
#[inline]
fn should_add_data_page(&self) -> bool {
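  // When dictionary encoding is enabled, values are buffered in the dictionary
  // encoder, so its size is the one to compare against the page size limit.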
  match self.dict_encoder {
    Some(ref encoder) => {
      encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
    },
    None => {
      self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
    }
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
