Ivan Sadikov created ARROW-5129:
-----------------------------------

             Summary: Column writer bug: check dictionary encoder when adding a new data page
                 Key: ARROW-5129
                 URL: https://issues.apache.org/jira/browse/ARROW-5129
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
         Environment: N/A
            Reporter: Ivan Sadikov
As part of my weekly routine, I glanced over the code in the Parquet column writer and found that the check deciding when to add a new data page is buggy. The idea is to look at the current encoder and decide whether enough bytes have been written to construct a page. The problem is that we only check the value encoder, regardless of whether the dictionary encoder is enabled. Here is how we do it now: the actual check (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378) and the buggy function (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).

In the case of a sparse column with the dictionary encoder enabled, we would write a single data page even though the dictionary encoder has accumulated enough bytes for more than one page (the value encoder stays empty, so its estimated size is always below the page size limit). I had forgotten that parquet-cpp keeps `current_encoder` as either the value encoder or the dictionary encoder (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544), but in parquet-rs we keep them separate.

So the fix could be something like this:

{code}
/// Returns true if there is enough data to build a data page, false otherwise.
#[inline]
fn should_add_data_page(&self) -> bool {
    match self.dict_encoder {
        Some(ref encoder) => {
            encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
        }
        None => {
            self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
        }
    }
}
{code}
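For illustration, here is a minimal, self-contained sketch of the problem. The {{FakeEncoder}}/{{FakeWriter}} types and method names below are hypothetical stand-ins, not the parquet-rs API; they only mirror the shape of the code linked above to contrast the current check with the proposed one for a dictionary-encoded column:

{code}
// Hypothetical stand-in for an encoder that tracks how many bytes it has buffered.
struct FakeEncoder {
    buffered: usize,
}

impl FakeEncoder {
    fn estimated_data_encoded_size(&self) -> usize {
        self.buffered
    }
}

// Hypothetical stand-in for the column writer: a dictionary encoder (when
// dictionary encoding is enabled) plus a plain value encoder, as in parquet-rs.
struct FakeWriter {
    dict_encoder: Option<FakeEncoder>,
    encoder: FakeEncoder,
    data_pagesize_limit: usize,
}

impl FakeWriter {
    // Current behaviour: only the value encoder is consulted, but with dictionary
    // encoding enabled it stays empty, so this never returns true.
    fn should_add_data_page_buggy(&self) -> bool {
        self.encoder.estimated_data_encoded_size() >= self.data_pagesize_limit
    }

    // Proposed behaviour: consult whichever encoder is actually buffering values.
    fn should_add_data_page_fixed(&self) -> bool {
        match self.dict_encoder {
            Some(ref encoder) => {
                encoder.estimated_data_encoded_size() >= self.data_pagesize_limit
            }
            None => self.encoder.estimated_data_encoded_size() >= self.data_pagesize_limit,
        }
    }
}

fn main() {
    // Dictionary encoding enabled: the dictionary encoder has buffered well past
    // the page size limit while the value encoder holds nothing.
    let writer = FakeWriter {
        dict_encoder: Some(FakeEncoder { buffered: 3 * 1024 * 1024 }),
        encoder: FakeEncoder { buffered: 0 },
        data_pagesize_limit: 1024 * 1024,
    };

    assert!(!writer.should_add_data_page_buggy()); // page is never flushed
    assert!(writer.should_add_data_page_fixed());  // page is flushed as expected
}
{code}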