Gabriella Gyorgyevics has uploaded a new patch set (#36). ( http://gerrit.cloudera.org:8080/22165 )
Change subject: IMPALA-13648: Implement a decoder and an encoder for Byte Stream Split encoding
......................................................................

IMPALA-13648: Implement a decoder and an encoder for Byte Stream Split encoding

The decoder can read one or multiple values at a time from the given
buffer. When reading multiple values at a time, they can be read with
a stride.

The encoder adds values one by one until there are no more values to
add or the given output cannot fit any more. The encoding happens upon
calling `FinalizePage()`.

Both the encoder and the decoder can be used with either a template
size_t value or a value given in the constructor. This value is the
size, in bytes, of the type to be coded.
 * The template option is more optimized, but it only supports 4 and
   8 byte types.
 * The constructor option is less optimized, but it can receive any
   number as the byte size.
To use the number passed in the constructor, set the number passed in
the template to 0; otherwise pass the number of bytes in the template.

Note that neither the encoder nor the decoder is integrated with
Impala yet, so reading or writing data with byte stream split encoding
is not yet possible.

-------------------------------- Tests ---------------------------------
Created decoder tests for
 * basic functionality
 * decoding values one by one
 * decoding values in batch
 * decoding values combining the previous two
 * the stride feature
 * skipping a number of values

Created encoder tests for
 * basic functionality
 * putting values in one by one
 * finalizing the page

Created two-way tests for the following cases:
 * encoding then decoding one by one
 * encoding then decoding in batch
 * encoding then decoding with stride
 * decoding one by one then encoding
 * decoding in batch then encoding
 * decoding with stride then encoding

Each of these tests is run on a data set of up to 200 values.
These tests are run on every supported type.
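
For readers unfamiliar with the encoding, the byte layout and the stride
option described above can be sketched as follows. This is only an
illustration under stated assumptions, not the code added by this change:
the function names `ByteStreamSplitEncode` and `ByteStreamSplitDecode` are
hypothetical, the byte size is always passed at runtime, and the stride is
assumed to be the distance in bytes between consecutive decoded values in
the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, NOT the API added by this change.
// Scatters byte k of every value into stream k. The output holds
// byte_size streams of num_values bytes each, stored back to back.
std::vector<uint8_t> ByteStreamSplitEncode(
    const uint8_t* values, size_t num_values, size_t byte_size) {
  std::vector<uint8_t> out(num_values * byte_size);
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t k = 0; k < byte_size; ++k) {
      out[k * num_values + i] = values[i * byte_size + k];
    }
  }
  return out;
}

// Hypothetical helper, NOT the API added by this change.
// Gathers the bytes of each value back together. Value i is written at
// out + i * stride, so stride == byte_size yields a packed array and a
// larger stride leaves a gap after each value (assumed meaning of stride).
void ByteStreamSplitDecode(const uint8_t* encoded, size_t num_values,
    size_t byte_size, size_t stride, uint8_t* out) {
  assert(stride >= byte_size);
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t k = 0; k < byte_size; ++k) {
      out[i * stride + k] = encoded[k * num_values + i];
    }
  }
}
```

With two 4-byte values whose bytes are {1,2,3,4} and {5,6,7,8}, the encoded
buffer in this sketch is {1,5,2,6,3,7,4,8}. Decoding with a stride larger
than the byte size lets the values land directly in a field of a wider row
layout instead of a packed array.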

------------------------------ Benchmarks ------------------------------
For now, we have benchmarks for the following comparisons:
 * byte stream split functionality:
   * Compile VS Runtime initialized decoder
   * Float VS Int VS Double VS Long VS 6 and 11 byte size types
   * Repeating VS Sequential VS Random ordered data
   * Decoding one by one VS in batch VS with stride (!= byte_size)
   * Small VS Medium (10x small) VS Large (100x small) stride
 * byte stream split VS dictionary decoder
   * on same value repeating dataset
   * on few values repeating (shuffled) dataset
   * on sequential dataset
   * on random dataset
   * in batch
   * one by one
   * with stride (small, medium and large)

Planning to add:
 * Decoding with byte-stream-split VS delta
 * Decoding in batch
 * Read 10 values then skip 20
 * Different size strides
 * Different data types (Repeating, Sequential, Random)
 * Similar benchmarks for the encoder
 * Storage size comparison after encoding and compression

Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/parquet-byte-stream-split-benchmark.cc
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-byte-stream-split-coder-test-data.h
A be/src/exec/parquet/parquet-byte-stream-split-decoder.cc
A be/src/exec/parquet/parquet-byte-stream-split-decoder.h
A be/src/exec/parquet/parquet-byte-stream-split-encoder.cc
A be/src/exec/parquet/parquet-byte-stream-split-encoder.h
A be/src/exec/parquet/parquet-byte-stream-split-test.cc
9 files changed, 4,379 insertions(+), 0 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/22165/36
--
To view, visit http://gerrit.cloudera.org:8080/22165
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c
Gerrit-Change-Number: 22165
Gerrit-PatchSet: 36
Gerrit-Owner: Gabriella Gyorgyevics <ggyorgyev...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabriella Gyorgyevics <ggyorgyev...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>