Gabriella Gyorgyevics has uploaded a new patch set (#36). ( 
http://gerrit.cloudera.org:8080/22165 )

Change subject: IMPALA-13648: Implement a decoder and an encoder for Byte 
Stream Split encoding
......................................................................

IMPALA-13648: Implement a decoder and an encoder for Byte Stream Split encoding

The decoder can read one value or multiple values at a time from the
given buffer. When reading multiple values at a time, they can
optionally be read with a stride.
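
As an illustration of the layout the decoder has to undo, here is a minimal
sketch of strided batch decoding. The function name and signature are
hypothetical and not taken from the patch, and the sketch assumes the stride
is the distance in bytes between consecutive values in the output buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the patch's API. Decodes `num_values` values of
// `byte_size` bytes each from a byte-stream-split buffer.
// Assumption: `stride` is the byte distance between consecutive output values.
inline void DecodeByteStreamSplit(const uint8_t* encoded, size_t num_values,
    size_t byte_size, uint8_t* out, size_t stride) {
  assert(stride >= byte_size);
  for (size_t i = 0; i < num_values; ++i) {
    for (size_t b = 0; b < byte_size; ++b) {
      // Stream b holds byte b of every value and starts at b * num_values.
      out[i * stride + b] = encoded[b * num_values + i];
    }
  }
}
```

With stride == byte_size the output is tightly packed; a larger stride leaves
the bytes between values untouched.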

The encoder adds values one by one until there are no more values to
add or the given output buffer cannot fit any more. The actual encoding
happens when `FinalizePage()` is called.
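
The buffer-then-finalize flow can be sketched roughly as follows. Only the
name `FinalizePage()` comes from this change; the class name, `Put()`, the
constructor arguments and the return values are assumptions made for the
example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not the patch's class.
class SketchBssEncoder {
 public:
  SketchBssEncoder(size_t byte_size, size_t output_capacity)
    : byte_size_(byte_size), capacity_(output_capacity) {
    assert(byte_size > 0);
  }

  // Buffers one value. Returns false if the output could not hold it.
  bool Put(const void* value) {
    if (pending_.size() + byte_size_ > capacity_) return false;
    const uint8_t* p = static_cast<const uint8_t*>(value);
    pending_.insert(pending_.end(), p, p + byte_size_);
    return true;
  }

  // Scatters the buffered values into byte streams in `output`, which must
  // hold at least `output_capacity` bytes. Returns the bytes written.
  size_t FinalizePage(uint8_t* output) {
    const size_t n = pending_.size() / byte_size_;
    for (size_t i = 0; i < n; ++i) {
      for (size_t b = 0; b < byte_size_; ++b) {
        // Byte b of value i goes to position i of stream b.
        output[b * n + i] = pending_[i * byte_size_ + b];
      }
    }
    const size_t written = pending_.size();
    pending_.clear();
    return written;
  }

 private:
  size_t byte_size_;
  size_t capacity_;
  std::vector<uint8_t> pending_;
};
```

Deferring the scatter to `FinalizePage()` is natural here because the offset
of each byte stream depends on the final number of values in the page.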

Both the encoder and the decoder take the size in bytes of the type to
be encoded or decoded, either as a size_t template parameter or as a
constructor argument.
* The template option is more optimized, but it only supports 4 and 8
byte types.
* The constructor option is less optimized, but it can receive any
number as the byte size.
To use the size passed to the constructor, set the template parameter
to 0; otherwise pass the number of bytes as the template parameter.
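
The "0 means runtime size" convention could look roughly like this. The
class and member names are hypothetical, and the static_assert simply
mirrors the 4 and 8 byte restriction described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration, not the patch's class.
template <size_t T_SIZE>
class SketchCoder {
 public:
  static_assert(T_SIZE == 0 || T_SIZE == 4 || T_SIZE == 8,
      "template size must be 0 (runtime), 4 or 8");

  // With T_SIZE != 0 the argument may be omitted; with T_SIZE == 0 it is
  // the byte size to use.
  explicit SketchCoder(size_t runtime_size = T_SIZE)
    : runtime_size_(runtime_size) {
    assert(T_SIZE == 0 ? runtime_size > 0 : runtime_size == T_SIZE);
  }

  // A compile-time constant when T_SIZE != 0, so loops over it can be
  // unrolled by the compiler; otherwise the constructor-supplied value.
  size_t byte_size() const { return T_SIZE != 0 ? T_SIZE : runtime_size_; }

 private:
  size_t runtime_size_;
};
```

For example, `SketchCoder<4>` would cover FLOAT and INT32 sized values,
while `SketchCoder<0> coder(11)` would handle an 11 byte type.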

Note that neither the encoder nor the decoder is integrated with
Impala yet, so reading or writing data with byte stream split encoding
is not yet possible.

-------------------------------- Tests ---------------------------------

Created decoder tests for
* basic functionality
* decoding values one by one
* decoding values in batch
* decoding values combining the previous two
* the stride feature
* skipping a number of values

Created encoder tests for
* basic functionality
* putting values in one by one
* finalizing the page

Created two-way tests for the following cases:
* encoding then decoding one by one
* encoding then decoding in batch
* encoding then decoding with stride
* decoding one by one then encoding
* decoding in batch then encoding
* decoding with stride then encoding

Each of these tests is run on a data set of up to 200
values.

These tests are run on every supported type.

------------------------------ Benchmarks ------------------------------

For now, we have benchmarks for the following comparisons:
* byte stream split functionality:
  * Compile-time VS runtime initialized decoder
  * Float VS Int VS Double VS Long VS 6 and 11 byte size types
  * Repeating VS Sequential VS Random ordered data
  * Decoding one by one VS in batch VS with stride (!= byte_size)
  * Small VS Medium (10x small) VS Large (100x small) stride
* byte stream split VS dictionary decoder
  * on same value repeating dataset
  * on few values repeating (shuffled) dataset
  * on sequential dataset
  * on random dataset
  * in batch
  * one by one
  * with stride (small, medium and large)
Planning to add:
* Decoding with byte-stream-split VS delta
  * Decoding in batch
  * Read 10 values then skip 20
  * Different size strides
  * Different data types (Repeating, Sequential, Random)
* Similar benchmarks for the encoder
* Storage size comparison after encoding and compression

Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/parquet-byte-stream-split-benchmark.cc
M be/src/exec/parquet/CMakeLists.txt
A be/src/exec/parquet/parquet-byte-stream-split-coder-test-data.h
A be/src/exec/parquet/parquet-byte-stream-split-decoder.cc
A be/src/exec/parquet/parquet-byte-stream-split-decoder.h
A be/src/exec/parquet/parquet-byte-stream-split-encoder.cc
A be/src/exec/parquet/parquet-byte-stream-split-encoder.h
A be/src/exec/parquet/parquet-byte-stream-split-test.cc
9 files changed, 4,379 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/22165/36
--
To view, visit http://gerrit.cloudera.org:8080/22165
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c
Gerrit-Change-Number: 22165
Gerrit-PatchSet: 36
Gerrit-Owner: Gabriella Gyorgyevics <ggyorgyev...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabriella Gyorgyevics <ggyorgyev...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
