These encodings are not yet available for use in the Parquet C++ library. They are partially implemented but not thoroughly tested or exposed in the public API, so it's not possible to generate them from Python. I don't know about Java; you may want to ask on the Parquet mailing list.
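
On the second question quoted below (which delta encoding suits which data), here is a rough sketch in plain Python of the idea behind DELTA_BINARY_PACKED as described in Encodings.md: store the first value, then the differences between neighbours with the smallest difference subtracted out, so only a few bits per value are needed. It does not use any Parquet library and does not reproduce the real on-disk layout (blocks, miniblocks, varints); the function names `delta_encode` and `delta_decode` are mine, purely for illustration.

```python
from datetime import date


def delta_encode(values):
    """Return (first, min_delta, adjusted_deltas, bit_width) for a list of ints.

    Simplified: treats the whole column as one block, which the real
    format does not do.
    """
    if not values:
        raise ValueError("need at least one value")
    first = values[0]
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return first, 0, [], 0
    # Subtracting the minimum makes every stored delta non-negative.
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    # Bits needed to store the largest adjusted delta.
    bit_width = max(adjusted).bit_length()
    return first, min_delta, adjusted, bit_width


def delta_decode(first, min_delta, adjusted):
    """Rebuild the original values from the output of delta_encode."""
    out = [first]
    for a in adjusted:
        out.append(out[-1] + a + min_delta)
    return out


# The ten consecutive dates from the question, as days since 1970-01-01
# (the same representation an INT32 DATE column uses).
days = [(date(2018, 5, d) - date(1970, 1, 1)).days for d in range(1, 11)]
first, min_delta, adjusted, bit_width = delta_encode(days)
print(first, min_delta, adjusted, bit_width)
```

For evenly spaced values like these, every adjusted delta is zero, so the column needs essentially no bits per value beyond the header. That is why this encoding is aimed at integer columns such as sorted dates or counters, while DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY are defined for byte-array (string) columns.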

On Mon, Mar 23, 2020 at 2:30 AM Omega Gamage <[email protected]> wrote:
>
> I was trying to write a parquet file with delta encoding. This page
> <https://github.com/apache/parquet-format/blob/master/Encodings.md>, states
> that parquet supports three types of delta encodings:
>
> (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).
>
> Since spark, pyspark or pyarrow does not allow us to specify the encoding
> method. I was curious how one can write a file with delta encoding enabled?
>
> However, I found on the internet that, if I have columns with TimeStamp
> type parquet will use delta encoding. So I used the following code in
> *Scala* to create a parquet file. But encoding is not a delta.
>
>
> val df = Seq(("2018-05-01"),
>              ("2018-05-02"),
>              ("2018-05-03"),
>              ("2018-05-04"),
>              ("2018-05-05"),
>              ("2018-05-06"),
>              ("2018-05-07"),
>              ("2018-05-08"),
>              ("2018-05-09"),
>              ("2018-05-10")
>             ).toDF("Id")
> val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
> val df3 = df2.withColumn("Date", (col("Id").cast("date")))
>
> df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")
>
> parquet-tools shows the following information regarding the written parquet
> file.
>
> file schema: spark_schema
> --------------------------------------------------------------------------------
> Id:        OPTIONAL BINARY L:STRING R:0 D:1
> Timestamp: OPTIONAL INT96 R:0 D:1
> Date:      OPTIONAL INT32 L:DATE R:0 D:1
>
> row group 1: RC:31 TS:1100 OFFSET:4
> --------------------------------------------------------------------------------
> Id:        BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31
>            ENC:RLE,PLAIN,BIT_PACKED
>            ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
> Timestamp: INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31
>            ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY
>            ST:[num_nulls: 0, min/max not defined]
> Date:      INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31
>            ENC:RLE,PLAIN,BIT_PACKED
>            ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
>
> As you can see, no column has used delta encoding.
>
> My question is,
>
> 1) How can I write a parquet file with delta encoding? (If you can provide
> an example code in scala or python that would be great.)
> 2) How to decide which "delta encoding": (DELTA_BINARY_PACKED,
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?
