Hive 4.x and ORC 1.6 column encryption

Maruthi Inukonda Mon, 08 May 2023 17:45:36 -0700

hello Hive developer/users,

We have a SOC (security ops center) that uses Hadoop (3.3.1) and Hive
(3.1.2) community versions on ubuntu 20.04. We use hive external tables
with ORC files stored in HDFS. We want to use the columnar encryption
feature available in ORC 1.6. (https://issues.apache.org/jira/browse/ORC-14)
I figured out that as of today, Hive 3.1.3 ships ORC 1.5.8, and Hive
4.0.0-alpha2 ships ORC 1.6.9.  The ORC files created by hive external
tables are in 0.12 and ORC_517


$ hive --orcfiledump <whateverpath>/000000_0
..
File Version: 0.12 with ORC_517

I am trying Hive 4.0.0-alpha2 in a staging environment to test column
encryption using Hadoop KMS. Used these commands, but the data is still in
plain text.

I created a db siem_p_csv having tables in plain text files (csv), and
another db siem_e_orc db to have tables in encrypted form (orc).

0: jdbc:hive2://> create external table siem_p_csv.mytable (a BIGINT, b
STRING, c INT) PARTITIONED BY (year INT, month INT, day INT, hour INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE  LOCATION
'/test/mytable/p/csv';

0: jdbc:hive2://> ALTER TABLE siem_p_csv.mytable ADD IF NOT EXISTS
PARTITION (year=2023, month=5, day=5, hour=5) LOCATION
'/test/mytable/p/csv/year=2023/month=5/day=5/hour=5';

0: jdbc:hive2://> create external table siem_e_orc.mytable (a BIGINT, b
STRING, c INT) PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS ORC LOCATION '/test/mytable/e/orc'
TBLPROPERTIES('hadoop.security.key.provider.path' =
"kms://http@hadoop-masternn0a:9600/kms", 'orc.key.provider' = "hadoop",
'orc.encrypt.pii' = "b");

0: jdbc:hive2://> SET hive.exec.dynamic.partition=true;
0: jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict;

0: jdbc:hive2://> INSERT OVERWRITE TABLE siem_e_orc.mytable PARTITION
(year, month, day, hour) SELECT a, b, c, year, month, day, hour FROM
siem_p_csv.mytable;

0: jdbc:hive2://> select * from siem_p_csv.mytable;
+-------------+------------+------------+---------------+----------------+--------------+---------------+
|  mytable.a  | mytable.b  | mytable.c  | mytable.year  | mytable.month  |
mytable.day  | mytable.hour  |
+-------------+------------+------------+---------------+----------------+--------------+---------------+
| 1683393894  | secret     | 50505      | 2023          | 5              |
5            | 5             |
+-------------+------------+------------+---------------+----------------+--------------+---------------+
1 row selected (0.15 seconds)

0: jdbc:hive2://> select * from siem_e_orc.mytable;
+-------------+------------+------------+---------------+----------------+--------------+---------------+
|  mytable.a  | mytable.b  | mytable.c  | mytable.year  | mytable.month  |
mytable.day  | mytable.hour  |
+-------------+------------+------------+---------------+----------------+--------------+---------------+
| 1683393894  | secret     | 50505      | 2023          | 5              |
5            | 5             |
+-------------+------------+------------+---------------+----------------+--------------+---------------+

0: jdbc:hive2://> exit

$ hive --orcfiledump
/test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0
.
.
Processing data file
/test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0 [length: 407]
Structure for /test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0
File Version: 0.12 with ORC_14

Rows: 1

Compression: ZLIB

Compression size: 262144

Calendar: Julian/Gregorian

Type: struct<a:bigint,b:string,c:int>


Stripe Statistics:
  Stripe 1:

    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 9 min: 1683393894 max:
1683393894 sum: 1683393894
    Column 2: count: 1 hasNull: false bytesOnDisk: 15 min: secret max:
secret sum: 6
    Column 3: count: 1 hasNull: false bytesOnDisk: 8 min: 50505 max: 50505
sum: 50505


File Statistics:

  Column 0: count: 1 hasNull: false

  Column 1: count: 1 hasNull: false bytesOnDisk: 9 min: 1683393894 max:
1683393894 sum: 1683393894
  Column 2: count: 1 hasNull: false bytesOnDisk: 15 min: secret max: secret
sum: 6
  Column 3: count: 1 hasNull: false bytesOnDisk: 8 min: 50505 max: 50505
sum: 50505

Stripes:

  Stripe: offset: 3 data: 32 rows: 1 tail: 54 index: 103

    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 31
    Stream: column 2 section ROW_INDEX start: 45 length 32
    Stream: column 3 section ROW_INDEX start: 77 length 29

    Stream: column 1 section DATA start: 106 length 9

    Stream: column 2 section DATA start: 115 length 9
    Stream: column 2 section LENGTH start: 124 length 6
    Stream: column 3 section DATA start: 130 length 8
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

    Encoding column 3: DIRECT_V2

File length: 407 bytes

Padding length: 0 bytes
Padding ratio: 0%

_____________________________________________________________________________________


I still see that the ORC file generated by 4.0.0-alpha2 also has plain text
in column "b". Am I missing something? I appreciate any help you can give.

best regards
-- Maruthi

Hive 4.x and ORC 1.6 column encryption

Reply via email to