hello Hive developer/users, We have a SOC (security ops center) that uses Hadoop (3.3.1) and Hive (3.1.2) community versions on ubuntu 20.04. We use hive external tables with ORC files stored in HDFS. We want to use the columnar encryption feature available in ORC 1.6. (https://issues.apache.org/jira/browse/ORC-14) I figured out that as of today, Hive 3.1.3 ships ORC 1.5.8, and Hive 4.0.0-alpha2 ships ORC 1.6.9. The ORC files created by hive external tables are in 0.12 and ORC_517
$ hive --orcfiledump <whateverpath>/000000_0 .. File Version: 0.12 with ORC_517 I am trying Hive 4.0.0-alpha2 in a staging environment to test column encryption using Hadoop KMS. Used these commands, but the data is still in plain text. I created a db siem_p_csv having tables in plain text files (csv), and another db siem_e_orc db to have tables in encrypted form (orc). 0: jdbc:hive2://> create external table siem_p_csv.mytable (a BIGINT, b STRING, c INT) PARTITIONED BY (year INT, month INT, day INT, hour INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/test/mytable/p/csv'; 0: jdbc:hive2://> ALTER TABLE siem_p_csv.mytable ADD IF NOT EXISTS PARTITION (year=2023, month=5, day=5, hour=5) LOCATION '/test/mytable/p/csv/year=2023/month=5/day=5/hour=5'; 0: jdbc:hive2://> create external table siem_e_orc.mytable (a BIGINT, b STRING, c INT) PARTITIONED BY (year INT, month INT, day INT, hour INT) STORED AS ORC LOCATION '/test/mytable/e/orc' TBLPROPERTIES('hadoop.security.key.provider.path' = "kms://http@hadoop-masternn0a:9600/kms", 'orc.key.provider' = "hadoop", 'orc.encrypt.pii' = "b"); 0: jdbc:hive2://> SET hive.exec.dynamic.partition=true; 0: jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict; 0: jdbc:hive2://> INSERT OVERWRITE TABLE siem_e_orc.mytable PARTITION (year, month, day, hour) SELECT a, b, c, year, month, day, hour FROM siem_p_csv.mytable; 0: jdbc:hive2://> select * from siem_p_csv.mytable; +-------------+------------+------------+---------------+----------------+--------------+---------------+ | mytable.a | mytable.b | mytable.c | mytable.year | mytable.month | mytable.day | mytable.hour | +-------------+------------+------------+---------------+----------------+--------------+---------------+ | 1683393894 | secret | 50505 | 2023 | 5 | 5 | 5 | +-------------+------------+------------+---------------+----------------+--------------+---------------+ 1 row selected (0.15 seconds) 0: jdbc:hive2://> select * from siem_e_orc.mytable; +-------------+------------+------------+---------------+----------------+--------------+---------------+ | mytable.a | mytable.b | mytable.c | mytable.year | mytable.month | mytable.day | mytable.hour | +-------------+------------+------------+---------------+----------------+--------------+---------------+ | 1683393894 | secret | 50505 | 2023 | 5 | 5 | 5 | +-------------+------------+------------+---------------+----------------+--------------+---------------+ 0: jdbc:hive2://> exit $ hive --orcfiledump /test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0 . . Processing data file /test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0 [length: 407] Structure for /test/mytable/e/orc/year=2023/month=5/day=5/hour=5/000000_0 File Version: 0.12 with ORC_14 Rows: 1 Compression: ZLIB Compression size: 262144 Calendar: Julian/Gregorian Type: struct<a:bigint,b:string,c:int> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 9 min: 1683393894 max: 1683393894 sum: 1683393894 Column 2: count: 1 hasNull: false bytesOnDisk: 15 min: secret max: secret sum: 6 Column 3: count: 1 hasNull: false bytesOnDisk: 8 min: 50505 max: 50505 sum: 50505 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 9 min: 1683393894 max: 1683393894 sum: 1683393894 Column 2: count: 1 hasNull: false bytesOnDisk: 15 min: secret max: secret sum: 6 Column 3: count: 1 hasNull: false bytesOnDisk: 8 min: 50505 max: 50505 sum: 50505 Stripes: Stripe: offset: 3 data: 32 rows: 1 tail: 54 index: 103 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 31 Stream: column 2 section ROW_INDEX start: 45 length 32 Stream: column 3 section ROW_INDEX start: 77 length 29 Stream: column 1 section DATA start: 106 length 9 Stream: column 2 section DATA start: 115 length 9 Stream: column 2 section LENGTH start: 124 length 6 Stream: column 3 section DATA start: 130 length 8 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DIRECT_V2 Encoding column 3: DIRECT_V2 File length: 407 bytes Padding length: 0 bytes Padding ratio: 0% _____________________________________________________________________________________ I still see that the ORC file generated by 4.0.0-alpha2 also has plain text in column "b". Am I missing something? I appreciate any help you can give. best regards -- Maruthi