[
https://issues.apache.org/jira/browse/IMPALA-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
woosuk.ro updated IMPALA-13667:
-------------------------------
Description:
h3. *Description*
When using Impala with Ranger for data masking, applying a {{mask_hash}} policy
to columns in both tables and views results in the {{mask_hash}} function being
nested multiple times. This behavior leads to redundant hashing operations. Is
this intended behavior?
h3. *Steps to Reproduce*
# *Apply Masking Policies:*
** Apply a {{mask_hash}} policy to a specific column (e.g., {{account_number}}) across all tables in two databases, {{temp_db}} and {{private_db}}.
# *Create a Base Table:*
{code:sql}
CREATE TABLE private_db.base_table (
  account_number STRING,
  other_column STRING
);{code}
# *Create a View Referencing the Base Table:*
{code:sql}
CREATE VIEW private_db.base_view AS
SELECT * FROM private_db.base_table;{code}
# *Create Another View Referencing the First View:*
{code:sql}
CREATE VIEW temp_db.secondary_view AS
SELECT * FROM private_db.base_view;{code}
# *Execute a Query on the Second View:*
{code:sql}
SELECT * FROM temp_db.secondary_view;{code}
h3. *Expected Behavior*
The {{mask_hash}} function should be applied *once* to the {{account_number}}
column, regardless of the number of view layers referencing the masked table or
view.
----
h3. *Actual Behavior*
The {{mask_hash}} function is applied *three times* to the {{account_number}}
column, once for each object in the chain ({{private_db.base_table}},
{{private_db.base_view}}, and {{temp_db.secondary_view}}). The nesting appears
in both the query execution plan and the Ranger audit logs.
*Example Query Execution Plan:*
{code:java}
WARNING: The following tables are missing relevant table and/or column statistics.
private_db.base_table

Analyzed query: SELECT * FROM (SELECT mask_hash(account_number) account_number, my_account_number FROM temp_db.secondary_view)

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
PLAN-ROOT SINK
  output exprs: mask_hash(mask_hash(mask_hash(account_number))), my_account_number
  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0
{code}
*Ranger Audit Logs:*
# *temp_db.secondary_view account_number column masking*
{code:java}
{
  "access": "mask_hash",
  "resource": "temp_db/secondary_view/account_number",
  "resType": "@column",
  "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
# *private_db.base_view account_number column masking*
{code:java}
{
  "access": "mask_hash",
  "resource": "private_db/base_view/account_number",
  "resType": "@column",
  "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
# *private_db.base_table account_number column masking*
{code:java}
{
  "access": "mask_hash",
  "resource": "private_db/base_table/account_number",
  "resType": "@column",
  "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
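The three audit entries line up with the three {{mask_hash}} calls in the plan: one per object touched by the query. The pattern can be reproduced with a small Python sketch, assuming the mask is applied independently at every table or view the policy matches. This is a toy model only, not Impala's analyzer or a Ranger API; {{VIEW_SOURCES}}, {{MASKED_OBJECTS}}, and {{masked_expr}} are hypothetical names introduced for illustration.

```python
# Toy model of per-object column masking. NOT Impala's analyzer or a Ranger API.
# Each catalog object maps to the object its definition selects from
# (None for a base table). Names mirror the reproduction steps.
VIEW_SOURCES = {
    "temp_db.secondary_view": "private_db.base_view",
    "private_db.base_view": "private_db.base_table",
    "private_db.base_table": None,
}

# Assumption: the policy covers account_number on every object in both databases.
MASKED_OBJECTS = set(VIEW_SOURCES)


def masked_expr(obj, column, masked_objects):
    """Return the expression obj.column expands to when each masked object
    in the chain wraps the column in mask_hash()."""
    source = VIEW_SOURCES[obj]
    # Resolve the column through the underlying object first (innermost out).
    expr = column if source is None else masked_expr(source, column, masked_objects)
    # Wrap once more if this object itself is covered by the policy.
    return f"mask_hash({expr})" if obj in masked_objects else expr


# Policy on all three objects: one wrapper per layer.
print(masked_expr("temp_db.secondary_view", "account_number", MASKED_OBJECTS))

# Policy on the base table only: a single wrapper survives the view chain.
print(masked_expr("temp_db.secondary_view", "account_number",
                  {"private_db.base_table"}))
```

In this model, the nesting depth equals the number of policy-covered objects between the queried view and the base table. Restricting the policy to the base table yields a single {{mask_hash}} call, though whether that scoping is acceptable depends on who may query each object directly.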
h3. *Environment*
- Impala: 4.4.0
- Ranger: 2.3.0
> Unexpected Nested mask_hash Functions When Using Views in Impala with Ranger
> ----------------------------------------------------------------------------
>
> Key: IMPALA-13667
> URL: https://issues.apache.org/jira/browse/IMPALA-13667
> Project: IMPALA
> Issue Type: Question
> Components: Frontend
> Reporter: woosuk.ro
> Priority: Minor