[
https://issues.apache.org/jira/browse/IMPALA-14189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-14189:
-------------------------------------
Description:
Currently Impala doesn't delete files in subdirectories while Hive does,
though both Hive and Impala do recursive listing by default in external tables
(can be disabled with
impala.disable.recursive.listing).
Current behavior in Impala:
insert overwrite: deletes subdirectories for partitioned tables, but does not
delete them for non-partitioned tables
truncate: never deletes subdirectories
Example:
{code}
show files in texternal; -- return a single file in a subdirectory (nested_dir)
-> hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
truncate texternal;
show files in texternal; --returns the same result
-> hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
insert overwrite texternal select * from texternal;
show files in texternal; -- the file in the subdir is still kept after insert overwrite
hdfs://localhost:20500/test-warehouse/texternal/f549975b8cf16b86-19a0de0d00000000_1586861351_data.0.txt
hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
{code}
Hive deletes subdirectories during both truncate and insert overwrite
(it probably skips hidden folders, but I didn't check).
I think that the correct solution would be to always delete the files that are
considered part of the table.
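As a rough illustration of that proposal, the following Python sketch models "delete exactly the files that the listing returns" on a local directory. It is not Impala or Hive code: the function names (`list_table_files`, `truncate`, `demo`) are hypothetical, and the rule that names starting with `.` or `_` are hidden is an assumption borrowed from common Hadoop conventions, not something verified here.

```python
import os
import tempfile


def is_hidden(name):
    # Assumption: names starting with '.' or '_' count as hidden.
    return name.startswith((".", "_"))


def list_table_files(root, recursive=True):
    """Return the files a listing would treat as table data."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place: skip hidden directories, or stop descending
        # entirely when recursive listing is disabled.
        dirnames[:] = sorted(d for d in dirnames if not is_hidden(d)) if recursive else []
        files.extend(
            os.path.join(dirpath, f) for f in sorted(filenames) if not is_hidden(f)
        )
    return files


def truncate(root, recursive=True):
    """Delete every file the listing returns, so both always agree."""
    removed = list_table_files(root, recursive)
    for path in removed:
        os.remove(path)
    return removed


def demo():
    # Mirror the example above: one data file in nested_dir, plus a
    # hidden directory that the listing ignores.
    with tempfile.TemporaryDirectory() as root:
        for rel_dir, name in (("nested_dir", "a.txt"), (".hidden", "tmp.txt")):
            os.makedirs(os.path.join(root, rel_dir))
            with open(os.path.join(root, rel_dir, name), "w") as f:
                f.write("x\n")
        before = [os.path.relpath(p, root) for p in list_table_files(root)]
        truncate(root)
        after = [os.path.relpath(p, root) for p in list_table_files(root)]
        hidden_kept = os.path.exists(os.path.join(root, ".hidden", "tmp.txt"))
        return before, after, hidden_kept
```

Because `truncate` reuses `list_table_files`, disabling recursive listing automatically limits deletion to top-level files as well. The sketch removes files only and leaves the now-empty directories in place; whether the directories themselves should be removed is a separate decision.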
was:
Currently Impala doesn't delete files in sub directories while Hive does,
though both Hive and Impala do recursive listing by default in external tables
(can be disabled with
impala.disable.recursive.listing).
Example:
{code}
show files in texternal; -- return a single file in a subdirectory (nested_dir)
-> hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
truncate texternal;
show files in texternal; --returns the same result
-> hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
insert overwrite texternal select * from texternal;
show files in texternal; -- the file in the subdir is still kept after insert overwrite
hdfs://localhost:20500/test-warehouse/texternal/f549975b8cf16b86-19a0de0d00000000_1586861351_data.0.txt
hdfs://localhost:20500/test-warehouse/texternal/nested_dir/a.txt
{code}
Hive deletes sub directories both during truncate and insert overwrite
(probably skips hidden folders, didn't check)
I think that the correct solution would be to always delete the files that are
considered part of the table.
Note that
> Clean up subdirectories in truncate/insert overwrite if recursive listing is
> enabled
> -----------------------------------------------------------------------------------
>
> Key: IMPALA-14189
> URL: https://issues.apache.org/jira/browse/IMPALA-14189
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Csaba Ringhofer
> Priority: Critical