[ https://issues.apache.org/jira/browse/HIVE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059280#comment-17059280 ]
Sungwoo commented on HIVE-21164:
--------------------------------

I have tested this commit (in the master branch) with TPC-DS datasets and found that it fails to populate ORC tables from text tables. Our testing proceeds as follows. All the datasets reside on S3 simulated with MinIO, not on HDFS.

1. Generate a TPC-DS dataset in text format.

2. Create an external table. Here is an example of the command (some lines are omitted):

{code:sql}
create external table store_sales (
  ss_sold_date_sk bigint,
  ...
  ss_net_profit double
)
row format delimited fields terminated by '|'
location 's3a://tmp/tpcds-generate/2/store_sales';
{code}

3. Create a managed ORC table and populate it from the text table. Here is an example of the command (some lines are omitted):

{code:sql}
create table store_sales (
  ss_sold_time_sk bigint,
  ...
  ss_net_profit double
)
partitioned by (ss_sold_date_sk bigint)
stored as orc
TBLPROPERTIES ('transactional'='true', 'transactional_properties'='default');

from tpcds_text_2.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ... ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ... ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is null
  sort by ss.ss_sold_date_sk;
{code}

4. The result is that the new ORC table store_sales is created, but it contains no rows:

{code:sh}
0: jdbc:hive2://orange1:9852/> select count(*) from store_sales;
...
+----------+
|   _c0    |
+----------+
| 5501397  |
+----------+
1 row selected (0.766 seconds)
0: jdbc:hive2://orange1:9852/> select * from store_sales limit 100;
...
No rows selected (94.835 seconds)
{code}

The directory for the new table contains the expected subdirectories, but no bucket files, e.g.:

{code:none}
store_sales/ss_sold_date_sk=2451465
store_sales/ss_sold_date_sk=2451465/base_0000001
store_sales/ss_sold_date_sk=2451465/base_0000001/_orc_acid_version
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/base_0000001
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/base_0000001/_orc_acid_version
{code}

5. What happens during the creation of the ORC table is:

1) The query completes successfully.
2) All intermediate data is created (over 300 megabytes), but it is deleted at the end.

The relevant parts of the HiveServer2 log look like:

{code:none}
2020-03-13T20:21:12,669 INFO [HiveServer2-Background-Pool: Thread-259] FileOperations: Reading manifest s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001/000013_0.manifest
2020-03-13T20:21:13,062 INFO [HiveServer2-Background-Pool: Thread-259] FileOperations: Expected level of nesting (2) is not present in _tmp.base_0000001/000000_0.manifest (from s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001/000000_0.manifest)
...
2020-03-13T20:21:13,233 INFO [HiveServer2-Background-Pool: Thread-259] FileOperations: Deleting manifest directory s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001
2020-03-13T20:21:13,438 INFO [HiveServer2-Background-Pool: Thread-259] FileOperations: Deleting s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/ss_sold_date_sk=2451257/base_0000001/bucket_00003_0 that was not committed
{code}

From the log file, all the bucket files are deleted at the end because they are considered not committed.
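As a side note on the output above: count( * ) returning a non-zero value while a full scan returns no rows suggests the count may be answered from metastore statistics rather than from the actual files. One way to check this (a hedged suggestion on my part, using the standard Hive setting hive.compute.query.using.stats; I have not re-run the experiment with it):

{code:sql}
-- Force count(*) to scan the files instead of answering from
-- metastore statistics (hive.compute.query.using.stats is a
-- standard Hive setting; the outcome below is an assumption).
set hive.compute.query.using.stats=false;
select count(*) from store_sales;
{code}

If this returns 0, the count( * ) discrepancy is just stale statistics, and the underlying problem remains the missing bucket files.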
This behavior first appears with this commit: the previous commit in the master branch of Hive, namely 'HIVE-22816: QueryCache: Queries using views can have them cached after CTE expansion' of February 20, does not show the same behavior and produces the correct result. If you think I am missing something in the setup of the experiment, please let me know. If anyone updates the patch, I can quickly re-run the experiment and report the result back. Unfortunately I have not yet figured out why this commit produces a different result.

> ACID: explore how we can avoid a move step during inserts/compaction
> --------------------------------------------------------------------
>
> Key: HIVE-21164
> URL: https://issues.apache.org/jira/browse/HIVE-21164
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.1.1
> Reporter: Vaibhav Gumashta
> Assignee: Marta Kuczora
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21164.1.patch, HIVE-21164.10.patch,
> HIVE-21164.11.patch, HIVE-21164.11.patch, HIVE-21164.12.patch,
> HIVE-21164.13.patch, HIVE-21164.14.patch, HIVE-21164.14.patch,
> HIVE-21164.15.patch, HIVE-21164.16.patch, HIVE-21164.17.patch,
> HIVE-21164.18.patch, HIVE-21164.19.patch, HIVE-21164.2.patch,
> HIVE-21164.20.patch, HIVE-21164.21.patch, HIVE-21164.22.patch,
> HIVE-21164.3.patch, HIVE-21164.4.patch, HIVE-21164.5.patch,
> HIVE-21164.6.patch, HIVE-21164.7.patch, HIVE-21164.8.patch, HIVE-21164.9.patch
>
> Currently, we write compacted data to a temporary location and then move the
> files to a final location, which is an expensive operation on some cloud file
> systems. Since HIVE-20823 is already in, it can control the visibility of
> compacted data for the readers. Therefore, we can perhaps avoid writing data
> to a temporary location and directly write compacted data to the intended
> final path.

-- This message was sent by Atlassian Jira (v8.3.4#803005)