Hello all, I'm running a from-scratch cluster on AWS EC2, and I have a partitioned external table defined over data on S3.
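For reference, the table definition is along these lines (the column names, types, file format, and bucket path here are placeholders, not my real ones):

--------------------------------------------------------------------------------------------------------
-- Illustrative sketch only; the real table has different columns, format, and location.
CREATE EXTERNAL TABLE external_table (
  id STRING,
  payload STRING
)
PARTITIONED BY (partition_1 STRING, partition_2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://my-bucket/path/to/data/';
--------------------------------------------------------------------------------------------------------

I'm able to query this table and get results back to the console with a simple select * statement: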
--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> select * from external_table where partition_1='1' and partition_2='2';
[correct results returned]
--------------------------------------------------------------------------------------------------------

Running a query that actually requires a Tez job, however, returns no results to the console:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> select count(*) from external_table where partition_1='1' and partition_2='2';
Status: Running (Executing on YARN cluster with App id application_1572972524483_0012)

OK
+------+
| _c0  |
+------+
+------+
No rows selected (8.902 seconds)
--------------------------------------------------------------------------------------------------------

However, if I dig through the logs and the filesystem, I can find the results of that query:

--------------------------------------------------------------------------------------------------------
(yarn.resourcemanager.log)
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572972524483_0022 CONTAINERID=container_1572972524483_0022_01_000002 RESOURCE=<memory:1024, vCores:1> QUEUENAME=default

(container_folder/syslog_attempt)
[TezChild] |exec.FileSinkOperator|: New Final Path: FS file:/tmp/[REALLY LONG FILE PATH]/000000_0

[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒j¹▒ 2060
--------------------------------------------------------------------------------------------------------

2060 is the correct count for the partition.
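If I understand Hive's behavior correctly, the plain select * above probably never launches a Tez job at all, since Hive serves simple selects through a local fetch task; that would explain why only job-launching queries misbehave. One way to confirm (hive.fetch.task.conversion is a standard property; 'more' is the default in Hive 3):

--------------------------------------------------------------------------------------------------------
hive> set hive.fetch.task.conversion;       -- show the current value; defaults to 'more'
hive> set hive.fetch.task.conversion=none;  -- force even select * through the execution engine
hive> select * from external_table where partition_1='1' and partition_2='2';
-- My expectation: with fetch conversion disabled, this select * would also come back
-- empty if the problem is in the Tez result hand-off.
--------------------------------------------------------------------------------------------------------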
Now, oddly enough, I am able to get the results from the application if I INSERT OVERWRITE DIRECTORY on HDFS:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';

[root #] hdfs dfs -cat /tmp/local_out/000000_0
2060
--------------------------------------------------------------------------------------------------------

However, attempting to INSERT OVERWRITE LOCAL DIRECTORY fails:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';

[root #] cat /tmp/local_out/000000_0
cat: /tmp/local_out/000000_0: No such file or directory
--------------------------------------------------------------------------------------------------------

If I cat the container result file for this query, it contains only the number, with no class names or special characters:

--------------------------------------------------------------------------------------------------------
[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
2060
--------------------------------------------------------------------------------------------------------

The only out-of-place log message I can find comes from the YARN ResourceManager log:

--------------------------------------------------------------------------------------------------------
(yarn.resourcemanager.log)
INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572972524483_0023 CONTAINERID=container_1572972524483_0023_01_000004 RESOURCE=<memory:1024, vCores:1> QUEUENAME=default

(yarn.resourcemanager.log)
WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=NMIP OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id. PERMISSIONS=Unauthorized access or invalid container APPID=application_1572972524483_0023 CONTAINERID=container_1572972524483_0023_01_000004
--------------------------------------------------------------------------------------------------------
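One thing that stands out to me: the FileSinkOperator final path above (and the dummy_path error further down) use the file:/ scheme rather than hdfs:/, so I suspect something is resolving the default filesystem or the scratch directory to the container host's local disk. A quick way to check (hdfs getconf and these Hive properties are standard; the expected values are my assumption):

--------------------------------------------------------------------------------------------------------
[root #] hdfs getconf -confKey fs.defaultFS   # I would expect hdfs://<namenode>:<port>, not file:///
hive> set hive.exec.scratchdir;               # defaults to /tmp/hive on the default filesystem
hive> set hive.exec.stagingdir;               # defaults to .hive-staging
--------------------------------------------------------------------------------------------------------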
I've also tried creating a table and inserting data into it. The table creates just fine, but inserting data throws an error:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> insert into test_table (test_col) values ('blah'), ('blahblah');
Query ID = root_20191106172949_5301b127-7219-46d1-8fd2-dc80ca7e96ee
Total jobs = 1
Launching Job 1 out of 1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1573060958692_0001_1_00, diagnostics=[Vertex vertex_1573060958692_0001_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: _dummy_table initializer failed, vertex=vertex_1573060958692_0001_1_00 [Map 1], org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/root/a9b76683-8e19-446a-be74-7a5daedf70e5/hive_2019-11-06_17-29-49_820_224977921325223208-2/dummy_path
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:134)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
	at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:321)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:444)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:564)
	at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:488)
	at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:337)
	at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:122)
	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
--------------------------------------------------------------------------------------------------------

My versions are as follows:

Hadoop 3.2.1
Hive 3.1.2
Tez 0.9.2
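For what it's worth, one sanity check I can still run is forcing the MapReduce engine to see whether the failure is specific to the Tez path (MR is deprecated in Hive 3 but should still work for debugging):

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=mr;
hive> insert into test_table (test_col) values ('blah'), ('blahblah');
-- If this succeeds, the failure above is specific to Tez rather than the metastore or table setup.
--------------------------------------------------------------------------------------------------------

Any help is much appreciated!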