For the problem of results not being returned to the console, I think it
occurs because the default file system is set to the local file system
rather than HDFS. hive.exec.scratchdir may already be set to /tmp/hive,
but if the default file system is local, FileSinkOperator writes the final
result to the local file system of the container it runs in. HiveServer2
then tries to read from a subdirectory under /tmp/hive on its own local
file system and finds nothing, hence the empty result. (The plain 'select
* from ...' query works because it is converted to a fetch task and
executed by HiveServer2 itself, so no Tez containers are involved.)
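
A quick way to confirm this (assuming the stock Hadoop and Hive CLIs are
on the path; file:/// is Hadoop's built-in default when fs.defaultFS has
never been set, so that is what I would expect you to see):

--------------------------------------------------------------------------------------------------------

[root #] hdfs getconf -confKey fs.defaultFS
file:///

hive> set fs.defaultFS;
fs.defaultFS=file:///

--------------------------------------------------------------------------------------------------------

If both print file:///, the 'New Final Path: FS file:/tmp/...' line in
your container syslog fits the same picture: the sink is writing to a
file: URI rather than an hdfs: one.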

I can think of two solutions: 1) set the default file system to HDFS
(e.g., by setting fs.defaultFS in core-site.xml); 2) embed the file
system directly in hive.exec.scratchdir (e.g., by setting it to a fully
qualified URI such as hdfs://<namenode-host>:8020/tmp/hive; note that
hdfs://tmp/hive would be parsed as a URI whose host is 'tmp').
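
For reference, a minimal sketch of both fixes; namenode.example.com:8020
is a placeholder for your actual NameNode address and port:

--------------------------------------------------------------------------------------------------------

<!-- core-site.xml: solution 1, make HDFS the default file system -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- hive-site.xml: solution 2, pin the scratch dir to HDFS explicitly -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>hdfs://namenode.example.com:8020/tmp/hive</value>
</property>

--------------------------------------------------------------------------------------------------------

Either one alone should be enough; restart HiveServer2 (and the
metastore) afterwards so the new configuration is picked up.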

--- gla

On Thu, Nov 7, 2019 at 3:12 AM Aaron Grubb <aaron.gr...@clearpier.com>
wrote:

> Hello all,
>
>
>
> I'm running a from-scratch cluster on AWS EC2. I have an external table
> (partitioned) defined with data on S3. I'm able to query this table and
> receive results to the console with a simple select * statement:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> hive> set hive.execution.engine=tez;
>
> hive> select * from external_table where partition_1='1' and
> partition_2='2';
>
> [correct results returned]
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> Running a query that requires Tez doesn't return the results to the
> console:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> hive> set hive.execution.engine=tez;
>
> hive> select count(*) from external_table where partition_1='1' and
> partition_2='2';
>
> Status: Running (Executing on YARN cluster with App id
> application_1572972524483_0012)
>
>
>
> OK
>
> +------+
>
> | _c0  |
>
> +------+
>
> +------+
>
> No rows selected (8.902 seconds)
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> However, if I dig in the logs and on the filesystem, I can find the
> results from that query:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> (yarn.resourcemanager.log)
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1572972524483_0022
> CONTAINERID=container_1572972524483_0022_01_000002 RESOURCE=<memory:1024,
> vCores:1> QUEUENAME=default
>
> (container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New
> Final Path: FS file:/tmp/[REALLY LONG FILE PATH]/000000_0
>
> [root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
>
> SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒
> j¹▒ 2060
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> 2060 is the correct count for the partition.
>
>
>
> Now, oddly enough, I'm able to get the results from the application if I
> insert overwrite directory on HDFS:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> hive> set hive.execution.engine=tez;
>
> hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from
> external_table where partition_1='1' and partition_2='2';
>
> [root #] hdfs dfs -cat /tmp/local_out/000000_0
>
> 2060
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> However, attempting to insert overwrite local directory fails:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> hive> set hive.execution.engine=tez;
>
> hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*)
> from external_table where partition_1='1' and partition_2='2';
>
> [root #] cat /tmp/local_out/000000_0
>
> cat: /tmp/local_out/000000_0: No such file or directory
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> If I cat the container result file for this query, it's only the number,
> no class name or special characters:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> [root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
>
> 2060
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> The only out-of-place log message I can find comes from the YARN
> ResourceManager log:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> (yarn.resourcemanager.log) INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1572972524483_0023
> CONTAINERID=container_1572972524483_0023_01_000004 RESOURCE=<memory:1024,
> vCores:1> QUEUENAME=default
>
> (yarn.resourcemanager.log) WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> IP=NMIP OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE
> DESCRIPTION=Trying to release container not owned by app or with invalid
> id. PERMISSIONS=Unauthorized access or invalid container
> APPID=application_1572972524483_0023
> CONTAINERID=container_1572972524483_0023_01_000004
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> I've also tried creating a table and inserting data into it. The table
> creates just fine, but when I try to insert data, it throws an error:
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
> hive> set hive.execution.engine=tez;
>
> hive> insert into test_table (test_col) values ('blah'), ('blahblah');
>
> Query ID = root_20191106172949_5301b127-7219-46d1-8fd2-dc80ca7e96ee
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> Status: Failed
>
> Vertex failed, vertexName=Map 1, vertexId=vertex_1573060958692_0001_1_00,
> diagnostics=[Vertex vertex_1573060958692_0001_1_00 [Map 1] killed/failed
> due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: _dummy_table initializer
> failed, vertex=vertex_1573060958692_0001_1_00 [Map 1],
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> does not exist:
> file:/tmp/root/a9b76683-8e19-446a-be74-7a5daedf70e5/hive_2019-11-06_17-29-49_820_224977921325223208-2/dummy_path
>
>         at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
>
>         at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
>
>         at
> org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:134)
>
>         at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
>
>         at
> org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76)
>
>         at
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:321)
>
>         at
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:444)
>
>         at
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:564)
>
>         at
> org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:488)
>
>         at
> org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:337)
>
>         at
> org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:122)
>
>         at
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>
>         at
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>
>         at java.security.AccessController.doPrivileged(Native Method)
>
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>
>         at
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>
>         at
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>
>         at
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
>
>         at
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
>
>         at
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> My versions are as follows:
>
>
>
> Hadoop 3.2.1
>
> Hive 3.1.2
>
> Tez 0.9.2
>
>
>
> Any help is much appreciated!
>
