The curious case of the hive-server-2 empty partitions.

Edward Capriolo Tue, 24 Mar 2015 10:47:07 -0700

Hey all,

I am running Cloudera 5.3 and have an issue involving HiveServer2 and Hive.


We have a process that launches Hive JDBC queries, hourly. This process
selects from one table and builds another.

It looks something like this (query slightly obfuscated):

     FROM beacon
     INSERT OVERWRITE TABLE author_article_hourly PARTITION (dt=2015032412)
       SELECT author, article_id,
         sum(if(referrer_fields.type != 'seed' AND clicktype='beauty',1,0)) AS viral_count,
         sum(if(referrer_fields.type = 'seed' AND clicktype='beauty',1,0)) AS seed_count,
         sum(if(clicktype='beauty',1,0)) AS pageview,
         sum(if(clicktype='click',1,0)) AS clicks
       WHERE dt=2015032412 AND (author IS NOT NULL OR article_id IS NOT NULL)
       GROUP BY author,article_id,dt
       ORDER BY viral_count DESC
       LIMIT 1000000000
     INSERT OVERWRITE TABLE author_hourly PARTITION (dt=2015032412)
       SELECT author,
         sum(if(referrer_fields.type != 'seed' AND clicktype='beauty',1,0)) AS viral_count,
         sum(if(referrer_fields.type = 'seed' AND clicktype='beauty',1,0)) AS seed_count,
         sum(if(clicktype='beauty',1,0)) AS pageview,
         sum(if(clicktype='click',1,0)) AS clicks
       WHERE dt=2015032412 AND (author IS NOT NULL OR article_id IS NOT NULL)
       GROUP BY author,dt
       ORDER BY viral_count DESC
       LIMIT 1000000000


1) I have confirmed that the source table had data at the time of the query.
2) The JDBC statement did not throw an exception.
3) The jobs that produced one empty output file ran as long as those that
produced data.

           1            1             655720 /user/hive/warehouse/author_hourly/dt=2015032222
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032223
           1            1             644289 /user/hive/warehouse/author_hourly/dt=2015032300
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032301
           1            1             640076 /user/hive/warehouse/author_hourly/dt=2015032302
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032303
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032304
           1            1             715033 /user/hive/warehouse/author_hourly/dt=2015032320
           1            1             691352 /user/hive/warehouse/author_hourly/dt=2015032321
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032322
           1            1             653690 /user/hive/warehouse/author_hourly/dt=2015032323
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032400
           1            1             650930 /user/hive/warehouse/author_hourly/dt=2015032401
           1            1             639389 /user/hive/warehouse/author_hourly/dt=2015032402
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032403
           1            1             544848 /user/hive/warehouse/author_hourly/dt=2015032404
           1            1             495953 /user/hive/warehouse/author_hourly/dt=2015032405
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032406
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032407
           1            1             425209 /user/hive/warehouse/author_hourly/dt=2015032408
           1            1             443696 /user/hive/warehouse/author_hourly/dt=2015032409
           1            1             472888 /user/hive/warehouse/author_hourly/dt=2015032410
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032411
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032412
           1            1                  0 /user/hive/warehouse/author_hourly/dt=2015032413
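
A listing like the one above can be scanned for empty partitions automatically. The following is only a minimal sketch: it assumes the listing has four whitespace-separated fields per entry (directory count, file count, content size in bytes, path), which is the layout `hadoop fs -count` prints. The function name `empty_partitions` and the sample text are made up for illustration.

```python
def empty_partitions(listing):
    """Return the paths whose content size is 0.

    Assumes four whitespace-separated fields per entry:
    dir count, file count, content size, path. Tokens are grouped
    in fours, so entries wrapped across two lines still parse.
    """
    tokens = listing.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of 4 fields, got %d tokens" % len(tokens))
    empty = []
    for i in range(0, len(tokens), 4):
        _dirs, _files, size, path = tokens[i:i + 4]
        if int(size) == 0:
            empty.append(path)
    return empty


# Hypothetical usage with three entries copied from the listing above.
sample = """
           1            1             655720
/user/hive/warehouse/author_hourly/dt=2015032222
           1            1                  0
/user/hive/warehouse/author_hourly/dt=2015032223
           1            1             644289
/user/hive/warehouse/author_hourly/dt=2015032300
"""
for path in empty_partitions(sample):
    print(path)
```

Run after each hourly load, a check like this would at least flag the affected hours so they can be re-run, even though it says nothing about the cause.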

I have turned HiveServer2 logging up to DEBUG. There are no log entries at
ERROR level.

The folders with 0 bytes each contain a single empty file named 0_000000.

Sifting through the job history server logs, I have found this:

2015-03-24 16:16:35,446 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1422629510062_294146_01_000005
2015-03-24 16:16:35,447 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1
AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0
ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:1
2015-03-24 16:16:35,448 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
Diagnostics report from attempt_1422629510062_294146_m_000000_0:
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
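
For reference, exit code 143 follows the common convention of 128 plus a signal number, so it corresponds to signal 15 (SIGTERM). The sketch below only decodes that convention; the helper name `describe_exit_code` is hypothetical, and the decoding alone does not show whether the kill was a normal cleanup or a real failure.

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a 128+N style exit code, else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # 128+N where N is not a known signal on this platform.
        return None


print(describe_exit_code(143))
```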

Can anyone explain why queries launched through HiveServer2 sometimes move
empty files to the final directory?  I am pretty clueless as to the cause.
I am assuming load on the cluster is killing YARN containers, and Hive may
not be trapping these kills and so produces no output? (Just a theory.)

Thanks
