Peter Csaszar created HIVE-17532:
------------------------------------

             Summary: Hive on Spark query compilation starts Spark session
                 Key: HIVE-17532
                 URL: https://issues.apache.org/jira/browse/HIVE-17532
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 2.2.0
            Reporter: Peter Csaszar
            Priority: Minor


Hive on Spark query compilation starts a new Spark session when some kind of 
aggregation is present:

0: jdbc:hive2://localhost:10000/default> set hive.execution.engine=spark;
No rows affected (0.013 seconds)
0: jdbc:hive2://localhost:10000/default> explain select distinct label0 from iris;
INFO  : Compiling command(queryId=hive_20170912151212_914ee322-28dd-442a-9dd9-7ed00a6a8caf): explain select distinct label0 from iris
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20170912151212_914ee322-28dd-442a-9dd9-7ed00a6a8caf); Time taken: *40.594* seconds

Once the Spark job has started, all subsequent explain statements are fast:

0: jdbc:hive2://localhost:10000/default> explain select distinct a1 from iris;
INFO  : Compiling command(queryId=hive_20170912151414_faacda24-290e-48bb-9daf-3f301fc170c1): explain select distinct label0 from iris
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20170912151414_faacda24-290e-48bb-9daf-3f301fc170c1); Time taken: *0.275* seconds

After the Spark job is killed, the same query is still fast, and no new Spark 
job is started:

0: jdbc:hive2://localhost:10000/default> explain select distinct a2 from iris;
INFO  : Compiling command(queryId=hive_20170912151616_a7ea83b6-03ce-4636-b3d4-be6feadcde35): explain select distinct label0 from iris
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20170912151616_a7ea83b6-03ce-4636-b3d4-be6feadcde35); Time taken: *0.213* seconds

The code in question:
SetSparkReducerParallelism.java:
sparkSessionManager = SparkSessionManagerImpl.getInstance();
sparkSession = SparkUtilities.getSparkSession(context.getConf(), sparkSessionManager);
sparkMemoryAndCores = sparkSession.getMemoryAndCores();

The Spark session created here is used only to obtain the number of cores and 
the amount of memory. These values could be determined from the configuration 
without actually starting a session.
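
A minimal sketch of the configuration-based approach, not the actual Hive 
fix. It assumes static allocation, so that spark.executor.memory, 
spark.executor.cores and spark.executor.instances fully describe the 
resources. The class ConfigBasedSparkResources, its methods, the plain Map 
standing in for the Hive/Spark configuration, and the fallback defaults are 
all hypothetical and chosen for illustration only. With dynamic allocation 
the executor count is not known up front, and this estimate would not apply.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: estimates memory and cores from config values only. */
public class ConfigBasedSparkResources {

    /** Simple holder for the estimate (illustrative, not a Hive type). */
    static final class Estimate {
        final long totalMemoryMb;
        final int totalCores;

        Estimate(long totalMemoryMb, int totalCores) {
            this.totalMemoryMb = totalMemoryMb;
            this.totalCores = totalCores;
        }
    }

    /** Parses a memory string with a unit suffix, e.g. "512m" or "4g", into MiB. */
    static long parseMemoryMb(String value) {
        String v = value.trim().toLowerCase();
        if (v.length() < 2) {
            throw new IllegalArgumentException("Expected <number><unit>: " + value);
        }
        long number = Long.parseLong(v.substring(0, v.length() - 1));
        switch (v.charAt(v.length() - 1)) {
            case 'k': return number / 1024;
            case 'm': return number;
            case 'g': return number * 1024;
            case 't': return number * 1024 * 1024;
            default:
                throw new IllegalArgumentException("Unknown memory unit: " + value);
        }
    }

    /**
     * Estimates total executor memory and cores without starting a session.
     * The fallback values ("1g", 1 core, 2 executors) are assumptions of this
     * sketch; a real implementation would use the defaults of the deployment.
     */
    static Estimate estimate(Map<String, String> conf) {
        long memoryMb = parseMemoryMb(conf.getOrDefault("spark.executor.memory", "1g"));
        int cores = Integer.parseInt(conf.getOrDefault("spark.executor.cores", "1"));
        int instances = Integer.parseInt(conf.getOrDefault("spark.executor.instances", "2"));
        return new Estimate(memoryMb * instances, cores * instances);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.executor.memory", "4g");
        conf.put("spark.executor.cores", "2");
        conf.put("spark.executor.instances", "3");

        Estimate e = estimate(conf);
        System.out.println("memory (MiB): " + e.totalMemoryMb + ", cores: " + e.totalCores);
    }
}
```

Because nothing here contacts the cluster, a call like this during 
compilation would not incur the session startup cost shown in the first 
explain above. The trade-off is that the numbers reflect what was requested 
in the configuration, not what the cluster actually granted.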



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
