spark sql - group by constant column

Lior Chaga Wed, 15 Jul 2015 00:11:53 -0700

Hi,

Facing a bug with group by in SparkSQL (version 1.4).
Registered a JavaRDD with object containing integer fields as a table.


Then I'm trying to do a group by, with a constant value in the group by
fields:

SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, num


I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
columns measure, primary_one, primary_two

Tried both with HiveContext and SqlContext.
The odd thing is that this kind of query actually works for me in a project
I'm working on, but I have there another bug (the group by does not yield
expected results).

The only reason I can think of is that maybe in my real project, the
context configuration is different.
In my above example the configuration of the HiveContext is empty.

In my real project, the configuration is shown below.
Any ideas?

Thanks,
Lior

Hive context configuration in project:
"(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
"(nfs3.mountd.port,4242)"
"(mapreduce.tasktracker.healthchecker.script.timeout,600000)"
"(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
"(mapreduce.input.fileinputformat.input.dir.recursive,false)"
"(hive.orc.compute.splits.num.threads,10)"
"(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
"(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
"(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
"(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
 "(hive.mapjoin.lazy.hashtable,true)"
 "(mapreduce.framework.name,local)"
 "(hive.exec.script.maxerrsize,100000)"
 "(dfs.namenode.checkpoint.txns,1000000)"
 "(tfile.fs.output.buffer.size,262144)"
 "(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
 "(mapreduce.tasktracker.local.dir.minspacekill,0)"
 "(hive.support.concurrency,false)"
 "(fs.s3.block.size,67108864)"
 "(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
 "(hive.stats.retries.max,0)"
 "(hadoop.hdfs.configuration.version,1)"
 "(dfs.bytes-per-checksum,512)"
 "(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
 "(mapreduce.job.acl-view-job, )"
 "(hive.typecheck.on.insert,true)"
 "(mapreduce.jobhistory.loadedjobs.cache.size,5)"
 "(mapreduce.jobtracker.persist.jobstatus.hours,1)"
 "(hive.unlock.numretries,10)"
 "(dfs.namenode.handler.count,10)"
 "(mapreduce.input.fileinputformat.split.minsize,1)"
 "(hive.plan.serialization.format,kryo)"
 "(dfs.datanode.failed.volumes.tolerated,0)"
 "(yarn.resourcemanager.container.liveness-monitor.interval-ms,600000)"
 "(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
 "(yarn.resourcemanager.client.thread-count,50)"
 "(io.seqfile.compress.blocksize,1000000)"
 "(mapreduce.tasktracker.http.threads,40)"
 "(hive.explain.dependency.append.tasktype,false)"
 "(dfs.namenode.retrycache.expirytime.millis,600000)"
 "(dfs.namenode.backup.address,0.0.0.0:50100)"
 "(hive.hwi.listen.host,0.0.0.0)"
 "(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
 "(dfs.replication,3)"
 "(mapreduce.jobtracker.jobhistory.block.size,3145728)"
 
"(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
 "(mapreduce.task.profile.maps,0-2)"
 "(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
 "(hive.stats.reliable,false)"
 "(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"

spark sql - group by constant column

Reply via email to