Hi,
Facing a bug with group by in SparkSQL (version 1.4).
Registered a JavaRDD with object containing integer fields as a table.
Then I'm trying to do a group by, with a constant value in the group by
fields:
SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, num
I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
columns measure, primary_one, primary_two
Tried both with HiveContext and SqlContext.
The odd thing is that this kind of query actually works for me in a project
I'm working on, but I have there another bug (the group by does not yield
expected results).
The only reason I can think of is that maybe in my real project, the
context configuration is different.
In my above example the configuration of the HiveContext is empty.
In my real project, the configuration is shown below.
Any ideas?
Thanks,
Lior
Hive context configuration in project:
"(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
"(nfs3.mountd.port,4242)"
"(mapreduce.tasktracker.healthchecker.script.timeout,600000)"
"(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
"(mapreduce.input.fileinputformat.input.dir.recursive,false)"
"(hive.orc.compute.splits.num.threads,10)"
"(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
"(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
"(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
"(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
"(hive.mapjoin.lazy.hashtable,true)"
"(mapreduce.framework.name,local)"
"(hive.exec.script.maxerrsize,100000)"
"(dfs.namenode.checkpoint.txns,1000000)"
"(tfile.fs.output.buffer.size,262144)"
"(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
"(mapreduce.tasktracker.local.dir.minspacekill,0)"
"(hive.support.concurrency,false)"
"(fs.s3.block.size,67108864)"
"(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
"(hive.stats.retries.max,0)"
"(hadoop.hdfs.configuration.version,1)"
"(dfs.bytes-per-checksum,512)"
"(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
"(mapreduce.job.acl-view-job, )"
"(hive.typecheck.on.insert,true)"
"(mapreduce.jobhistory.loadedjobs.cache.size,5)"
"(mapreduce.jobtracker.persist.jobstatus.hours,1)"
"(hive.unlock.numretries,10)"
"(dfs.namenode.handler.count,10)"
"(mapreduce.input.fileinputformat.split.minsize,1)"
"(hive.plan.serialization.format,kryo)"
"(dfs.datanode.failed.volumes.tolerated,0)"
"(yarn.resourcemanager.container.liveness-monitor.interval-ms,600000)"
"(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
"(yarn.resourcemanager.client.thread-count,50)"
"(io.seqfile.compress.blocksize,1000000)"
"(mapreduce.tasktracker.http.threads,40)"
"(hive.explain.dependency.append.tasktype,false)"
"(dfs.namenode.retrycache.expirytime.millis,600000)"
"(dfs.namenode.backup.address,0.0.0.0:50100)"
"(hive.hwi.listen.host,0.0.0.0)"
"(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
"(dfs.replication,3)"
"(mapreduce.jobtracker.jobhistory.block.size,3145728)"
"(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
"(mapreduce.task.profile.maps,0-2)"
"(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
"(hive.stats.reliable,false)"
"(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"