Hi,

Thanks for the answer.
I am running a custom build of Spark 1.6.2, meaning the one described in the
Hive documentation, built without the Hive jars, and I set it up in
hive-env.sh. I created the istari table as in the documentation, ran an
INSERT on it and then a GROUP BY, and everything ran correctly on the Spark
standalone cluster, with no exception anywhere.

Do you have any other suggestion?

Thanks.

On Wed, 14 Sep 2016 at 13:55, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> You are using Hive 2. What is the Spark version that runs as the Hive
> execution engine?
>
> I cannot see spark.home in your hive-site.xml, so I cannot figure it out.
>
> BTW you are using Spark standalone as the mode. I tend to use yarn-client.
>
> Now back to the above issue. Do other queries work OK with Hive on Spark?
>
> Some of those perf parameters can be set in the Hive session itself or
> through an init file:
>
> set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
> set spark.master=yarn;
> set spark.deploy.mode=client;
> set spark.executor.memory=8g;
> set spark.driver.memory=8g;
> set spark.executor.instances=6;
> set spark.ui.port=7777;
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 14 September 2016 at 18:28, Benjamin Schaff <benjamin.sch...@gmail.com>
> wrote:
>
>> Hi,
>>
>> After several days of trying to figure out the problem, I am stuck with a
>> class cast exception when running a query with Hive on Spark against ORC
>> tables that I updated with the streaming mutation API of Hive 2.0.
>>
>> The context is the following:
>>
>> For Hive:
>>
>> The version is the latest available from the website, 2.1.
>> I wrote some Scala code to insert data into an ORC table with the
>> streaming mutation API, following the example provided somewhere in the
>> Hive repository.
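(Aside from me, not part of the quoted message: a minimal HiveQL sketch of how
the state of those mutation-API writes can be inspected from a Hive session.
The commands are the standard transactional-table housekeeping ones, and
hc__member is the table from the DDL quoted just below:)

show transactions;
-- lists open and aborted transactions; anything left open by the mutation client shows up here

show compactions;
-- shows whether the delta files written by the mutation API are queued or running for compaction

alter table hc__member compact 'major';
-- optionally forces a major compaction so the deltas are rewritten into a base file before re-testing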
>>
>> The table looks like this:
>>
>> +--------------------------------------------------------------------+--+
>> | createtab_stmt |
>> +--------------------------------------------------------------------+--+
>> | CREATE TABLE `hc__member`( |
>> | `rdv_core__key` bigint, |
>> | `rdv_core__domainkey` string, |
>> | `rdftypes` array<string>, |
>> | `rdv_org__firstname` string, |
>> | `rdv_org__middlename` string, |
>> | `rdv_org__lastname` string, |
>> | `rdv_org__gender` string, |
>> | `rdv_org__city` string, |
>> | `rdv_org__state` string, |
>> | `rdv_org__countrycode` string, |
>> | `rdv_org__addresslabel` string, |
>> | `rdv_org__zip` string) |
>> | CLUSTERED BY ( |
>> | rdv_core__key) |
>> | INTO 24 BUCKETS |
>> | ROW FORMAT SERDE |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>> | STORED AS INPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>> | OUTPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>> | LOCATION |
>> | 'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
>> | TBLPROPERTIES ( |
>> | 'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', |
>> | 'compactor.mapreduce.map.memory.mb'='2048', |
>> | 'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
>> | 'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5', |
>> | 'numFiles'='0', |
>> | 'numRows'='0', |
>> | 'rawDataSize'='0', |
>> | 'totalSize'='0', |
>> | 'transactional'='true', |
>> | 'transient_lastDdlTime'='1473792939') |
>> +--------------------------------------------------------------------+--+
>>
>> The hive-site.xml looks like this:
>>
>> <configuration>
>>   <property>
>>     <name>hive.execution.engine</name>
>>     <value>spark</value>
>>   </property>
>>   <property>
>>     <name>spark.master</name>
>>     <value>spark://hmaster:7077</value>
>>   </property>
>>   <property>
>>     <name>spark.eventLog.enabled</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>spark.executor.memory</name>
>>     <value>12g</value>
>>   </property>
>>   <property>
>>     <name>spark.serializer</name>
>>     <value>org.apache.spark.serializer.KryoSerializer</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.input.fileinputformat.split.maxsize</name>
>>     <value>750000000</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.execution.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.cbo.enable</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.reducededuplication.min.reducer</name>
>>     <value>4</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.reducededuplication</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.orc.splits.include.file.footer</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.mapfiles</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.sparkfiles</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.smallfiles.avgsize</name>
>>     <value>16000000</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.size.per.task</name>
>>     <value>256000000</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.orcfile.stripe.level</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join.noconditionaltask</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join.noconditionaltask.size</name>
>>     <value>894435328</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.bucketmapjoin.sortedmerge</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.map.aggr.hash.percentmemory</name>
>>     <value>0.5</value>
>>   </property>
>>   <property>
>>     <name>hive.map.aggr</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.sort.dynamic.partition</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.stats.autogather</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.stats.fetch.column.stats</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.execution.reduce.enabled</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.groupby.checkinterval</name>
>>     <value>4096</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.groupby.flush.percent</name>
>>     <value>0.1</value>
>>   </property>
>>   <property>
>>     <name>hive.compute.query.using.stats</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.limit.pushdown.memory.usage</name>
>>     <value>0.4</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.index.filter</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.reducers.bytes.per.reducer</name>
>>     <value>67108864</value>
>>   </property>
>>   <property>
>>     <name>hive.smbjoin.cache.rows</name>
>>     <value>10000</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.orc.default.stripe.size</name>
>>     <value>67108864</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.conversion</name>
>>     <value>more</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.conversion.threshold</name>
>>     <value>1073741824</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.aggr</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
>>     <value>5</value>
>>   </property>
>>   <property>
>>     <name>spark.kryo.referenceTracking</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>spark.kryo.classesToRegister</name>
>>     <value>org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch</value>
>>   </property>
>>   <property>
>>     <name>hadoop.proxyuser.hive.groups</name>
>>     <value>*</value>
>>   </property>
>>   <property>
>>     <name>hadoop.proxyuser.hive.hosts</name>
>>     <value>*</value>
>>   </property>
>>   <property>
>>     <name>hive.server2.enable.doAs</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.server2.authentication</name>
>>     <value>NONE</value>
>>   </property>
>>   <property>
>>     <name>hive.support.concurrency</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.dynamic.partition.mode</name>
>>     <value>nonstrict</value>
>>   </property>
>>   <property>
>>     <name>hive.txn.manager</name>
>>     <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
>>   </property>
>>   <property>
>>     <name>hive.compactor.initiator.on</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.compactor.worker.threads</name>
>>     <value>4</value>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionURL</name>
>>     <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
>>     <description>metadata is stored in a MySQL server</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionDriverName</name>
>>     <value>com.mysql.jdbc.Driver</value>
>>     <description>MySQL JDBC driver class</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionUserName</name>
>>     <value>hadoop</value>
>>     <description>user name for connecting to mysql server</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionPassword</name>
>>     <value></value>
>>     <description>password for connecting to mysql server</description>
>>   </property>
>>   <property>
>>     <name>hive.metastore.uris</name>
>>     <value>thrift://localhost:9083</value>
>>   </property>
>>   <property>
>>     <name>hive.root.logger</name>
>>     <value>WARN,RFA</value>
>>   </property>
>> </configuration>
>>
>> Whenever I run a query involving Spark I get the following error:
>>
>> java.io.IOException: java.io.IOException: error iterating
>>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>>   at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>>   at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
>>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>   at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>>   at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:93)
>>   at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
>>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.IOException: error iterating
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:92)
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:42)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>>   ... 18 more
>> Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
>>   at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:311)
>>   at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.acidAddRowToBatch(VectorizedBatchUtil.java:291)
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:82)
>>   ... 20 more
>>
>> What I mean by "involving Spark" is this: a SELECT * that does not go
>> through the Spark backend shows the data in the table just fine, but as
>> soon as a query involves a GROUP BY, for instance, or even just extracts a
>> column value, it runs through Spark and I get this exception.
>>
>> I also tried the streaming API, with the same problem; I tried both a
>> custom writer and a JSON writer.
>> I built the Spark distribution myself, removing the Hive-related
>> dependencies, so I don't think the problem comes from there.
>>
>> Do you have any recommendations on how I can proceed to find the root
>> cause of this problem?
>>
>> Thanks in advance.
>>
>> PS: I made the mistake of posting on the dev mailing list earlier; please
>> ignore it and sorry for the double post.
>>
>> Regards,
>> Benjamin Schaff
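One more diagnostic I plan to try on my side (a sketch only, my own assumption
rather than something suggested above): the cast happens in
VectorizedOrcAcidRowReader and the table has an array<string> column
(rdftypes), so re-running the failing query with vectorized execution
disabled for the session should show whether the vectorized ACID reader is
the trigger. hive.vectorized.execution.enabled is the same property that is
set to true in the quoted hive-site.xml; the GROUP BY below is just a
hypothetical example over columns from the hc__member DDL.

set hive.vectorized.execution.enabled=false;
-- session-level override of the value in hive-site.xml

select rdv_org__state, count(*)
from hc__member
group by rdv_org__state;
-- hypothetical aggregation of the failing kind, to force the Spark path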