Hi,

Thanks for the answer.
I am running a custom build of Spark 1.6.2, meaning the one described in the
Hive documentation, built without the Hive jars, and I set it up in
hive-env.sh. I created the istari table as in the documentation, ran an
INSERT on it and then a GROUP BY, and everything ran correctly on the Spark
standalone cluster, with no exception anywhere.

Do you have any other suggestion?

Thanks.

On Wed, 14 Sep 2016 at 13:55, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> You are using Hive 2. What is the Spark version that runs as the Hive
> execution engine?
>
> I cannot see spark.home in your hive-site.xml, so I cannot figure it out.
>
> BTW you are using Spark standalone as the mode. I tend to use yarn-client.
>
> Now back to the above issue. Do other queries work OK with Hive on Spark?
>
> Some of those perf parameters can be set in the Hive session itself or
> through an init file:
>
> set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
> set spark.master=yarn;
> set spark.deploy.mode=client;
> set spark.executor.memory=8g;
> set spark.driver.memory=8g;
> set spark.executor.instances=6;
> set spark.ui.port=7777;
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 14 September 2016 at 18:28, Benjamin Schaff <benjamin.sch...@gmail.com>
> wrote:
>
>> Hi,
>>
>> After several days of trying to figure out the problem, I am stuck with a
>> class cast exception when running a query with Hive on Spark against ORC
>> tables that I updated with the streaming mutation API of Hive 2.0.
>>
>> The context is the following:
>>
>> For Hive:
>>
>> The version is the latest available from the website, 2.1.
>> I wrote some Scala code to insert data into an ORC table with the
>> streaming mutation API, following the example provided somewhere in the
>> Hive repository.
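(Aside from me, not part of the quoted message: a minimal HiveQL sketch of how
the state of those mutation-API writes can be inspected from a Hive session.
The commands are the standard transactional-table housekeeping ones, and
hc__member is the table from the DDL quoted just below:)

show transactions;
-- lists open and aborted transactions; anything left open by the mutation client shows up here

show compactions;
-- shows whether the delta files written by the mutation API are queued or running for compaction

alter table hc__member compact 'major';
-- optionally forces a major compaction so the deltas are rewritten into a base file before re-testing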
>>
>> The table looks like this:
>>
>> +--------------------------------------------------------------------+--+
>> | createtab_stmt |
>> +--------------------------------------------------------------------+--+
>> | CREATE TABLE `hc__member`( |
>> | `rdv_core__key` bigint, |
>> | `rdv_core__domainkey` string, |
>> | `rdftypes` array<string>, |
>> | `rdv_org__firstname` string, |
>> | `rdv_org__middlename` string, |
>> | `rdv_org__lastname` string, |
>> | `rdv_org__gender` string, |
>> | `rdv_org__city` string, |
>> | `rdv_org__state` string, |
>> | `rdv_org__countrycode` string, |
>> | `rdv_org__addresslabel` string, |
>> | `rdv_org__zip` string) |
>> | CLUSTERED BY ( |
>> | rdv_core__key) |
>> | INTO 24 BUCKETS |
>> | ROW FORMAT SERDE |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>> | STORED AS INPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>> | OUTPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>> | LOCATION |
>> | 'hdfs://hmaster:8020/user/hive/warehouse/hc__member' |
>> | TBLPROPERTIES ( |
>> | 'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', |
>> | 'compactor.mapreduce.map.memory.mb'='2048', |
>> | 'compactorthreshold.hive.compactor.delta.num.threshold'='4', |
>> | 'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5', |
>> | 'numFiles'='0', |
>> | 'numRows'='0', |
>> | 'rawDataSize'='0', |
>> | 'totalSize'='0', |
>> | 'transactional'='true', |
>> | 'transient_lastDdlTime'='1473792939') |
>> +--------------------------------------------------------------------+--+
>>
>> The hive-site.xml looks like this:
>>
>> <configuration>
>>   <property>
>>     <name>hive.execution.engine</name>
>>     <value>spark</value>
>>   </property>
>>   <property>
>>     <name>spark.master</name>
>>     <value>spark://hmaster:7077</value>
>>   </property>
>>   <property>
>>     <name>spark.eventLog.enabled</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>spark.executor.memory</name>
>>     <value>12g</value>
>>   </property>
>>   <property>
>>     <name>spark.serializer</name>
>>     <value>org.apache.spark.serializer.KryoSerializer</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.input.fileinputformat.split.maxsize</name>
>>     <value>750000000</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.execution.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.cbo.enable</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.reducededuplication.min.reducer</name>
>>     <value>4</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.reducededuplication</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.orc.splits.include.file.footer</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.mapfiles</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.sparkfiles</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.smallfiles.avgsize</name>
>>     <value>16000000</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.size.per.task</name>
>>     <value>256000000</value>
>>   </property>
>>   <property>
>>     <name>hive.merge.orcfile.stripe.level</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join.noconditionaltask</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.auto.convert.join.noconditionaltask.size</name>
>>     <value>894435328</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.bucketmapjoin.sortedmerge</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.map.aggr.hash.percentmemory</name>
>>     <value>0.5</value>
>>   </property>
>>   <property>
>>     <name>hive.map.aggr</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.sort.dynamic.partition</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.stats.autogather</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.stats.fetch.column.stats</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.execution.reduce.enabled</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.groupby.checkinterval</name>
>>     <value>4096</value>
>>   </property>
>>   <property>
>>     <name>hive.vectorized.groupby.flush.percent</name>
>>     <value>0.1</value>
>>   </property>
>>   <property>
>>     <name>hive.compute.query.using.stats</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.limit.pushdown.memory.usage</name>
>>     <value>0.4</value>
>>   </property>
>>   <property>
>>     <name>hive.optimize.index.filter</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.reducers.bytes.per.reducer</name>
>>     <value>67108864</value>
>>   </property>
>>   <property>
>>     <name>hive.smbjoin.cache.rows</name>
>>     <value>10000</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.orc.default.stripe.size</name>
>>     <value>67108864</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.conversion</name>
>>     <value>more</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.conversion.threshold</name>
>>     <value>1073741824</value>
>>   </property>
>>   <property>
>>     <name>hive.fetch.task.aggr</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
>>     <value>5</value>
>>   </property>
>>   <property>
>>     <name>spark.kryo.referenceTracking</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>spark.kryo.classesToRegister</name>
>>     <value>org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch</value>
>>   </property>
>>   <property>
>>     <name>hadoop.proxyuser.hive.groups</name>
>>     <value>*</value>
>>   </property>
>>   <property>
>>     <name>hadoop.proxyuser.hive.hosts</name>
>>     <value>*</value>
>>   </property>
>>   <property>
>>     <name>hive.server2.enable.doAs</name>
>>     <value>false</value>
>>   </property>
>>   <property>
>>     <name>hive.server2.authentication</name>
>>     <value>NONE</value>
>>   </property>
>>   <property>
>>     <name>hive.support.concurrency</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.exec.dynamic.partition.mode</name>
>>     <value>nonstrict</value>
>>   </property>
>>   <property>
>>     <name>hive.txn.manager</name>
>>     <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
>>   </property>
>>   <property>
>>     <name>hive.compactor.initiator.on</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>hive.compactor.worker.threads</name>
>>     <value>4</value>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionURL</name>
>>     <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
>>     <description>metadata is stored in a MySQL server</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionDriverName</name>
>>     <value>com.mysql.jdbc.Driver</value>
>>     <description>MySQL JDBC driver class</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionUserName</name>
>>     <value>hadoop</value>
>>     <description>user name for connecting to mysql server</description>
>>   </property>
>>   <property>
>>     <name>javax.jdo.option.ConnectionPassword</name>
>>     <value></value>
>>     <description>password for connecting to mysql server</description>
>>   </property>
>>   <property>
>>     <name>hive.metastore.uris</name>
>>     <value>thrift://localhost:9083</value>
>>   </property>
>>   <property>
>>     <name>hive.root.logger</name>
>>     <value>WARN,RFA</value>
>>   </property>
>> </configuration>
>>
>> Whenever I run a query involving Spark I get the following error:
>>
>> java.io.IOException: java.io.IOException: error iterating
>>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>>   at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>>   at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>>   at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
>>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>   at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>>   at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:93)
>>   at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
>>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.IOException: error iterating
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:92)
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:42)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>>   ... 18 more
>> Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
>>   at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:311)
>>   at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.acidAddRowToBatch(VectorizedBatchUtil.java:291)
>>   at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:82)
>>   ... 20 more
>>
>> What I mean by "involving Spark" is this: a SELECT * that does not go
>> through the Spark backend shows the data in the table just fine, but as
>> soon as a query involves a GROUP BY, for instance, or even just extracts a
>> column value, it runs through Spark and I get this exception.
>>
>> I also tried the streaming API, with the same problem; I tried both a
>> custom writer and a JSON writer.
>> I built the Spark distribution myself, removing the Hive-related
>> dependencies, so I don't think the problem comes from there.
>>
>> Do you have any recommendations on how I can proceed to find the root
>> cause of this problem?
>>
>> Thanks in advance.
>>
>> PS: I made the mistake of posting on the dev mailing list earlier; please
>> ignore it and sorry for the double post.
>>
>> Regards,
>> Benjamin Schaff
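One more diagnostic I plan to try on my side (a sketch only, my own assumption
rather than something suggested above): the cast happens in
VectorizedOrcAcidRowReader and the table has an array<string> column
(rdftypes), so re-running the failing query with vectorized execution
disabled for the session should show whether the vectorized ACID reader is
the trigger. hive.vectorized.execution.enabled is the same property that is
set to true in the quoted hive-site.xml; the GROUP BY below is just a
hypothetical example over columns from the hc__member DDL.

set hive.vectorized.execution.enabled=false;
-- session-level override of the value in hive-site.xml

select rdv_org__state, count(*)
from hc__member
group by rdv_org__state;
-- hypothetical aggregation of the failing kind, to force the Spark path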