[ https://issues.apache.org/jira/browse/HIVE-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Zhang reassigned HIVE-21935:
------------------------------------

    Assignee: Richard Zhang  (was: Mustafa Iman)

> Hive Vectorization : degraded performance with vectorize UDF
> --------------------------------------------------------------
>
>                 Key: HIVE-21935
>                 URL: https://issues.apache.org/jira/browse/HIVE-21935
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 3.1.1
>         Environment: Hive-3, JDK-8
>            Reporter: Rajkumar Singh
>            Assignee: Richard Zhang
>            Priority: Major
>              Labels: performance
>         Attachments: CustomSplit-1.0-SNAPSHOT.jar
>
>
> With vectorization turned on and hive.vectorized.adaptor.usage.mode=all, we
> were seeing severe performance degradation. Looking at the task jstacks, it
> seems that the task is running the code that vectorizes the UDF and is stuck
> in a loop.
> {code:java}
> jstack -l 14954 | grep 0x3af0 -A20
> "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
>     at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
>     at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
>     at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
>     at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
>     at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>     at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>     at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> [yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20
> "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350)
>     at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205)
>     at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
>     at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
>     at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146)
>     at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
>     at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
>     at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
>     at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>     at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> {code}
> After setting hive.vectorized.adaptor.usage.mode=none, the query completed
> much faster.
> Steps To Reproduce:
> 1. Create Table:
> {code}
> +----------------------------------------------------+
> |                   createtab_stmt                   |
> +----------------------------------------------------+
> | CREATE EXTERNAL TABLE `splittestloc`(              |
> |   `id` int,                                        |
> |   `value` string)                                  |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
> | WITH SERDEPROPERTIES (                             |
> |   'field.delim'=',',                               |
> |   'serialization.format'=',')                      |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.mapred.TextInputFormat'       |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION                                           |
> |   'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc' |
> | TBLPROPERTIES (                                    |
> |   'bucketing_version'='2',                         |
> |   'transient_lastDdlTime'='1561482451')            |
> +----------------------------------------------------+
> {code}
> 2. Sample data: the table has about 40M rows, and the sample data is
> generated using the following script.
> {code}
> for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done
> {code}
> 3. I believe this should be reproducible with Hive's generic split, but I am
> attaching the custom UDF used to split the string.
> 4. Create a function:
> {code}
> add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar;
> create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit';
> {code}
> 5. Run the following query, which reproduces the issue when vectorization
> is turned on:
> {code}
> create temporary table tmp2 as select id,mysplit(value,"#")[2] from splittestloc;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)