Rajesh Balamohan created HIVE-15138:
---------------------------------------

             Summary: String + Integer gets converted to UDFToDouble causing number format exceptions
                 Key: HIVE-15138
                 URL: https://issues.apache.org/jira/browse/HIVE-15138
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
            Priority: Minor


TPCDS Query 72 has {{"d3.d_date > d1.d_date + 5"}}, where d_date contains data like {{2002-02-03, 2001-11-07}}. When running this query, the compiler wraps d_date in UDFToDouble, which causes a large number of {{NumberFormatExceptions}} while trying to convert the string to a double. An example stack trace is given below. Filling in the stack trace for every row can be a considerable performance hit, depending on the amount of data.

{noformat}
"TezTaskRunner" #41340 daemon prio=5 os_prio=0 tid=0x00007f7914745000 nid=0x9725 runnable [0x00007f787ee4a000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Throwable.fillInStackTrace(Native Method)
        at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
        - locked <0x00007f804b125ab0> (a java.lang.NumberFormatException)
        at java.lang.Throwable.<init>(Throwable.java:265)
        at java.lang.Exception.<init>(Exception.java:66)
        at java.lang.RuntimeException.<init>(RuntimeException.java:62)
        at java.lang.IllegalArgumentException.<init>(IllegalArgumentException.java:52)
        at java.lang.NumberFormatException.<init>(NumberFormatException.java:55)
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at org.apache.hadoop.hive.ql.udf.UDFToDouble.evaluate(UDFToDouble.java:172)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:967)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:194)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:121)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterDoubleColGreaterDoubleColumn.evaluate(FilterDoubleColGreaterDoubleColumn.java:51)
        at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:110)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:144)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinGenerateResultOperator.forwardBigTableBatch(VectorMapJoinGenerateResultOperator.java:600)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinInnerLongOperator.process(VectorMapJoinInnerLongOperator.java:386)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinGenerateResultOperator.forwardBigTableBatch(VectorMapJoinGenerateResultOperator.java:600)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinInnerLongOperator.process(VectorMapJoinInnerLongOperator.java:386)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
{noformat}

A simple query that reproduces the issue is given below. It would be helpful if Hive emitted explicit WARN messages so that end users can add explicit casts to avoid such situations.

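To illustrate the failure mode outside Hive, here is a minimal standalone Java sketch. It is not Hive code: the class {{ImplicitCastSketch}} and its methods {{toDoubleOrNull}} and {{addDays}} are hypothetical names made up for this example. It assumes the string-to-double cast swallows the {{NumberFormatException}} and yields NULL, and it contrasts that with a date-aware addition, which is probably what the query author intended by "d_date + 5".

```java
import java.time.LocalDate;

public class ImplicitCastSketch {

    // Hypothetical stand-in for a string-to-double cast. For a value such as
    // "2002-02-03", Double.parseDouble throws NumberFormatException; this
    // sketch assumes the cast then returns null, so the cost of building the
    // exception (including its stack trace) is paid for every such row.
    static Double toDoubleOrNull(String s) {
        try {
            return Double.parseDouble(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Date-aware alternative: parse the ISO yyyy-MM-dd string as a date and
    // add whole days, instead of treating the string as a number.
    static String addDays(String isoDate, int days) {
        return LocalDate.parse(isoDate).plusDays(days).toString();
    }

    public static void main(String[] args) {
        // Sample values taken from the issue description.
        String[] dDates = {"2002-02-03", "2001-11-07"};
        for (String d : dDates) {
            System.out.println(d + " as double: " + toDoubleOrNull(d)
                    + ", plus 5 days as a date: " + addDays(d, 5));
        }
    }
}
```

In the query itself, the analogous step would be an explicit cast of d_date to a date type before doing the arithmetic, so that the compiler has no reason to insert UDFToDouble.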
{noformat}
Latest Hive (master): (Check UDFToDouble for d_date field)
====================
hive> explain select distinct d_date + 5 from date_dim limit 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20161107005816_1cc412bf-c19c-45c4-b468-236e4fc8ae09:8
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: (UDFToDouble(d_date) + 5.0) (type: double)
                    outputColumnNames: _col0
                    Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      keys: _col0 (type: double)
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: double)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: double)
                        Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                        TopN Hash Memory Usage: 0.04
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: double)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 36524 Data size: 41016452 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 10 Data size: 11230 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 10 Data size: 11230 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)