[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shubham Chaurasia updated HIVE-25449: ------------------------------------- Description: Repro (thanks Qiaosong Dong) - Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +------+ | _c0 | +------+ | 6 | | 7 | +------+ {code} Expected output - {code} +------+ | _c0 | +------+ | 5 | | 6 | +------+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} was: Repro (thanks Qiaosong Dong) - 1. Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +------+ | _c0 | +------+ | 6 | | 7 | +------+ {code} Expected output - {code} +------+ | _c0 | +------+ | 5 | | 6 | +------+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > ------------------------------------------------------------------------------------------- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF > Reporter: Shubham Chaurasia > Assignee: Shubham Chaurasia > Priority: Major > > Repro (thanks Qiaosong Dong) - > Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +------+ > | _c0 | > +------+ > | 6 | > | 7 | > +------+ > {code} > Expected output - > {code} > +------+ > | _c0 | > +------+ > | 5 | > | 6 | > +------+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. > *Solution* > We need to check other variants of datediff UDFs as well and change the > parsing mechanism to always interpret date strings in UTC. > > I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. > {code} > - date.setTime(formatter.parse(new String(bytesValue, > "UTF-8")).getTime()); > - baseDate = DateWritableV2.dateToDays(date); > + org.apache.hadoop.hive.common.type.Date hiveDate > + = org.apache.hadoop.hive.common.type.Date.valueOf(new > String(bytesValue, "UTF-8")); > + date.setTime(hiveDate.toEpochMilli()); > + baseDate = hiveDate.toEpochDay(); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)