Re: Hive double-precision question

Mark Grover Fri, 07 Dec 2012 14:02:29 -0800

Periya:
If you want to see what the built in Hive UDFs are doing, the code is here:
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
and
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf


You can find out which UDF name maps to what class by looking at
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

If my memory serves me right, there was some "interesting" stuff Hive does
when mapping Java types to Hive datatypes. I am not sure how relevant it is
to this discussion but I will have to look further to comment more.

In the meanwhile take a look at the UDF code and see if your personal Java
code on Linux is equivalent to the Hive UDF code.

Keep us posted!
Mark

On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data <periya.d...@gmail.com> wrote:

> Hi Hive Users,
>     I recently noticed an interesting behavior with Hive and I am unable
> to find the reason for it. Your insights into this is much appreciated.
>
> I am trying to compute the distance between two zip codes. I have the
> distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF
> and using Hive's built-in functions. There are some discrepancies from the
> 3rd decimal place when I see the output got from using Hive UDF and Hive's
> built-in functions. Here is an example:
>
> zip1          zip 2          Hadoop Built-in function
> SAS                      R                                       Linux +
> Java
> 00501   11720   4.49493083698542000 4.49508858 4.49508858054005
> 4.49508857976933000
> The formula used to compute distance is this (UDF):
>
>         double long1 = Math.atan(1)/45 * ux;
>         double lat1 = Math.atan(1)/45 * uy;
>         double long2 = Math.atan(1)/45 * mx;
>         double lat2 = Math.atan(1)/45 * my;
>
>         double X1 = long1;
>         double Y1 = lat1;
>         double X2 = long2;
>         double Y2 = lat2;
>
>         double distance = 3949.99 * Math.acos(Math.sin(Y1) *
>                 Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 -
> X2));
>
>
> The one used using built-in functions (same as above):
> 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
>         sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
>         cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
>         (atan(1)/45) - m_x_coord * (atan(1)/45)) )
>
>
>
>
> - The Hive's built-in functions used are acos, sin, cos and atan.
> - for another try, I used Hive UDF, with Java's math library (Math.acos,
> Math.atan etc)
> - All variables used are double.
>
> I expected the value from Hadoop UDF (and Built-in functions) to be
> identical with that got from plain Java code in Linux. But they are not.
> The built-in function (as well as UDF) gives 49493083698542000 whereas
> simple Java program running in Linux gives 49508857976933000. The linux
> machine is similar to the Hadoop cluster machines.
>
> Linux version - Red Hat 5.5
> Java - latest.
> Hive - 0.7.1
> Hadoop - 0.20.2
>
> This discrepancy is very consistent across thousands of zip-code
> distances. It is not a one-off occurrence. In some cases, I see the
> difference from the 4th decimal place. Some more examples:
>
> zip1          zip 2          Hadoop Built-in function
> SAS                      R                                       Linux +
> Java
>    00602   00617   42.79095253903410000 42.79072812 42.79072812185650
> 42.79072812185640000  00603   00617   40.24044016655180000 40.2402289
> 40.24022889740920 40.24022889740910000  00605   00617
> 40.19191761288380000 40.19186416 40.19186415807060 40.19186415807060000
> I have not tested the individual sin, cos, atan function returns. That
> will be my next test. But, at the very least, why is there a difference in
> the values between Hadoop's UDF/built-ins and that from Linux + Java?  I am
> assuming that Hive's built-in mathematical functions are nothing but the
> underlying Java functions.
>
> Thanks,
> PD.
>
>

Re: Hive double-precision question

Reply via email to