Hello everybody,

I did some further research and now I am sharing my findings. I am sorry,
it is going to be quite a long e-mail, but I'd really appreciate some
feedback when you have time to read it.

Spark's current implementation of arithmetic operations on decimals was
"copied" from Hive. Thus, the initial goal of the implementation was to be
compliant with Hive, which itself aims to reproduce SQLServer's behavior.
Therefore I compared these three DBs and, of course, I checked the SQL ANSI
standard 2011 (you can find it at
http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
and a late draft of the 2003 standard (
http://www.wiscorp.com/sql_2003_standard.zip). There are three main topics:

   1. how to determine the precision and scale of a result;
   2. how to behave when the result is a number which is not exactly
   representable with the result's precision and scale (i.e. it requires a
   precision loss);
   3. how to behave when the result is out of the range of the values
   representable with the result's precision and scale (i.e. it is bigger
   than the biggest representable number or smaller than the smallest one).

Currently, Spark behaves as follows:

   1. It follows some rules taken from the initial Hive implementation;
   2. it returns NULL;
   3. it returns NULL (a small example follows this list).
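
To make the current behavior concrete, here is a tiny example (easy to try
in spark-shell; the exact values are only for illustration). Under the
current rules, the result type of decimal(38,18) * decimal(38,18) is
decimal(38,36), so even a trivial product like 10 * 10 cannot be
represented and, as far as I have seen, comes back as NULL:

// spark-shell: the result type is decimal(38, 36) under the current rules,
// so the value 100 does not fit and Spark returns a NULL row instead of
// rounding the result or raising an error.
spark.sql(
  "SELECT CAST(10 AS DECIMAL(38,18)) * CAST(10 AS DECIMAL(38,18)) AS prod"
).show()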


The SQL ANSI standard is pretty clear about points 2 and 3, while it says
almost nothing about point 1. Citing SQL ANSI:2011, page 27:

> If the result cannot be represented exactly in the result type, then
> whether it is rounded or truncated is implementation-defined. An exception
> condition is raised if the result is outside the range of numeric values
> of the result type, or if the arithmetic operation is not defined for the
> operands.


As you can see, Spark respects the SQL standard for neither point 2 nor
point 3. Someone might then argue that we need compatibility with Hive, so
let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's behavior is:

   1. The rules have changed a bit, to reflect the SQLServer implementation
   as described in this blog post (

https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/
   ), and there is a small sketch of them after this list;
   2. It rounds the result;
   3. It returns NULL (HIVE-18291 is open to make it compliant with the SQL
   ANSI standard and throw an exception).
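
For reference, this is my reading of those SQLServer-style rules as a small
standalone sketch (not the actual Hive or Spark code; the names are mine,
while the maximum precision of 38 and the minimum of 6 digits preserved for
the scale come from the blog post and HIVE-15331):

object DecimalRulesSketch {
  val MaxPrecision = 38
  val MinScale = 6

  // When the unbounded precision exceeds 38, digits are removed from the
  // scale first, but never below min(scale, 6) fractional digits.
  def adjust(precision: Int, scale: Int): (Int, Int) =
    if (precision <= MaxPrecision) {
      (precision, scale)
    } else {
      val intDigits = precision - scale
      val adjustedScale =
        math.max(MaxPrecision - intDigits, math.min(scale, MinScale))
      (MaxPrecision, adjustedScale)
    }

  // a * b: precision = p1 + p2 + 1, scale = s1 + s2
  def multiply(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
    adjust(p1 + p2 + 1, s1 + s2)

  // a / b: scale = max(6, s1 + p2 + 1), precision = p1 - s1 + s2 + scale
  def divide(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(MinScale, s1 + p2 + 1)
    adjust(p1 - s1 + s2 + scale, scale)
  }

  // a + b: scale = max(s1, s2), precision = max(p1 - s1, p2 - s2) + scale + 1
  def add(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = math.max(s1, s2)
    adjust(math.max(p1 - s1, p2 - s2) + scale + 1, scale)
  }
}

With these rules, for instance, decimal(38,18) * decimal(38,18) should
become decimal(38,6) instead of decimal(38,36), so the 10 * 10 example
above would return 100.000000 (rounded) rather than NULL.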

As far as the other DBs are concerned, there is little to say about Oracle
and Postgres, since they have nearly infinite precision, which also makes
it hard to test the behavior in these conditions; but SQLServer has the
same maximum precision as Hive and Spark. This is SQLServer's behavior:

   1. The rules should be the same as Hive's, as described in their post
   (tests of the behavior confirm this);
   2. It rounds the result;
   3. It throws an Exception.

Therefore, since I think that Spark should be compliant with the SQL ANSI
standard (first) and Hive, I propose the following changes:

   1. Update the rules for deriving the result type so that they reflect
   Hive's new ones (which are SQLServer's);
   2. Change Spark's behavior to round the result, as done by Hive and
   SQLServer and as prescribed by the SQL standard;
   3. Change Spark's behavior, introducing a configuration parameter to
   determine whether to return NULL or throw an exception (by default I
   propose to throw an exception in order to be compliant with the SQL
   standard, which IMHO is more important than being compliant with Hive);
   a rough sketch of what I mean follows this list.
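
Just to make point 3 concrete, this is roughly the shape I have in mind for
the rounding/overflow path; the configuration flag and the helper below are
purely hypothetical (no such parameter exists today), they only illustrate
the switch between returning NULL and throwing:

import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// "nullOnOverflow" would be driven by the proposed configuration parameter
// (name still to be decided); everything here is an illustrative sketch.
def toPrecision(value: JBigDecimal, precision: Int, scale: Int,
    nullOnOverflow: Boolean): JBigDecimal = {
  // Point 2: round (instead of returning NULL) when fractional digits have
  // to be dropped to fit the target scale.
  val rounded = value.setScale(scale, RoundingMode.HALF_UP)
  if (rounded.precision() > precision) {
    // Point 3: the value is out of range for DECIMAL(precision, scale).
    if (nullOnOverflow) null
    else throw new ArithmeticException(
      s"$value does not fit in DECIMAL($precision, $scale)")
  } else {
    rounded
  }
}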

For 1 and 2, I prepared a PR, which is
https://github.com/apache/spark/pull/20023. For 3, I'd love to get your
feedback so that we can agree on what to do, and then I will prepare a PR
reflecting what the community decides here.
I would really appreciate your feedback, either here or on the PR.

Thanks for your patience and your time reading this long email,
Best regards.
Marco


2017-12-13 9:08 GMT+01:00 Reynold Xin <r...@databricks.com>:

> Responses inline
>
> On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido <marcogaid...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I saw in these weeks that there are a lot of problems related to decimal
>> values (SPARK-22036, SPARK-22755, for instance). Some are related to
>> historical choices, which I don't know, thus please excuse me if I am
>> saying dumb things:
>>
>>  - why are we interpreting literal constants in queries as Decimal and
>> not as Double? I think it is very unlikely that a user can enter a number
>> which is beyond Double precision.
>>
>
> Probably just to be consistent with some popular databases.
>
>
>
>>  - why are we returning null in case of precision loss? Is this approach
>> better than just giving a result which might lose some accuracy?
>>
>
> The contract with decimal is that it should never lose precision (it is
> created for financial reports, accounting, etc). Returning null is at least
> telling the user the data type can no longer support the precision required.
>
>
>
>>
>> Thanks,
>> Marco
>>
>
>
