Re: QUESTION: SUBSTRING implementation

Julian Hyde Fri, 06 Sep 2024 15:24:45 -0700

We intentionally separate validation from sql-to-rel. That has many benefits 
(such as discouraging us from mutating the AST while validating, or thinking 
about RelNodes and RexNodes), but one problem is that we end up doing some 
things twice.


OperandTypeChecker (or something like it) needs to be able to say ‘it’s ok to 
pass a decimal(5, 2) value to an integer argument, because we can do an 
implicit conversion’. Validation will succeed, and at sql-to-rel time we should 
recover (or regenerate) the verdict that an implicit conversion is needed.

Do we need to make a note of which conversion? I don’t know. Maybe it’s enough 
to know that the argument and parameter type are different, because maybe for a 
given source-target type pair only one conversion is possible.

I have noticed a profusion of overloaded methods in SqlFunctions.java (e.g. log 
has 4 overloads - one for each  combination of double and BigDecimal, and the 
BigDecimal variants ultimately just convert the BigDecimal to a double) and 
maybe we can start to reduce that profusion.
 
Julian


> On Sep 6, 2024, at 1:11 PM, Mihai Budiu <mbu...@gmail.com> wrote:
> 
> The problem is that I don't know how to fix this. Would be glad to do the 
> work if someone helped design a way to do it.
> 
> Turns out that the OperandTypeChecker can mutate the RexNodes it receives as 
> arguments, but I am not sure that this is "safe", because it looks like the 
> API has been designed with immutability in mind. For example, these 
> combinators that can combine multiple type-checkers seem to assume 
> immutability.
> 
> Mihai
> ________________________________
> From: Julian Hyde <jhyde.apa...@gmail.com>
> Sent: Friday, September 6, 2024 12:59 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: Re: QUESTION: SUBSTRING implementation
> 
> I totally agree.
> 
> Julian
> 
>> On Sep 6, 2024, at 12:57, Mihai Budiu <mbu...@gmail.com> wrote:
>> 
>> A problem with the way Calcite handles such functions is that the 
>> type-checker and the runtime are separate and do not communicate with each 
>> other. In the type checking code that people usually write for a function 
>> they write an OperandTypeChecker, which makes some assumptions about the 
>> range of implicit casts that the function allows. For example, for SUBSTRING 
>> such a type checker would allow any numeric types, or even string types, 
>> with the assmption that they would be converted to integer. As a result of 
>> the type-checking, the function is accepted, but the implicit casts that 
>> were inferred are not inserted as explicit casts to the arguments of the 
>> RexCall.
>> 
>> So when the runtime has to implement the function it has no idea about the 
>> kinds of arguments that may be expected, and what it's supposed to do with 
>> them. The runtime has to handle many more cases, e.g,, INTEGER, TINYINT, 
>> DECIMAL, STRING. This makes it difficult to keep the runtime in sync with 
>> the type-checker as people implement new versions of the many functions from 
>> different dialects. Moreover, the runtime may incorrectly convert such 
>> values. For example, if you call decimal.intValue() you are not checking for 
>> overflow and not rounding in the same way a CAST(decimal AS INTEGER) may do.
>> 
>> Ideally the type-checker for functions would be allowed to insert implicit 
>> casts, this would simplify a lot the task of both generating code in the 
>> runtime, and of implementing the runtime itself.
>> 
>> For substring the type-checker could insert a cast from DECIMAL to INTEGER. 
>> The runtime now only has to worry about INTEGER values.
>> 
>> Mihai
>> ________________________________
>> From: Julian Hyde <jhyde.apa...@gmail.com>
>> Sent: Friday, September 6, 2024 12:44 PM
>> To: dev@calcite.apache.org <dev@calcite.apache.org>
>> Subject: Re: QUESTION: SUBSTRING implementation
>> 
>> The trouble is, we are going beyond the standard. The standard says
>> 
>> The declared type of a <start position>, <string length>,
>> <regex occurrence>, or <regex capture group>
>> shall be exact numeric with scale 0 (zero).
>> 
>> People often misunderstand the “shall” in standards. It is talking about the 
>> SQL query, not the implementation. A standards-compliant query will only 
>> have integer values of <start position>, and compliant implementation (such 
>> as Calcite) must handle such queries.
>> 
>> The standard doesn’t say anything about what should happen if the query is 
>> not standards-compliant, viz <start position> is not an integer. In 
>> particular, it doesn’t say we should fail when we receive such queries. But 
>> we can do anything, like return -77 or rounding the integer up, or rounding 
>> it down, and still be compliant.
>> 
>> What we have decided to do in Calcite is to apply our usual rules for 
>> converting decimal arguments to integers. I am saying let’s define those 
>> ‘usual rules’, and apply it to all functions, whether they are in the SQL 
>> standard, in a library (such as Postgres or BigQuery) or user-defined 
>> functions.
>> 
>> 
>> Julian
>> 
>> 
>> 
>>> On Sep 6, 2024, at 1:02 AM, stanilovsky evgeny <estanilovs...@gridgain.com> 
>>> wrote:
>>> 
>>> I understand you are talking about, but ... LEFT is just extension under 
>>> standard functions collection and DB implementors are free here, they can 
>>> write restrictions in documentation [1] or treat fractional types as 
>>> integer implicitly. I don`t think we need to change (LEFT, RIGHT and so on 
>>> string functions), they already restricted in documentation, while standard 
>>> (substring, overlay) need to be improved.
>>> 
>>> [1] 
>>> https://learn.microsoft.com/en-us/sql/t-sql/functions/left-transact-sql?view=sql-server-ver16
>>> 
>>>> On Thu, 05 Sep 2024 21:07:27 +0300, Julian Hyde <jh...@apache.org> wrote:
>>>> 
>>>> Thanks for taking the time to look up the SQL standard. But
>>>> unfortunately the SQL standard isn't much use here, because it doesn't
>>>> say much about implicit conversions - it leaves that to the
>>>> implementor, such as Calcite or Postgres.
>>>> 
>>>> Calcite has its own rules for implicit conversions of arguments. It
>>>> also has lots of functions that take numeric and integer arguments.
>>>> Calcite's users expect that the conversion rules are consistent across
>>>> all functions. Picking one at random, LEFT(string, length) is enabled
>>>> in several libraries including Postgres. You should implement the
>>>> conversion for SUBSTRING in such a way that LEFT gets it for free.
>>>> 
>>>> On Thu, Sep 5, 2024 at 7:20 AM stanilovsky evgeny
>>>> <estanilovs...@gridgain.com> wrote:
>>>>> 
>>>>> Thank all for reply, in section : 6.18 <string value function>
>>>>> I see only OVERLAY function that not corresponds here too:
>>>>> 
>>>>> <character overlay function> ::=
>>>>> OVERLAY <left paren> <character value expression> PLACING <character value
>>>>> expression>
>>>>> FROM <start position> [ FOR <string length> ]
>>>>> [ USING <char length units> ] <right paren>
>>>>> 
>>>>> <start position> ::= <numeric value expression>
>>>>> <string length> ::= <numeric value expression>
>>>>> 
>>>>> or Julian you are talking not only about this kind of functions ?
>>>>> In such a case it consumes a lot of time for such a check.
>>>>> 
>>>>> Why we can`t move here sequentially ? Just fix operand types ? What
>>>>> generic approach you are talking about ?
>>>>> SqlSingleOperandTypeChecker STRING_INTEGER is used only in : SqlFunction
>>>>> REPEAT
>>>>> SqlSingleOperandTypeChecker STRING_INTEGER_INTEGER only in
>>>>> SqlSubstringFunction
>>>>> STRING_STRING_INTEGER, STRING_STRING_INTEGER_INTEGER only in
>>>>> SqlOverlayFunction
>>>>> 
>>>>> 
>>>>>> Is the desired behavior specific to the SUBSTRING function? Or should it
>>>>>> be generic, for all functions that have an argument of type INTEGER?
>>>>>> 
>>>>>> If it’s generic, can you give some other functions as examples where we
>>>>>> would want this behavior.
>>>>>> 
>>>>>> Also, if it’s generic, the code should probably not be part of the
>>>>>> SUBSTRING function.
>>>>>> 
>>>>>> 
>>>>>>> On Sep 4, 2024, at 8:31 AM, Norman Jordan
>>>>>>> <norman.jor...@improving.com.INVALID> wrote:
>>>>>>> 
>>>>>>> This makes sense. Running a quick test with MySQL, I can see that
>>>>>>> decimal values do not give an error. It appears that MySQL will round a
>>>>>>> decimal value to the nearest integer value.
>>>>>>> ________________________________
>>>>>>> From: stanilovsky evgeny <estanilovs...@gridgain.com>
>>>>>>> Sent: Wednesday, September 4, 2024 5:17 AM
>>>>>>> To: dev@calcite.apache.org <dev@calcite.apache.org>
>>>>>>> Subject: QUESTION: SUBSTRING implementation
>>>>>>> 
>>>>>>> Hi all, i want to discuss current SUBSTRING func implementation.
>>>>>>> 
>>>>>>> Lets take a standard and found:
>>>>>>> <character substring function> ::=
>>>>>>> SUBSTRING <left paren> <character value expression> FROM <start
>>>>>>> position>
>>>>>>> [ FOR <string length> ] <right paren>
>>>>>>> 
>>>>>>> and further : <start position> ::= <numeric value expression>
>>>>>>> 
>>>>>>> thus it not restrict <start position> for only integer types
>>>>>>> 
>>>>>>> Calcite documentation says:
>>>>>>> SUBSTRING(string FROM integer FOR integer) (we see restrictions here)
>>>>>>> 
>>>>>>> Lets dig deeper:
>>>>>>> Calcite implementation operands checker not restrict operands too :
>>>>>>> 1. OperandTypes.STRING_NUMERIC - (1 param: substring ('asd', 2)) (not
>>>>>>> restricted params)
>>>>>>> 2. OperandTypes.STRING_INTEGER_INTEGER - (2 params: substring ('asd', 2,
>>>>>>> 3)) (only integer)
>>>>>>> 
>>>>>>> So if i call "SELECT SUBSTRING('asd', 1.2)" runtime exception will
>>>>>>> occur:
>>>>>>> java.lang.RuntimeException: while resolving method 'substring[class
>>>>>>> java.lang.String, class java.math.BigDecimal]' in class class
>>>>>>> org.apache.calcite.runtime.SqlFunctions
>>>>>>>>            at
>>>>>>>> org.apache.calcite.adapter.enumerable.EnumUtils.call(EnumUtils.java:770)
>>>>>>>>            at
>>>>>>>> org.apache.calcite.adapter.enumerable.RexImpTable$MethodImplementor.call(RexImpTable.java:2866)
>>>>>>>>            at
>>>>>>>> org.apache.calcite.adapter.enumerable.RexImpTable$MethodImplementor.implementSafe(RexImpTable.java:2847)
>>>>>>> 
>>>>>>> So i appeal to align (1 and 2 operands checker implementation, so for 2
>>>>>>> operands it need: STRING_NUMERIC_NUMERIC) and append appropriate
>>>>>>> implementation (with will cut off fractional numeric part) into
>>>>>>> SqlFunctions.
>>>>>>> 
>>>>>>> wdyt ? if there will be no objections i will fill an issue.
>>>>>>> 
>>>>>>> thanks !
>>>>>>> Warning: The sender of this message could not be validated and may not
>>>>>>> be the actual sender.
>>

Re: QUESTION: SUBSTRING implementation

Reply via email to