Re: [PATCH] Add roman support for to_number function

Hunaid Sohail Mon, 09 Sep 2024 05:45:53 -0700

Hi,

On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <[email protected]> wrote:


> A few notes from a quick read of the patch:
>
> * roman_to_int() should have a header comment, notably explaining its
> result convention.  I find it fairly surprising that "0" means an
> error --- I realize that Roman notation can't represent zero, but
> wouldn't it be better to use -1?
>

Noted.


>
> * Do *not* rely on toupper().  There are multiple reasons why not,
> but in this context the worst one is that in Turkish locales it's
> likely to translate "i" to "İ", on which you will fail.  I'd use
> pg_ascii_toupper().
>

Noted.


>
> * I think roman_to_int() is under-commented internally too.
> To start with, where did the magic "15" come from?  And why
> should we have such a test anyway --- what if the format allows
> for trailing stuff after the roman numeral?  (That is, I think
> you need to fix this so that roman_to_int reports how much data
> it ate, instead of assuming that it must read to the end of the
> input.)


MMMDCCCLXXXVIII is the longest valid standard roman numeral (15
characters). I will add appropriate comment on length check.

I am not sure I am able to understand the latter part. Could you please
elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.



> The logic for detecting invalid numeral combinations
> feels a little opaque too.  Do you have a source for the rules
> you're implementing, and if so could you cite it?
>

There are many sources on the internet.
A few sources:
1. https://www.toppr.com/guides/maths/knowing-our-numbers/roman-numerals/
2. https://www.cuemath.com/numbers/roman-numerals/

Note that we are following the standard form:
https://en.wikipedia.org/wiki/Roman_numerals#Standard_form


>
> * This code will get run through pgindent before commit, so you
> might want to revisit whether your comments will still look nice
> afterwards.  There's not a lot of consistency in them about initial
> cap versus lower case or trailing period versus not, too.
>

Noted.


>
> * roman_result could be declared where used, no?
>

Noted.


>
> * I'm not really convinced that we need a new errcode
> ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
> ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
> like that, why did you pick the out-of-sequence value 22P07?
>

22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use
ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.


>
> * Might be good to have a few test cases demonstrating what happens
> when there's other stuff combined with the RN format spec.  Notably,
> even if RN itself won't eat whitespace, there had better be a way
> to write the format string to allow that.
>

The current patch (v3) simply ignores other formats with RN except when RN
is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'MIrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR:  "EEEE" must be the last pattern used
```

However, I think that other formats except FM should be rejected when used
with RN in NUMDesc_prepare function. Any opinions?

If we look into Oracle, they are strict in this regard and don't allow any
other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model

I also found that there is no check when RN is used twice (both in to_char
and to_number) and that results in unexpected output.

```
postgres=# SELECT to_number('CXXIII', 'RNrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_char(12, 'RNrn');
            to_char
--------------------------------
             XII            xii
(1 row)

postgres=# SELECT to_char(12, 'rnrn');
            to_char
--------------------------------
             xii            xii
(1 row)

postgres=# SELECT to_char(12, 'FMrnrn');
 to_char
---------
 xiixii
(1 row)
```


> * Further to Aleksander's point about lack of test coverage for
> the to_char direction, I see from
>
> https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
> that almost none of the existing roman-number code paths are covered
> today.  While it's not strictly within the charter of this patch
> to improve that, maybe we should try to do so --- at the very
> least it'd raise formatting.c's coverage score a few notches.
>
>
I see that you have provided a patch to increase test coverage of to_char
roman format including some other formats. I will try to add more tests for
the to_number roman format.


I will provide the next patch soon.

Regards,
Hunaid Sohail

Re: [PATCH] Add roman support for to_number function

Reply via email to